Roland Memisevic

Watch Roland's talk at the 3rd Research and Applied AI Summit

courtesy of

Moritz Müller-Freitag

Learning about the world through video

At TwentyBN, we build AI systems that enable a human-like visual understanding of the world. Today, we are releasing two large-scale video datasets (256,591 labeled videos) to teach machines visual common sense. The first dataset allows machines to develop a fine-grained understanding of basic actions that occur in the physical world. The second dataset of dynamic hand gestures enables robust cognition models for human-computer interaction.

Video is becoming ubiquitous

Video is playing an increasingly important role in our lives. As consumers, we collectively spend hundreds of millions of hours every day watching and sharing videos on services like YouTube, Facebook or Snapchat. When we are not busy gobbling up video on social media, we produce more of it with our smartphones, GoPro cameras and (soon) AR goggles. As a growing fraction of the planet’s population is documenting their lives in video format, we are transitioning from starring in our own magazine (the still image era) to starring in our own reality TV show.

All that is arguably just the beginning. The next few years will see a proliferation of connected devices, ranging from smart, always-on home cameras to autonomous vehicles. Many of these devices will rely on a camera as the primary sensory input for understanding and navigating the world. As the technological evolution marches on, video intelligence will be crucial. There are not enough eyeballs in the world to process all video data, and human-powered visual understanding does not scale. What we need is a software layer that can analyze and extract meaning from video. That requires learning algorithms that understand the physical world and the myriad actions carried out by the actors within it.

Video is the next frontier in computer vision

Deep Learning has made historic progress in recent years by producing systems that rival — and in some cases exceed — human performance in tasks such as recognizing objects in still images. Despite this progress, enabling computers to understand both the spatial and temporal aspects of video remains an unsolved problem. The reason is sheer complexity. While a photo is just one static image, a video shows narrative in motion. Video is time-consuming to annotate manually, and it is computationally expensive to store and process.

The main obstacle that prevents neural networks from reasoning more fundamentally about complex scenes and situations is their lack of common sense knowledge about the physical world. Video data contains a wealth of fine-grained information about the world as it shows how objects behave by virtue of their properties. For example, videos implicitly encode physical information like three-dimensional geometry, material properties, object permanence, affordance or gravity. While we humans intuitively grasp these concepts, a detailed understanding of the physical world is still largely missing from current applications in artificial intelligence (AI) and robotics.

Existing computer vision systems produce (at best) descriptions of the world that are not robust. Here are a couple of examples produced by a model that generates natural language descriptions of images (Source: Karpathy & Fei-Fei).

At TwentyBN, we believe that intelligent software for handling video is a prerequisite for enabling the most promising applications of AI in the real world. One field of applications that we are excited about is health care, and in particular elderly care. In elderly care, changes in the activities of daily living (ADLs) often precede physiological changes and can predict poor clinical outcomes. Imagine how much we could improve the care for our aged loved ones if it were possible to install a handful of smart camera devices in fixed locations to monitor changes in the activities of seniors, aid their memory, and ultimately improve their health.

To enable applications like these, we need a technology step change. We need systems that can understand the context and actions occurring in visual scenes. State-of-the-art image recognition just won’t cut it. This is because life is more than a sequence of snapshots, and perceiving the world is more than recognizing cats and dogs in images. It’s about what is actually happening in the physical world as time unfolds. It’s about verbs, not just nouns.

A novel approach to video understanding

One of the most important rate-limiting factors for advancing video understanding is the lack of large and diverse real-world video datasets. Many video datasets that have been published to date suffer from a number of shortcomings: they are often weakly labeled, lack variety, or have undergone a high degree of editing and post-processing. A few notable exceptions, like DeepMind’s recently released Kinetics dataset, try to alleviate this by focusing on shorter clips, but since they show high-level human activities taken from YouTube videos, they fall short of representing the simplest physical object interactions that will be needed for modeling visual common sense.

At TwentyBN, we have spent the past year building a foundational data layer for the understanding of physical actions. Our approach is based on a single, rather straightforward idea: Why not leverage the amazingly precise and cultivated motor skills that human beings possess to generate fine-grained, complex and varied data at scale? After all, the vast majority of motion patterns that we observe day-to-day are actually caused by other humans.

To generate the complex, labeled videos that neural networks need to learn, we use what we call “crowd acting”: We instruct crowd workers to record short video clips, based on carefully predefined and highly specific descriptions: “Pushing something until it falls off the table”, “Moving object A closer to object B”, or “Sliding two fingers of your left hand up”. While we collect data on many different kinds of human action, we naturally stress dexterous manipulation of objects using one or both hands, simply because our hands are best at generating the highly controlled, complex motion patterns needed for training networks. Instead of painstakingly labeling existing video data, crowd acting allows us to generate large amounts of densely labeled, meaningful video segments at low cost.

Today, we are excited to announce the release of two substantial snapshots from our data collection campaign: A database of human-object interactions (Something-something) and the world’s largest video dataset for classifying dynamic hand gestures (Jester). The two datasets are “snapshots” because data collection is an ongoing process. In total, we are releasing 256,591 labeled video clips for supervised training of deep learning models. Both datasets are made available under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0) and are freely available for academic use.

1. The “Something-something” dataset

This snapshot contains 108,499 annotated video clips, each between 2 and 6 seconds in duration. The videos show objects and the actions performed on them across 175 classes. The captions are textual descriptions based on templates, such as “Dropping something into something”. The templates contain “something” slots that serve as placeholders for objects. This gives the mapping between text and video additional structure, which helps the network learn.
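
The template mechanism can be illustrated with a minimal sketch. It assumes that every slot is written as the literal lowercase word “something” and that the class label is simply the template itself. The function names `fill_template` and `template_to_label` are hypothetical and are not part of any TwentyBN tooling.

```python
import re

# Placeholder word used for object slots in the caption templates (assumption:
# slots are always the lowercase word "something").
PLACEHOLDER = "something"
SLOT_PATTERN = re.compile(rf"\b{PLACEHOLDER}\b")

def fill_template(template, objects):
    """Replace each "something" slot, left to right, with a concrete object."""
    slots = len(SLOT_PATTERN.findall(template))
    if slots != len(objects):
        raise ValueError(f"template has {slots} slot(s), got {len(objects)} object(s)")
    remaining = iter(objects)
    return SLOT_PATTERN.sub(lambda _match: next(remaining), template)

def template_to_label(templates):
    """Map each template to an integer class index for supervised training."""
    return {template: index for index, template in enumerate(sorted(templates))}

templates = [
    "Dropping something into something",
    "Pushing something until it falls off the table",
]
caption = fill_template(templates[0], ["a coin", "a cup"])
labels = template_to_label(templates)
print(caption)  # Dropping a coin into a cup
```

In this sketch, the filled-in caption describes one specific clip, while the template is the class the network is asked to predict. Many different object combinations therefore map to the same label.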

The goal of this dataset is not only to detect or track objects in videos but to decipher the behavior of human actors as well as the direct and indirect manipulations of the objects they interact with. Predicting the textual labels from the videos therefore requires strong visual features that are capable of representing a wealth of physical properties of the objects and the world. This includes information about properties like spatial relations and material properties.

2. The “Jester” dataset

This snapshot contains 148,092 annotated video clips, each about 3 seconds long. The videos cover 25 classes of human hand gestures as well as two “no gesture” classes to help the network distinguish between specific gestures and unknown hand movements. The videos show human actors performing generic hand gestures in front of a webcam, such as “Swiping Left/Right,” “Sliding Two Fingers Up/Down,” or “Rolling Hand Forward/Backward.” Predicting these textual labels from the videos requires a network that is capable of grasping such concepts as the degrees of freedom in three-dimensional space (surging, swaying, heaving, etc.).


Traditional gesture recognition systems require special hardware like stereo cameras or depth sensors such as time-of-flight cameras. Using our Jester dataset, we were able to train a neural network that can detect and classify all 25 gestures from raw RGB input with a test accuracy of 82%. The system runs in real time on a variety of embedded platforms using video input from a webcam.

The key characteristics of both datasets are:

  • Supervised learning: In contrast to other methods that seek to acquire common sense through the use of predictive unsupervised learning, we phrase the task as a supervised learning problem. This makes the representation learning task more tractable and better defined.
  • Dense captioning: The labels describe video content that is restricted to a short time interval. This ensures there is a tight synchronization between the video content and the corresponding caption.
  • Crowd-acted videos: In contrast to other academic datasets that source and annotate video clips from YouTube, we created our datasets using “crowd acting”. Our proprietary crowd acting platform allows us to ask crowd workers to provide videos given caption templates instead of the other way around. This facilitates the generation of labeled recordings rather than just the labeling of existing videos.
  • Human-focused: With the exception of motion “textures” like ocean waves or leaves in the wind, most complex motion patterns that we ever see are human-caused. Our datasets are human-centered so that they contain the complex spatio-temporal patterns that encode features of articulation, degrees of freedom, etc.
  • Natural video scenes: Our videos were captured with many different devices and varying zoom factors. The datasets feature scenes with natural lighting, partial occlusions, motion blur and background noise. This helps models trained on the datasets transfer to real-world use cases with minimal domain shift.

The video clips in our datasets are challenging because they capture the messiness of the real world. To give you a flavor, take a look at this video clip from the Jester dataset, showing a person performing a hand gesture:

[Video clip: “Swiping left”]

While the hand gesture is visible to the human eye, it is difficult to recognize for a computer because the video footage contains sub-optimal lighting conditions and background noise (cat walking through the scene). Training on Jester forces the neural network to learn the relevant visual cues, or hierarchical features, to separate the signal (hand motion) from the noise (background motion). Basic motion detection will not be sufficient.
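
To see why basic motion detection falls short, consider a toy sketch of naive frame differencing. The 4×4 grayscale frames, the threshold, and the function names below are made up for illustration and do not come from the Jester data. The detector fires on any pixel change, so it cannot tell a moving hand from a cat crossing the background.

```python
def frame_difference(prev, curr):
    """Mean absolute per-pixel change between two grayscale frames."""
    total = sum(
        abs(c - p)
        for prev_row, curr_row in zip(prev, curr)
        for p, c in zip(prev_row, curr_row)
    )
    return total / (len(prev) * len(prev[0]))

def motion_detected(frames, threshold=0.05):
    """Flag a clip if any pair of consecutive frames differs enough."""
    return any(frame_difference(a, b) > threshold for a, b in zip(frames, frames[1:]))

def frame_with_blob(row, col, size=4):
    """Build a black frame with a single bright pixel (a stand-in for an object)."""
    frame = [[0.0] * size for _ in range(size)]
    frame[row][col] = 1.0
    return frame

# A "hand" moving in the upper left and a "cat" moving in the lower right.
hand_clip = [frame_with_blob(1, 0), frame_with_blob(1, 1)]
cat_clip = [frame_with_blob(3, 3), frame_with_blob(3, 2)]
still_clip = [frame_with_blob(1, 0), frame_with_blob(1, 0)]

print(motion_detected(hand_clip), motion_detected(cat_clip), motion_detected(still_clip))
# True True False
```

Both moving clips trigger the detector in exactly the same way. Telling them apart requires features that describe what is moving and how, which is what a network trained on labeled gesture clips has to learn.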

The practical use of visual common sense

How do we go from an understanding of physical concepts to offering practical, real-world solutions? We believe the answer is to be found in a technical concept called transfer learning.

We humans are good at thinking by analogy. Taking an idea from one domain and applying it to another is “the fuel and fire of our thinking,” as Douglas Hofstadter puts it. In AI, a step towards reasoning by analogy is transfer learning. With transfer learning, we can take a neural network trained on Something-something and Jester, and transfer its capabilities to contribute to specific business applications. Specifically, networks that have internalized representations of how objects interact in the real world can transfer this intrinsic knowledge to solve problems of higher order complexity that are predicated on these fundamental concepts.
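
One common form of transfer learning keeps a pretrained network frozen and trains only a small new “head” on the target task. The sketch below illustrates that idea in plain Python under loud assumptions. The function `pretrained_features` is a hypothetical stand-in for a network trained on Something-something or Jester, the clips are made-up sequences of brightness values, and the head is a simple logistic regression. It is not TwentyBN’s model or training procedure.

```python
import math

def pretrained_features(clip):
    """Hypothetical stand-in for a frozen network pretrained on video data.

    A real system would reuse learned layers; this toy version returns two
    fixed motion statistics so the example runs without any dependencies.
    """
    diffs = [abs(b - a) for a, b in zip(clip, clip[1:])]
    return [sum(diffs) / len(diffs), max(diffs)]

def train_head(examples, epochs=200, lr=0.5):
    """Train only a new logistic-regression head; the features stay frozen."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for clip, target in examples:
            feats = pretrained_features(clip)
            logit = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-logit))
            error = pred - target  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, clip):
    """Return True if the head assigns the clip to the positive class."""
    feats = pretrained_features(clip)
    return sum(w * f for w, f in zip(weights, feats)) + bias > 0.0

# Tiny labeled set for the new task: 0 = "no gesture", 1 = "gesture".
# Each clip is a made-up sequence of per-frame brightness values.
target_task = [
    ([0.5, 0.5, 0.5, 0.5], 0),
    ([0.2, 0.2, 0.3, 0.2], 0),
    ([0.0, 1.0, 0.0, 1.0], 1),
    ([0.9, 0.1, 0.8, 0.2], 1),
]
weights, bias = train_head(target_task)
```

Because the feature extractor is reused as-is, the new task needs only a handful of labeled examples to fit the head. That is the practical appeal of transfer learning when labeled data is scarce.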

The real world is messy and contains an infinite number of scenarios. In the absence of sufficient labeled data, building robust computer vision systems with deep learning is nearly impossible. Luckily, transfer learning has been shown to achieve astounding results on a wide array of image-based vision tasks (see here, here and here). We believe that similar breakthroughs are imminent for deep video understanding. However, the prerequisite for leveraging transfer learning in the video domain is the availability of high-quality, labeled video data that allows neural networks to model visual common sense. This is the mission we signed up for at TwentyBN. Data collection at our company spans a spectrum from hard to recognize, but solvable (with existing proof points, like hand gesture recognition), to very hard and still not solvable. The ultimate endpoint of this spectrum is general AI.

How to get the data and where to benchmark your results

The two datasets are available for download on our website. You can find more information about the datasets and the technical specifics of our research in this technical report. If you want to benchmark the accuracy that your very own model achieves on the datasets, you will be able to upload your results to our website to be ranked in a leaderboard.

Our goal in releasing these two datasets is to catalyze the development of machines that can perceive the world like humans. Our work builds on the foundations of researchers past and present. We are committed to giving back for the benefit of this vibrant community.

.   .   .

Thanks to the whole team for their awesome work. Thanks to Nathan Benaich for proofreading this article.