OCTOBER 14, 2021

Egocentric Live 4D Perception (Ego4D)

Ego4D is a collaborative project, seeking to advance the fundamental AI research needed for multimodal machine perception for first-person video understanding.


Today’s perception systems excel at detecting and labeling objects in individual Internet photos or videos. In contrast, first-person or “egocentric” perception requires understanding ongoing sensory data (images, video, audio, and motion) as it streams to a person’s wearable, head-mounted device. It demands the integration of this multimodal data with 3D understanding of physical environments, social contexts, and human-object interactions. Furthermore, whereas users today actively take their photos—framing them intentionally to convey a message or capture a memory—images collected by wearable cameras lack this curation, presenting a much greater challenge for algorithms trying to understand them. Motivated by these contrasts, Facebook AI brought together 13 universities and academic research organizations from around the world to embark on an ambitious, long-term project, called “Egocentric Live 4D Perception” (Ego4D). The project is designed to spur egocentric research outside and inside of the company.


In collaboration with these universities and Facebook Reality Labs Research (FRL), Facebook AI is releasing five AI benchmarks that were collectively developed for academics, researchers, and developers to leverage in their work to advance the fundamental AI technology needed to build more useful AI assistants and home robots of the future.

The benchmarks include:


Episodic Memory: Given an egocentric video and a query, the Episodic Memory task requires localizing where the answer can be seen within the user's past video


Hands and Objects: The Hands and Objects task captures how the camera wearer changes the state of an object by using or manipulating it


Audio-Visual Diarization: The Audio-Visual Diarization benchmark is composed of four tasks: 1) localizing and tracking of speakers in a visual field of view, 2) active speaker detection, 3) diarization of speaker activity, 4) transcription of speech content


Social Interactions: The Social benchmark focuses on multimodal understanding of conversational interactions


Forecasting: The Forecasting benchmark includes four tasks: 1) locomotion prediction, 2) hand movement prediction, 3) short-term object interaction anticipation, and 4) long-term action anticipation
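To make the Episodic Memory task above concrete, here is a minimal, hypothetical sketch (the class and function names are illustrative, not part of any released Ego4D API): a query such as "Where did I last see my keys?" is answered with a time window in the wearer's past video, and a predicted window can be compared against an annotated ground-truth window with temporal intersection-over-union, a common scoring metric for localization tasks.

```python
from dataclasses import dataclass

@dataclass
class TemporalWindow:
    """A [start, end] interval (in seconds) within an egocentric video."""
    start: float
    end: float

def temporal_iou(a: TemporalWindow, b: TemporalWindow) -> float:
    """Intersection-over-union of two time windows; 0.0 if they do not overlap."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical example: a model predicts the answer appears at 12s-18s,
# while annotators marked 14s-20s. Overlap is 4s, union is 8s.
pred = TemporalWindow(start=12.0, end=18.0)
gt = TemporalWindow(start=14.0, end=20.0)
print(temporal_iou(pred, gt))  # -> 0.5
```

A higher temporal IoU means the predicted window more tightly localizes where the answer can actually be seen.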

The Ego4D Dataset

Progress in this field requires large volumes of first-person data with the scale, diversity, and complexity necessary to be useful in the real world. As part of Ego4D, our university partners collected thousands of hours of unscripted, first-person video data from more than 700 research participants capturing hundreds of daily-life scenarios around the world. The participants vary in age, gender, and demographic background, span nine countries, and recorded using off-the-shelf, head-mounted camera devices. This data will be available to the public research community later this year.

As a supplement to this work, researchers from Facebook Reality Labs used Vuzix Blade glasses to collect an additional 400 hours of fully consented, first-person video data in staged environments in our research labs.

Consortium Members

  • Carnegie Mellon University (CMU) and CMU-Africa

  • Georgia Institute of Technology

  • Indiana University

  • Massachusetts Institute of Technology

  • University of Minnesota

  • University of Pennsylvania

  • University of Catania

  • University of Bristol

  • University of Tokyo

  • International Institute of Information Technology, Hyderabad

  • King Abdullah University of Science and Technology

  • National University of Singapore

  • University of Los Andes