Core Concepts
Ego-Exo4D is a large-scale, diverse dataset of synchronized first-person (egocentric) and third-person (exocentric) video of skilled human activities, accompanied by rich multimodal data and annotations, designed to enable new research on understanding skilled human behavior from multiple viewpoints.
Abstract
Ego-Exo4D is a novel dataset that captures skilled human activities from both first-person (egocentric) and third-person (exocentric) perspectives. The dataset was collected by a consortium of 15 research institutions and features 740 participants performing 43 activities across 8 domains (e.g., cooking, sports, music, healthcare) in 123 unique real-world scenes and 13 cities worldwide.
The dataset offers 1,286 hours of synchronized ego and exo video, along with rich multimodal data including audio, eye gaze, 3D point clouds, camera poses, and IMU readings. It also provides three types of language annotations: expert commentary critiquing the performance, first-person narrations by the participants, and third-person action descriptions.
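To make that composition concrete, the sketch below shows how one synchronized capture (a "take") might be represented in code. This is a minimal illustration assuming a flat record per capture; the field names and structure are assumptions for exposition, not the official Ego-Exo4D schema.

```python
from dataclasses import dataclass, field

@dataclass
class Take:
    """One synchronized capture of a skilled activity.

    Illustrative only: these field names are assumptions,
    not the official Ego-Exo4D file layout.
    """
    take_id: str                    # unique identifier for the capture
    domain: str                     # e.g. "cooking", "music", "healthcare"
    activity: str                   # e.g. "bike repair"
    ego_video: str                  # path to the first-person video
    exo_videos: list[str] = field(default_factory=list)  # third-person views
    # Multimodal streams captured alongside the video
    audio: str | None = None
    eye_gaze: str | None = None     # camera wearer's gaze track
    point_cloud: str | None = None  # 3D reconstruction of the scene
    camera_poses: str | None = None # per-frame poses for all cameras
    imu: str | None = None          # inertial measurement unit readings
    # The three language annotation types described above
    expert_commentary: list[str] = field(default_factory=list)
    participant_narrations: list[str] = field(default_factory=list)
    action_descriptions: list[str] = field(default_factory=list)
```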
The key insights and highlights of the dataset are:
Multimodal and multiview capture: Ego-Exo4D provides time-synchronized first-person and third-person video, enabling research on relating and translating between these complementary viewpoints.
Diverse skilled activities: The dataset covers a wide range of physical and procedural skilled activities performed by real-world experts, from sports and dance to cooking and bike repair.
Extensive annotations: In addition to the video and multimodal data, Ego-Exo4D offers rich annotations, including fine-grained activity keysteps, procedural dependencies, proficiency ratings, and 3D body/hand pose.
Benchmark tasks: The dataset introduces four families of benchmark tasks, namely ego-exo relation, ego(-exo) recognition, ego(-exo) proficiency estimation, and ego pose estimation, to push the frontier of first-person video understanding of skilled human activity.
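As a concrete example of what the time-synchronized capture enables, here is a minimal sketch of pairing ego and exo frames on a shared timeline, the kind of cross-view correspondence the ego-exo relation tasks build on. The function name and frame rates are illustrative assumptions, not part of the Ego-Exo4D tooling.

```python
def paired_frame_indices(num_ego_frames: int,
                         ego_fps: float,
                         exo_fps: float) -> list[tuple[int, int]]:
    """For each ego frame, find the exo frame nearest in capture time.

    Assumes both streams start at t=0 on a shared clock, which is what
    time-synchronized capture provides. In practice you would also clamp
    the exo index to the exo stream's length.
    """
    pairs = []
    for ego_idx in range(num_ego_frames):
        t = ego_idx / ego_fps           # timestamp of this ego frame
        exo_idx = round(t * exo_fps)    # nearest exo frame on the shared clock
        pairs.append((ego_idx, exo_idx))
    return pairs

# e.g. 10 ego frames at 30 fps aligned against a 60 fps exo stream
print(paired_frame_indices(10, ego_fps=30.0, exo_fps=60.0))
```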
Ego-Exo4D aims to fuel new research in areas such as egocentric perception, cross-view learning, multimodal activity understanding, and skill assessment. The dataset and all resources have been open-sourced to support the broader research community.
Stats
"The dataset contains 1,286 hours of synchronized ego and exo video in total."
"There are 740 participants performing 43 activities across 8 domains in 123 unique real-world scenes."
"The dataset provides 432K sentences of atomic action descriptions, 117,812 pieces of expert commentary, and first-person narrations for 10% of the videos."
"Ego-Exo4D offers 14M frames of 3D ground truth and pseudo-ground truth annotations for body and hand pose estimation."
Quotes
"Advances in AI understanding of human skill could facilitate many applications. In augmented reality (AR), a person wearing smart glasses could quickly pick up new skills with a virtual AI coach that provides real-time guidance."
"We contend that both the egocentric and exocentric viewpoints are critical for capturing human skill. Firstly, the two viewpoints are synergistic. The first-person (ego) perspective captures the details of close-by hand-object interactions and the camera wearer's attention, whereas the third-person (exo) perspective captures the full body pose and surrounding environment context."
"Ego-Exo4D is the community's first diverse, large-scale multimodal multiview video resource. We have open sourced all the data, annotations, camera rig protocol, and benchmarks. With this release, we aim to fuel new research in ego-exo, multimodal activity, and beyond."