Ego4D

Modality: Egocentric RGB video (mono + stereo), audio, IMU, eye gaze, depth (subset)
Primary Tasks: Episodic memory, social interactions, audio-visual diarization, hand-object interaction, forecasting
Scale: 3,670 hours of video, 74 worldwide locations, 9 countries, 931 camera wearers
License: Ego4D data use agreement (non-commercial research; the agreement must be signed before download credentials are issued)
Access: https://ego4d-data.org/docs/data/

Summary

Ego4D was the largest open egocentric video dataset at the time of its release. It captures unscripted daily-life activities, with audio, IMU, eye gaze, stereo, and synchronized multi-camera recordings available for subsets of the footage. It supports five core benchmark tasks: episodic memory, social interactions, audio-visual diarization, hand-object interaction, and forecasting (future anticipation).

Reference Paper

Kristen Grauman et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." CVPR, 2022.

Benchmarks & Baselines

Episodic Memory (Recall) - Baseline top-1 recall: 15.6% using CLIP-based retrieval; Grauman et al., 2022.
Forecasting (Hand-Object Interaction) - Baseline mAP: 17.2; Grauman et al., 2022.
Leaderboards are hosted on EvalAI, one per benchmark task.
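
The top-1 figure above is a retrieval-style score. As a minimal sketch of how such a number is typically computed, the snippet below counts a query as correct when its ground-truth candidate is among the k highest-scored candidates. This is not the official Ego4D evaluation code, which may apply additional task-specific criteria; the helper name top_k_recall and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def top_k_recall(
    scores: Sequence[Sequence[float]], targets: Sequence[int], k: int = 1
) -> float:
    """Fraction of queries whose target index is among the k highest-scored candidates."""
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not scores:
        return 0.0
    hits = 0
    for row, target in zip(scores, targets):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy example: 3 queries with 3 candidates each (made-up scores).
toy_scores = [
    [0.9, 0.1, 0.0],  # target 0 is ranked first
    [0.2, 0.3, 0.5],  # target 1 is ranked second
    [0.1, 0.2, 0.7],  # target 2 is ranked first
]
toy_targets = [0, 1, 2]
print(top_k_recall(toy_scores, toy_targets, k=1))
print(top_k_recall(toy_scores, toy_targets, k=2))
```

With these toy inputs, two of the three queries are correct at k=1 and all three at k=2. Scores submitted to the leaderboards should always come from the official evaluation scripts.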

Tooling & Ecosystem

Official Ego4d repository and ego4d command-line tool for dataset download, preprocessing, and baseline models.
Point-Owl for multi-task training on Ego4D.
PyTorchVideo integration provides dataloaders and transforms.

Known Challenges

Large storage footprint (~1.4 PB uncompressed); use the official CLI to download only the subsets needed for the target benchmarks.
The license prohibits commercial use and requires preserving participant privacy (faces are blurred in some sequences).
Annotation heterogeneity: each benchmark uses a different metadata format, so rely on the official parsers.
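
The selective-download advice above can be sketched as follows. The snippet only assembles and prints a command line; it does not execute anything. The flag names (--output_directory, --datasets, --benchmarks) and the example values (annotations, clips, FHO) are assumptions recalled from the Ego4D CLI documentation and should be checked against the CLI's own help output before use; build_download_command is a hypothetical helper, not part of any official package.

```python
import shlex
from typing import List, Sequence


def build_download_command(
    output_dir: str, datasets: Sequence[str], benchmarks: Sequence[str] = ()
) -> List[str]:
    """Assemble an argument list for a selective download (nothing is executed)."""
    if not datasets:
        raise ValueError("at least one dataset name is required")
    # Flag names are assumptions; verify them against the CLI help output.
    cmd = ["ego4d", "--output_directory", output_dir, "--datasets", *datasets]
    if benchmarks:
        # Restrict the download to material used by the listed benchmarks.
        cmd += ["--benchmarks", *benchmarks]
    return cmd


cmd = build_download_command("ego4d_data", ["annotations", "clips"], ["FHO"])
print(shlex.join(cmd))
```

Once the flags are confirmed and download credentials are configured, the same list can be passed to subprocess.run(cmd, check=True). Keeping the command as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.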

Cite

@inproceedings{grauman2022ego4d,
  title     = {Ego4D: Around the World in 3,000 Hours of Egocentric Video},
  author    = {Grauman, Kristen and Westbury, Andrew and Byrne, Eugene and others},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2022}
}