Charades¶
- Modality: RGB video (crowdsourced indoor scenes), textual scripts, temporal annotations
- Primary Tasks: Multi-label action recognition, temporal localization, video question answering
- Scale: 9,848 videos, 157 action classes, 66,500 temporal annotations
- License: Creative Commons Attribution-NonCommercial 4.0
- Access: https://allenai.org/plato/charades/
Summary¶
Charades captures everyday indoor activities scripted and recorded by crowd workers in their homes. The dataset's multi-label temporal annotations, natural co-occurring actions, and narrative scripts make it a popular benchmark for compositional action recognition and temporal reasoning.
Reference Paper¶
- Gunnar A. Sigurdsson et al. "Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding." ECCV, 2016.
Benchmarks & Baselines¶
- Two-Stream I3D - mAP: 32.9 on Charades; Carreira & Zisserman, CVPR 2017.
- SlowFast Networks - mAP: 42.1; Feichtenhofer et al., ICCV 2019.
- Evaluation metric: mean Average Precision (mAP), averaged over the 157 classes on the provided validation/test splits.
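The mAP metric above averages per-class Average Precision across all classes, treating each class as an independent binary ranking problem. A minimal NumPy sketch of this computation (function names are illustrative, not the official Charades evaluation script):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: scores (N,), binary labels (N,) over N videos."""
    labels = np.asarray(labels, dtype=float)
    if labels.sum() == 0:
        return 0.0  # class absent from this split
    order = np.argsort(-scores)          # rank videos by descending score
    labels = labels[order]
    cum_pos = np.cumsum(labels)          # positives seen up to each rank
    precision = cum_pos / (np.arange(len(labels)) + 1)
    # Average precision at the rank of each positive example.
    return float((precision * labels).sum() / labels.sum())

def mean_average_precision(score_matrix, label_matrix):
    """mAP: mean of per-class AP over (N_videos, N_classes) arrays."""
    num_classes = score_matrix.shape[1]
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(num_classes)]
    return float(np.mean(aps))
```

A perfectly ranked class yields AP = 1.0; a positive pushed below a negative lowers the precision term at that positive's rank.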
Tooling & Ecosystem¶
- Official Charades evaluation code, including mAP scripts.
- AVA-Charades merges Charades with AVA annotations for spatio-temporal localization.
- Integration with PyTorchVideo and MMAction2.
Known Challenges¶
- Videos contain multiple actions simultaneously; multi-label training is essential.
- Long-tail distribution with rare actions; consider focal loss or class-balanced sampling.
- Lighting varies across homes, and some actions are visually subtle (e.g., "drinking" vs. "chewing") and require temporal context to distinguish.