ActivityNet Captions
- Modality: RGB video (untrimmed YouTube videos) + natural language descriptions
- Primary Tasks: Dense video captioning, temporal grounding, video paragraph retrieval
- Scale: ~20,000 videos, ~100,000 temporally localized captions, 849 hours
- License: Research use only (ActivityNet terms)
- Access: https://cs.stanford.edu/people/ranber/densevid/
Summary
ActivityNet Captions extends the ActivityNet v1.3 dataset with dense temporal descriptions. Each video is annotated with a series of temporally localized natural language sentences that together describe the video's content. Unlike single-sentence video captioning, dense captioning on this dataset requires a model to detect multiple events, localize their temporal boundaries, and generate a descriptive caption for each event. This makes it a widely used benchmark for dense video captioning and temporal sentence grounding, two tasks that sit between video understanding and natural language processing.
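To make the annotation structure concrete, here is a minimal sketch of reading per-video events. It assumes the layout commonly seen in the released JSON files, where each video ID maps to a `duration`, a list of `timestamps` (`[start, end]` in seconds), and a parallel list of `sentences`. The sample record and the names `Event` and `parse_events` are illustrative and not part of the official tooling.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    start: float  # seconds
    end: float    # seconds
    sentence: str


def parse_events(annotations: Dict[str, dict]) -> Dict[str, List[Event]]:
    """Convert raw annotations into per-video event lists sorted by start time."""
    events = {}
    for video_id, entry in annotations.items():
        pairs = zip(entry["timestamps"], entry["sentences"])
        video_events = [Event(float(s), float(e), text.strip()) for (s, e), text in pairs]
        events[video_id] = sorted(video_events, key=lambda ev: ev.start)
    return events


# Made-up record in the assumed format (not taken from the dataset).
SAMPLE_JSON = """
{
  "v_example": {
    "duration": 60.0,
    "timestamps": [[20.0, 45.0], [0.0, 25.0]],
    "sentences": [" She starts playing a song.", "A woman sits down at a piano."]
  }
}
"""

sample_events = parse_events(json.loads(SAMPLE_JSON))
for ev in sample_events["v_example"]:
    print(f"[{ev.start:5.1f}s - {ev.end:5.1f}s] {ev.sentence}")
```

The two sample events overlap between 20 s and 25 s; the sketch keeps overlapping segments as they are rather than merging them.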
Reference Paper
- Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles. "Dense-Captioning Events in Videos." ICCV, 2017.
Benchmarks & Baselines
- Dense Captioning (METEOR, val): PDVC — 7.48 — Wang et al., ICCV 2021.
- Dense Captioning (CIDEr, val): Vid2Seq — 30.1 — Yang et al., CVPR 2023.
- Temporal Grounding (R@1, IoU=0.5): 2D-TAN — 44.5% — Zhang et al., AAAI 2020.
- Results are reported on the official validation splits (val_1 and val_2). Captioning metrics include METEOR, CIDEr, and BLEU; grounding is scored with Recall@K at fixed temporal IoU thresholds (for example, R@1 at IoU=0.5).
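The grounding metric above can be sketched in a few lines. This is a minimal illustration of temporal IoU and R@1, not the official evaluation code. It assumes one top-ranked predicted segment per sentence query and counts a hit when IoU is greater than or equal to the threshold. The function names and the sample segments are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment], threshold: float = 0.5) -> float:
    """Fraction of queries whose top-ranked segment reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one prediction per query")
    if not gts:
        return 0.0
    # Assumption: a hit requires IoU >= threshold.
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Hypothetical predictions and ground-truth segments for three sentence queries.
preds = [(0.0, 10.0), (5.0, 15.0), (30.0, 40.0)]
gts = [(0.0, 8.0), (10.0, 20.0), (0.0, 10.0)]
r1 = recall_at_1(preds, gts, threshold=0.5)
print([round(temporal_iou(p, g), 3) for p, g in zip(preds, gts)])
print(round(r1, 3))
```

Only the first query clears the 0.5 threshold here (IoU 0.8), so R@1 is one in three. Published numbers should still come from the official scripts, which may handle edge cases differently.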
Tooling & Ecosystem
- Official project page provides annotations and evaluation tools.
- ActivityNet Captions evaluation server for challenge submissions.
- densevid_eval — official evaluation scripts.
- Video features (C3D, TSN) are commonly pre-extracted and shared by the community.
Known Challenges
- Dense captioning requires jointly solving temporal localization and language generation, making it significantly harder than either task alone.
- Caption quality varies; some annotations are generic and say little about the specific event they describe.
- Evaluation metrics (METEOR, CIDEr) score word and n-gram matches against reference captions, so they may not fully capture caption quality or temporal precision.
- Videos are hosted on YouTube and some become unavailable over time, so a fresh download of the dataset may be incomplete.
- Long videos with many events create computational challenges for end-to-end models.
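Because video availability changes over time, it can help to measure how much of the annotated set is actually present locally before training or evaluating. The sketch below is one way to do that. It assumes local files are named after the annotation key plus an extension (for example `v_abc.mp4`). The helper name and the file names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def split_by_availability(
    annotated_ids: Iterable[str], local_files: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (available, missing) annotated video IDs, given local file names."""
    # Strip the extension so "v_abc.mp4" matches the annotation key "v_abc".
    local_stems: Set[str] = {name.rsplit(".", 1)[0] for name in local_files}
    available: List[str] = []
    missing: List[str] = []
    for video_id in sorted(annotated_ids):
        (available if video_id in local_stems else missing).append(video_id)
    return available, missing


# Hypothetical annotation keys and directory listing.
annotated = ["v_aaa", "v_bbb", "v_ccc"]
local = ["v_aaa.mp4", "v_ccc.mkv", "notes.txt"]

available, missing = split_by_availability(annotated, local)
coverage = len(available) / len(annotated)
print(f"available: {available}")
print(f"missing:   {missing}")
print(f"coverage:  {coverage:.1%}")
```

Reporting the coverage figure alongside results makes comparisons fairer when different groups hold different subsets of the videos.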