MultiTHUMOS
- Modality: RGB video (untrimmed sports videos)
- Primary Tasks: Dense multi-label temporal action detection, temporal action segmentation
- Scale: 413 untrimmed videos, 65 action classes, 38,690 multi-label annotations, 30 hours of video
- License: Research use only (non-commercial)
- Access: https://ai.stanford.edu/~syyeung/everymoment.html
Summary
MultiTHUMOS extends the THUMOS-14 temporal action detection benchmark with dense, multi-label, frame-level annotations. THUMOS-14 annotates 20 action classes with a single label per temporal segment; MultiTHUMOS expands to 65 action classes and allows several labels on the same frame, reflecting the fact that people often perform actions concurrently (e.g., "running" while "dribbling" and "looking at ball"). The 2024 update brought renewed attention to the dataset, with standardized multi-label evaluation protocols and new baseline results from transformer-based detectors. MultiTHUMOS is now a widely used benchmark for multi-label temporal action detection, where models must predict overlapping action instances of varying duration.
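To make the idea of dense multi-label frame-level annotation concrete, the sketch below turns a list of labeled time intervals into a multi-hot matrix with one row per frame and one column per class. The tuple layout `(class_idx, start_sec, end_sec)`, the function name, and the frame rate are assumptions chosen for illustration; they are not the official annotation file format.

```python
import numpy as np

def intervals_to_frame_labels(intervals, num_frames, num_classes, fps):
    """Build a (num_frames, num_classes) multi-hot label matrix.

    `intervals` is an iterable of (class_idx, start_sec, end_sec) tuples.
    This layout is hypothetical and used only for this example.
    """
    labels = np.zeros((num_frames, num_classes), dtype=np.uint8)
    for class_idx, start_sec, end_sec in intervals:
        # Convert seconds to frame indices and clip to the video length.
        start = max(0, int(np.floor(start_sec * fps)))
        end = min(num_frames, int(np.ceil(end_sec * fps)))
        labels[start:end, class_idx] = 1
    return labels

# Toy video: 4 seconds at 10 fps, 3 classes, two overlapping actions.
intervals = [(0, 0.0, 2.0), (1, 1.0, 3.0)]
labels = intervals_to_frame_labels(intervals, num_frames=40, num_classes=3, fps=10)

# Frames between 1.0 s and 2.0 s carry both labels at once.
print(labels[15])          # multi-hot row for frame 15
print(labels.sum(axis=0))  # number of positive frames per class
```

Because each class has its own column, overlapping actions need no special handling: a frame simply has more than one entry set to 1.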
Reference Paper
- Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, Li Fei-Fei. "Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos." International Journal of Computer Vision (IJCV), 2018.
Benchmarks & Baselines
- Two-Stream I3D - Per-frame mAP: 29.7 — Piergiovanni & Ryoo, CVPR 2019.
- TempAgg - Per-frame mAP: 34.6 — Piergiovanni & Ryoo, CVPR 2019.
- ActionFormer - Per-frame mAP: 44.7 — Zhang et al., ECCV 2022.
- TriDet - Per-frame mAP: 47.1 — Shi et al., CVPR 2023.
- Primary metric: per-frame mAP, i.e., average precision computed for each class over all frames and then averaged over the 65 classes. The official validation and test splits follow the original THUMOS-14 partitioning.
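The primary metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not the official evaluation code: it assumes frame scores and binary labels are already pooled into two arrays of shape (num_frames, num_classes), it skips classes that have no positive frames, and it breaks score ties by frame order. The function names are hypothetical, and published numbers may rely on different conventions.

```python
import numpy as np

def average_precision(scores, targets):
    """AP for one class: mean of precision@k over the ranks k of positive frames."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = targets[order].astype(np.float64)
    num_pos = hits.sum()
    if num_pos == 0:
        return np.nan  # undefined when the class never occurs
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / num_pos)

def per_frame_map(scores, targets):
    """Mean of per-class APs, ignoring classes without positive frames."""
    aps = [average_precision(scores[:, c], targets[:, c])
           for c in range(targets.shape[1])]
    return float(np.nanmean(aps))

# Toy example: 4 frames, 2 classes.
scores = np.array([[0.9, 0.2],
                   [0.8, 0.7],
                   [0.1, 0.6],
                   [0.3, 0.1]])
targets = np.array([[1, 0],
                    [0, 1],
                    [1, 1],
                    [0, 0]])
print(per_frame_map(scores, targets))  # class APs are 0.75 and 1.0, so mAP is 0.875
```

Each class is scored independently, so a frame with several active labels contributes to several per-class rankings.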
Tooling & Ecosystem
- Official annotations and download: https://ai.stanford.edu/~syyeung/everymoment.html
- Videos sourced from THUMOS-14; requires downloading the original THUMOS-14 videos separately.
- Compatible with ActionFormer and MMAction2 for temporal detection pipelines.
- Pre-extracted I3D and VideoMAE features are shared by the community for feature-based detection methods.
Known Challenges
- Multi-label evaluation requires careful handling of overlapping actions; standard single-label metrics are insufficient.
- Significant class imbalance: common actions (standing, watching) have orders of magnitude more frames than rare ones.
- Temporal boundaries between overlapping actions are often ambiguous, leading to annotation noise.
- Videos are sports footage drawn from THUMOS-14, which limits domain diversity.
- Some action classes are highly correlated (co-occurring), making independent class evaluation misleading.
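Two of the challenges above, class imbalance and label co-occurrence, can be measured directly from a multi-hot frame label matrix. The sketch below is a diagnostic under stated assumptions: `labels` is a hypothetical (num_frames, num_classes) binary array, and the helper names are illustrative rather than part of any released toolkit.

```python
import numpy as np

def label_statistics(labels):
    """Return per-class positive frame counts and the class co-occurrence matrix."""
    labels = labels.astype(np.int64)
    frame_counts = labels.sum(axis=0)
    # Entry (i, j) counts frames where classes i and j are both active;
    # the diagonal equals frame_counts.
    cooccurrence = labels.T @ labels
    return frame_counts, cooccurrence

def imbalance_ratio(frame_counts):
    """Ratio between the most and least frequent classes that occur at all."""
    present = frame_counts[frame_counts > 0]
    if present.size == 0:
        return float("nan")
    return float(present.max() / present.min())

# Toy label matrix: 4 frames, 3 classes.
labels = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [1, 0, 1]])
frame_counts, cooccurrence = label_statistics(labels)
print(frame_counts)                   # positive frames per class
print(cooccurrence)                   # pairwise co-occurrence counts
print(imbalance_ratio(frame_counts))  # most frequent / least frequent
```

Inspecting these statistics before training helps decide whether per-class results should be reported alongside the averaged metric.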
Cite
@article{yeung2018every,
title = {Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos},
author = {Yeung, Serena and Russakovsky, Olga and Jin, Ning and Andriluka, Mykhaylo and Mori, Greg and Fei-Fei, Li},
journal = {International Journal of Computer Vision},
volume = {126},
number = {2--4},
pages = {375--389},
year = {2018}
}