MultiTHUMOS

  • Modality: RGB video (untrimmed sports videos)
  • Primary Tasks: Dense multi-label temporal action detection, temporal action segmentation
  • Scale: 413 untrimmed videos, 65 action classes, 38,690 multi-label annotations, 30 hours of video
  • License: Research use only (non-commercial)
  • Access: https://ai.stanford.edu/~syyeung/everymoment.html

Summary

MultiTHUMOS extends the classic THUMOS-14 temporal action detection benchmark with dense, multi-label, frame-level annotations. Where THUMOS-14 annotates 20 action classes with a single label per temporal segment, MultiTHUMOS covers 65 action classes and permits multiple simultaneous labels per frame, reflecting the fact that people often perform several actions at once (e.g., "running" while "dribbling" and "looking at ball"). The 2024 update brought renewed attention to the dataset, with standardized multi-label evaluation protocols and new baseline results from transformer-based detectors. This has established MultiTHUMOS as the go-to benchmark for multi-label temporal action detection, where models must predict overlapping action instances of varying durations.
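The segment-to-frame relationship described above can be sketched in a few lines: overlapping `(label, start, end)` segments are rasterized into per-frame multi-label sets. The class names, segment times, and 10 fps rate below are illustrative assumptions, not values taken from the dataset files.

```python
# Sketch: rasterize overlapping (label, start_s, end_s) segment annotations
# into per-frame multi-label sets. Class names, times, and the 10 fps rate
# are illustrative, not taken from the MultiTHUMOS files.
FPS = 10

def segments_to_frame_labels(segments, num_frames, fps=FPS):
    """Return a list of label sets, one per frame index."""
    frames = [set() for _ in range(num_frames)]
    for label, start_s, end_s in segments:
        first = int(start_s * fps)
        last = min(int(end_s * fps), num_frames - 1)
        for t in range(first, last + 1):
            frames[t].add(label)
    return frames

segments = [
    ("Run", 0.0, 2.0),
    ("BasketballDribble", 0.5, 1.5),
]
labels = segments_to_frame_labels(segments, num_frames=25)
# Frame 10 (t = 1.0 s) carries both actions simultaneously.
print(labels[10])  # {'Run', 'BasketballDribble'} (set order may vary)
```

A single-label benchmark would have to pick one of the two actions at frame 10; the multi-label formulation keeps both.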

Reference Paper

  • Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, Li Fei-Fei. "Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos." International Journal of Computer Vision (IJCV), 2018.

Benchmarks & Baselines

  • Two-Stream I3D: 29.7 per-frame mAP (Piergiovanni & Ryoo, CVPR 2019)
  • TempAgg: 34.6 per-frame mAP (Piergiovanni & Ryoo, CVPR 2019)
  • ActionFormer: 44.7 per-frame mAP (Zhang et al., ECCV 2022)
  • TriDet: 47.1 per-frame mAP (Shi et al., CVPR 2023)
  • Primary metric: per-frame mAP averaged over all 65 classes; official validation and test splits follow the original THUMOS-14 partitioning.
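The per-frame mAP metric can be sketched directly: for each class, frames are ranked by predicted score, average precision is computed over that ranking, and the per-class APs are averaged. The pure-Python sketch below uses toy scores and is not the official evaluation script.

```python
# Sketch of per-frame mAP: AP per class over frame-level scores,
# then the unweighted mean across classes. Toy data, not the
# official MultiTHUMOS evaluation code.
def average_precision(scores, labels):
    """AP for one class: per-frame scores and binary ground-truth labels."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def per_frame_map(score_matrix, label_matrix):
    """Mean AP over classes; rows are classes, columns are frames."""
    aps = [average_precision(s, l) for s, l in zip(score_matrix, label_matrix)]
    return sum(aps) / len(aps)

scores = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.4]]
labels = [[1, 0, 1], [0, 1, 0]]
print(per_frame_map(scores, labels))  # 1.0: both classes ranked perfectly
```

Because the mean is unweighted over all 65 classes, rare classes contribute as much as common ones, which interacts with the class imbalance noted under Known Challenges.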

Tooling & Ecosystem

  • Official annotations and download: https://ai.stanford.edu/~syyeung/everymoment.html
  • Videos sourced from THUMOS-14; requires downloading the original THUMOS-14 videos separately.
  • Compatible with ActionFormer and MMAction2 for temporal detection pipelines.
  • Pre-extracted I3D and VideoMAE features are shared by the community for feature-based detection methods.

Known Challenges

  • Multi-label evaluation requires careful handling of overlapping actions; standard single-label metrics are insufficient.
  • Significant class imbalance: common actions (standing, watching) have orders of magnitude more frames than rare ones.
  • Temporal boundaries between overlapping actions are often ambiguous, leading to annotation noise.
  • Videos are from sports broadcasts (primarily THUMOS-14 sports), limiting domain diversity.
  • Some action classes are highly correlated (co-occurring), making independent class evaluation misleading.
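One common mitigation for the class imbalance above is inverse-frequency weighting of classes in the training loss. The sketch below uses toy frame counts; this weighting scheme is one common choice, not something prescribed by the dataset.

```python
# Sketch: inverse-frequency class weights from per-frame positive counts,
# normalized so the mean weight is 1. Counts are toy values, not
# actual MultiTHUMOS statistics.
def inverse_frequency_weights(frame_counts, smoothing=1.0):
    raw = [1.0 / (c + smoothing) for c in frame_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

# "Stand" dominates; the rare "Dunk" class receives a larger weight.
counts = {"Stand": 90000, "Run": 20000, "Dunk": 400}
weights = inverse_frequency_weights(list(counts.values()))
print(dict(zip(counts, (round(w, 3) for w in weights))))
```

These weights can be passed as per-class positive weights to a multi-label binary cross-entropy loss; the smoothing term keeps weights finite for classes with zero positives.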

Cite

@article{yeung2018every,
  title   = {Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos},
  author  = {Yeung, Serena and Russakovsky, Olga and Jin, Ning and Andriluka, Mykhaylo and Mori, Greg and Fei-Fei, Li},
  journal = {International Journal of Computer Vision},
  volume  = {126},
  number  = {2--4},
  pages   = {375--389},
  year    = {2018}
}