Something-Something V2

Summary

Something-Something V2 contains short clips of human-object interactions, collected by asking crowd workers to act out templated instructions (e.g. "Putting something into something"). Its fine-grained verb-noun combinations drive research in temporal reasoning, compositional action recognition, and grounding natural-language descriptions to video.

Reference Paper

  • R. Goyal et al. "The 'Something Something' Video Database for Learning and Evaluating Visual Common Sense." ICCV, 2017.

Benchmarks & Baselines

  • Temporal Shift Module (TSM) - Top-1: 63.4%, Top-5: 89.5%; Lin et al., ICCV 2019.
  • ViViT-L/16x2 - Top-1: 68.8%, Top-5: 90.6%; Arnab et al., ICCV 2021.
  • The standard split includes train, validation, and test sets; test labels are withheld, and predictions are submitted to the leaderboard via EvalAI.
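Top-1 and Top-5 accuracy as reported above can be computed from model logits with a few lines of NumPy; this is a minimal sketch (the function name and signature are illustrative, not from any official evaluation kit):

```python
import numpy as np

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Fraction of samples whose true label appears in the top-k scored classes.

    logits: (N, C) array of per-class scores; labels: (N,) integer class ids.
    Returns a dict mapping each k to its accuracy in [0, 1].
    """
    order = np.argsort(-logits, axis=1)  # class indices, best score first
    return {
        k: float(np.mean([labels[i] in order[i, :k] for i in range(len(labels))]))
        for k in ks
    }
```

Multiply by 100 to match the percentage figures used on the leaderboard.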

Known Challenges

  • Video downloads occasionally fail due to host availability; rerun the downloader scripts until all archives are fetched.
  • Labels are heavily templated; consider text augmentation so models do not overfit to the exact phrasing.
  • Clips are low-resolution (240p), so models rely more on temporal cues than on fine spatial detail.
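For the flaky-download problem above, wrapping the fetch call in a retry loop with exponential backoff is usually enough. A minimal sketch, where `fetch` stands in for whatever HTTP client the downloader script uses (the helper name and its parameters are illustrative):

```python
import time

def download_with_retry(fetch, url, attempts=3, backoff=1.0):
    """Call fetch(url) up to `attempts` times, backing off between failures.

    Re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            time.sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The same pattern can wrap each archive part of the dataset download so one transient host error does not abort the whole run.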

Cite

@inproceedings{goyal2017something,
  title     = {The `Something Something' Video Database for Learning and Evaluating Visual Common Sense},
  author    = {Goyal, Raghav and Kahou, Samira Ebrahimi and Michalski, Vincent and others},
  booktitle = {Proceedings of the IEEE International Conference on Computer Vision},
  year      = {2017}
}