Something-Something V2

Summary

Something-Something V2 contains short clips of human-object interactions, collected by asking crowd workers to act out templated instructions (e.g. "Putting something into something"). Its fine-grained verb-noun combinations drive research in temporal reasoning, compositional action recognition, and grounding natural-language descriptions to video.

Reference Paper

  • R. Goyal et al. "The 'Something Something' Video Database for Learning and Evaluating Visual Common Sense." ICCV, 2017.

Benchmarks & Baselines

  • Temporal Shift Module (TSM) - Top-1: 63.4%, Top-5: 89.5%; Lin et al., ICCV 2019.
  • ViViT-L/16x2 - Top-1: 68.8%, Top-5: 90.6%; Arnab et al., ICCV 2021.
  • The standard split includes train, validation, and test sets; test labels are withheld, and predictions are submitted to the leaderboard via EvalAI.
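Top-1 and Top-5 accuracy as reported above can be computed from model logits with a few lines of NumPy; this is a minimal sketch (the function name and signature are illustrative, not from any official evaluation kit):

```python
import numpy as np

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Fraction of samples whose true label appears in the top-k scored classes.

    logits: (N, C) array of per-class scores; labels: (N,) integer class ids.
    Returns a dict mapping each k to its accuracy in [0, 1].
    """
    order = np.argsort(-logits, axis=1)  # class indices, best score first
    return {
        k: float(np.mean([labels[i] in order[i, :k] for i in range(len(labels))]))
        for k in ks
    }
```

Multiply by 100 to match the percentage figures used on the leaderboard.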

Known Challenges

  • Video downloads occasionally fail due to host availability; rerun the downloader scripts until all archives are fetched.
  • Labels are heavily templated; consider text augmentation so models do not overfit to the exact phrasing.
  • Clips are low-resolution (240p), so models rely more on temporal cues than on fine spatial detail.
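For the flaky-download problem above, wrapping the fetch call in a retry loop with exponential backoff is usually enough. A minimal sketch, where `fetch` stands in for whatever HTTP client the downloader script uses (the helper name and its parameters are illustrative):

```python
import time

def download_with_retry(fetch, url, attempts=3, backoff=1.0):
    """Call fetch(url) up to `attempts` times, backing off between failures.

    Re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            time.sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The same pattern can wrap each archive part of the dataset download so one transient host error does not abort the whole run.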

Cite

@inproceedings{goyal2017something,
  title     = {The `Something Something' Video Database for Learning and Evaluating Visual Common Sense},
  author    = {Goyal, Raghav and Kahou, Samira Ebrahimi and Michalski, Vincent and others},
  booktitle = {Proceedings of the IEEE International Conference on Computer Vision},
  year      = {2017}
}