UCF-101

  • Modality: RGB video (YouTube clips)
  • Primary Tasks: Action recognition, temporal modeling
  • Scale: 13,320 clips, 101 action classes, 27 hours of video
  • License: Research use only (UCF terms)
  • Access: https://www.crcv.ucf.edu/data/UCF101.php

Summary

UCF-101 is one of the most widely used benchmarks in action recognition. It contains 13,320 realistic action video clips sourced from YouTube, grouped into 101 categories spanning human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. The dataset provides significant diversity in camera motion, object appearance, viewpoint, cluttered background, and lighting conditions, making it a challenging yet approachable benchmark for video understanding research.

Reference Paper

  • Khurram Soomro, Amir Roshan Zamir, Mubarak Shah. "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild." arXiv:1212.0402, 2012.

Benchmarks & Baselines

  • Two-Stream ConvNet - Top-1: 88.0% — Simonyan & Zisserman, NeurIPS 2014.
  • I3D (RGB + Flow) - Top-1: 98.0% — Carreira & Zisserman, CVPR 2017.
  • TimeSformer - Top-1: 96.0% — Bertasius et al., ICML 2021.
  • Standard evaluation uses 3 official train/test splits; accuracy is averaged over the 3 splits.
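The split-averaging protocol above can be sketched as a small helper. The per-split correct/total counts below are hypothetical, for illustration only, not reported results:

```python
def split_averaged_accuracy(per_split_correct, per_split_total):
    """Average top-1 accuracy over the three official UCF-101 splits.

    per_split_correct / per_split_total: correct-prediction counts and
    clip counts for splits 1-3 (illustrative values used below).
    """
    if len(per_split_correct) != 3 or len(per_split_total) != 3:
        raise ValueError("UCF-101 evaluation uses exactly 3 splits")
    # Compute accuracy per split first, then take an unweighted mean
    # over the splits, matching the standard reporting convention.
    per_split_acc = [c / t for c, t in zip(per_split_correct, per_split_total)]
    return sum(per_split_acc) / len(per_split_acc)

# Hypothetical counts for the three test splits (not real results).
acc = split_averaged_accuracy([3360, 3310, 3305], [3783, 3734, 3696])
print(f"{acc:.4f}")  # mean of the three per-split accuracies
```

Note the mean is over per-split accuracies, not over pooled clips, so splits with slightly different test-set sizes contribute equally.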

Tooling & Ecosystem

  • torchvision ships a UCF101 dataset class (torchvision.datasets.UCF101) that consumes the official train/test split files.
  • Video-understanding toolkits such as MMAction2 and PyTorchVideo provide ready-made UCF-101 configs and data loaders.

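As a concrete example of the annotation layout, the official Train/Test Splits archive contains `classInd.txt` (one `index name` pair per line, indices starting at 1) and `trainlist0{1,2,3}.txt` (relative clip path plus a 1-based class label). A minimal stdlib parser, demonstrated on inline sample lines rather than the real files:

```python
def parse_class_index(lines):
    """Map class name -> 0-based label from classInd.txt lines ('1 ApplyEyeMakeup')."""
    mapping = {}
    for line in lines:
        idx, name = line.split()
        mapping[name] = int(idx) - 1  # shift the 1-based index to 0-based
    return mapping

def parse_train_list(lines):
    """Yield (clip_path, 0-based label) pairs from trainlist01.txt-style lines."""
    for line in lines:
        path, label = line.split()
        yield path, int(label) - 1

# Sample lines mirroring the format of the distributed split files.
class_lines = ["1 ApplyEyeMakeup", "2 ApplyLipstick"]
train_lines = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1"]

print(parse_class_index(class_lines))
# -> {'ApplyEyeMakeup': 0, 'ApplyLipstick': 1}
print(list(parse_train_list(train_lines)))
# -> [('ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi', 0)]
```

The test lists (`testlist0{1,2,3}.txt`) contain only clip paths, so labels there must be recovered from the class-name prefix of each path via the `classInd.txt` mapping.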
Known Challenges

  • Dataset is relatively small by modern standards; models pretrained on Kinetics can saturate performance.
  • Some YouTube links may have expired; the dataset is typically distributed as pre-downloaded archives.
  • Class overlap exists between certain categories (e.g., similar sports actions).
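The class-overlap issue above can be quantified from a model's predictions. A sketch (with made-up label/prediction pairs, not real model output) that surfaces the most frequent off-diagonal confusions:

```python
from collections import Counter

def top_confusions(true_labels, pred_labels, k=1):
    """Return the k most frequent (true, predicted) class pairs among errors."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, pred_labels) if t != p
    )
    return errors.most_common(k)

# Hypothetical predictions over visually similar swimming classes.
y_true = ["BreastStroke", "BreastStroke", "FrontCrawl", "FrontCrawl", "Diving"]
y_pred = ["FrontCrawl", "BreastStroke", "BreastStroke", "FrontCrawl", "Diving"]
print(top_confusions(y_true, y_pred, k=2))
```

Pairs that repeatedly top this list (e.g., similar sports actions) are good candidates for error analysis or for reporting per-class accuracy alongside the overall number.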

Cite

@article{soomro2012ucf101,
  title   = {UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild},
  author  = {Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal = {arXiv preprint arXiv:1212.0402},
  year    = {2012}
}