UCF-101

  • Modality: RGB video (YouTube clips)
  • Primary Tasks: Action recognition, temporal modeling
  • Scale: 13,320 clips, 101 action classes, 27 hours of video
  • License: Research use only (UCF terms)
  • Access: https://www.crcv.ucf.edu/data/UCF101.php

Summary

UCF-101 is one of the most widely used benchmarks in action recognition. It contains 13,320 realistic action video clips sourced from YouTube, grouped into 101 categories spanning human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. The dataset provides significant diversity in camera motion, object appearance, viewpoint, cluttered background, and lighting conditions, making it a challenging yet approachable benchmark for video understanding research.

Reference Paper

  • Khurram Soomro, Amir Roshan Zamir, Mubarak Shah. "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild." arXiv:1212.0402, 2012.

Benchmarks & Baselines

  • Two-Stream ConvNet - Top-1: 88.0% — Simonyan & Zisserman, NeurIPS 2014.
  • I3D (RGB + Flow) - Top-1: 98.0% — Carreira & Zisserman, CVPR 2017.
  • TimeSformer - Top-1: 96.0% — Bertasius et al., ICML 2021.
  • Standard evaluation uses 3 official train/test splits; accuracy is averaged over the 3 splits.
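The split-averaging protocol above can be sketched as a small helper. The per-split correct/total counts below are hypothetical, for illustration only, not reported results:

```python
def split_averaged_accuracy(per_split_correct, per_split_total):
    """Average top-1 accuracy over the three official UCF-101 splits.

    per_split_correct / per_split_total: correct-prediction counts and
    clip counts for splits 1-3 (illustrative values used below).
    """
    if len(per_split_correct) != 3 or len(per_split_total) != 3:
        raise ValueError("UCF-101 evaluation uses exactly 3 splits")
    # Compute accuracy per split first, then take an unweighted mean
    # over the splits, matching the standard reporting convention.
    per_split_acc = [c / t for c, t in zip(per_split_correct, per_split_total)]
    return sum(per_split_acc) / len(per_split_acc)

# Hypothetical counts for the three test splits (not real results).
acc = split_averaged_accuracy([3360, 3310, 3305], [3783, 3734, 3696])
print(f"{acc:.4f}")  # mean of the three per-split accuracies
```

Note the mean is over per-split accuracies, not over pooled clips, so splits with slightly different test-set sizes contribute equally.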

Tooling & Ecosystem

  • torchvision ships a UCF101 dataset class (torchvision.datasets.UCF101) that consumes the official train/test split files.
  • Video-understanding toolkits such as MMAction2 and PyTorchVideo provide ready-made UCF-101 configs and data loaders.

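As a concrete example of the annotation layout, the official Train/Test Splits archive contains `classInd.txt` (one `index name` pair per line, indices starting at 1) and `trainlist0{1,2,3}.txt` (relative clip path plus a 1-based class label). A minimal stdlib parser, demonstrated on inline sample lines rather than the real files:

```python
def parse_class_index(lines):
    """Map class name -> 0-based label from classInd.txt lines ('1 ApplyEyeMakeup')."""
    mapping = {}
    for line in lines:
        idx, name = line.split()
        mapping[name] = int(idx) - 1  # shift the 1-based index to 0-based
    return mapping

def parse_train_list(lines):
    """Yield (clip_path, 0-based label) pairs from trainlist01.txt-style lines."""
    for line in lines:
        path, label = line.split()
        yield path, int(label) - 1

# Sample lines mirroring the format of the distributed split files.
class_lines = ["1 ApplyEyeMakeup", "2 ApplyLipstick"]
train_lines = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1"]

print(parse_class_index(class_lines))
# -> {'ApplyEyeMakeup': 0, 'ApplyLipstick': 1}
print(list(parse_train_list(train_lines)))
# -> [('ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi', 0)]
```

The test lists (`testlist0{1,2,3}.txt`) contain only clip paths, so labels there must be recovered from the class-name prefix of each path via the `classInd.txt` mapping.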
Known Challenges

  • Dataset is relatively small by modern standards; models pretrained on Kinetics can saturate performance.
  • Some YouTube links may have expired; the dataset is typically distributed as pre-downloaded archives.
  • Class overlap exists between certain categories (e.g., similar sports actions).
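The class-overlap issue above can be quantified from a model's predictions. A sketch (with made-up label/prediction pairs, not real model output) that surfaces the most frequent off-diagonal confusions:

```python
from collections import Counter

def top_confusions(true_labels, pred_labels, k=1):
    """Return the k most frequent (true, predicted) class pairs among errors."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, pred_labels) if t != p
    )
    return errors.most_common(k)

# Hypothetical predictions over visually similar swimming classes.
y_true = ["BreastStroke", "BreastStroke", "FrontCrawl", "FrontCrawl", "Diving"]
y_pred = ["FrontCrawl", "BreastStroke", "BreastStroke", "FrontCrawl", "Diving"]
print(top_confusions(y_true, y_pred, k=2))
```

Pairs that repeatedly top this list (e.g., similar sports actions) are good candidates for error analysis or for reporting per-class accuracy alongside the overall number.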

Cite

@article{soomro2012ucf101,
  title   = {UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild},
  author  = {Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
  journal = {arXiv preprint arXiv:1212.0402},
  year    = {2012}
}