# UCF-101
- Modality: RGB video (YouTube clips)
- Primary Tasks: Action recognition, temporal modeling
- Scale: 13,320 clips, 101 action classes, 27 hours of video
- License: Research use only (UCF terms)
- Access: https://www.crcv.ucf.edu/data/UCF101.php
## Summary
UCF-101 is one of the most widely used benchmarks in action recognition. It contains 13,320 realistic video clips collected from YouTube and grouped into 101 action categories spanning five types: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. The clips exhibit large variation in camera motion, object appearance, viewpoint, background clutter, and illumination, making UCF-101 a challenging yet approachable benchmark for video understanding research.
## Reference Paper
- Khurram Soomro, Amir Roshan Zamir, Mubarak Shah. "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild." arXiv:1212.0402, 2012.
## Benchmarks & Baselines
- Two-Stream ConvNet - Top-1: 88.0% — Simonyan & Zisserman, NeurIPS 2014.
- I3D (RGB + Flow) - Top-1: 98.0% — Carreira & Zisserman, CVPR 2017.
- TimeSformer - Top-1: 96.0% — Bertasius et al., ICML 2021.
- Standard evaluation uses the 3 official train/test splits; top-1 accuracy is reported as the average over the three splits.
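The averaging step is simple but often mis-reported; a minimal sketch of the protocol, where `evaluate_split` is a hypothetical stand-in for a real inference loop over the official test lists (the placeholder accuracies below are illustrative, not published numbers):

```python
def evaluate_split(split_id: int) -> float:
    """Placeholder: a real implementation would run inference over the
    clips listed in the official testlist for this split and return the
    fraction of correctly classified clips."""
    # Illustrative per-split accuracies (not real results).
    return {1: 0.881, 2: 0.875, 3: 0.884}[split_id]

def ucf101_accuracy() -> float:
    # Standard protocol: evaluate on each of the three official splits
    # and report the mean top-1 accuracy.
    accs = [evaluate_split(s) for s in (1, 2, 3)]
    return sum(accs) / len(accs)

print(round(ucf101_accuracy(), 4))  # 0.88
```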
## Tooling & Ecosystem
- PyTorchVideo provides built-in UCF-101 dataloaders.
- MMAction2 includes configs and pretrained models for UCF-101.
- TensorFlow Datasets offers ready-to-use splits.
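When rolling your own loader instead of using one of these toolkits, labels can be recovered from the archive's filename convention, `v_<Class>_g<group>_c<clip>.avi` (a sketch assuming that naming scheme; the pattern and helper are illustrative, not part of any toolkit above):

```python
import re

# UCF-101 clip names encode class, group, and clip index,
# e.g. "v_ApplyEyeMakeup_g08_c01.avi" (assumed naming convention;
# verify against your copy of the archive).
_PATTERN = re.compile(r"v_(?P<label>\w+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi")

def parse_clip_name(name: str):
    """Return (class_name, group, clip) parsed from a UCF-101 filename."""
    m = _PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected clip name: {name}")
    return m["label"], int(m["group"]), int(m["clip"])

print(parse_clip_name("v_ApplyEyeMakeup_g08_c01.avi"))
# ('ApplyEyeMakeup', 8, 1)
```

The group index matters for clean evaluation: clips from the same group come from the same source video, so the official splits keep each group entirely in train or test.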
## Known Challenges
- The dataset is relatively small by modern standards; models pretrained on Kinetics can saturate it (e.g., I3D reaches 98.0% top-1).
- Some YouTube links may have expired; the dataset is typically distributed as pre-downloaded archives.
- Class overlap exists between certain categories (e.g., similar sports actions).