Skeletics-152

  • Modality: 2D/3D skeleton (estimated from RGB video)
  • Primary Tasks: Large-scale skeleton action recognition
  • Scale: ~150,000 clips, 152 action classes
  • License: Research use only
  • Access: https://github.com/skelemoa/quater-gcn

Summary

Skeletics-152 is a large-scale skeleton action recognition dataset built by extracting 2D and 3D pose estimates from a subset of Kinetics videos. With 152 action classes and approximately 150,000 clips, it exceeds the lab-captured NTU RGB+D (60 classes, 56,880 clips) and NTU RGB+D 120 (120 classes, 114,480 clips) in both class count and clip count, and it covers far more varied scenes, viewpoints, and subjects. The skeletons are produced by pose estimation models (e.g., OpenPose, HRNet) rather than captured with depth sensors, so the data reflects in-the-wild conditions, including the estimation noise that comes with them. The dataset connects large-scale video action recognition with skeleton-based methods and makes it possible to evaluate skeleton models at that scale.

Reference Paper

  • Pranay Gupta, Anirudh Thatipelli, Aditya Aggarwal, Shubh Maheshwari, Neel Trivedi, Sourav Das, Ravi Kiran Sarvadevabhatla. "Quo Vadis, Skeleton Action Recognition?" International Journal of Computer Vision (IJCV), 2021.

Benchmarks & Baselines

  • ST-GCN: Top-1 ~36% (Gupta et al., IJCV 2021).
  • QuaterGCN: Top-1 ~39% (Gupta et al., IJCV 2021).
  • MS-G3D: Top-1 ~41% (reported in follow-up works).
  • Standard evaluation uses the official train/val splits provided by the authors.
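
The Top-1 figures above are the fraction of validation clips whose highest-scoring class matches the ground-truth label. A minimal sketch of that metric follows, assuming per-clip class scores are available as plain Python lists; the function name `top1_accuracy` and the toy scores are illustrative and not taken from the authors' code.

```python
def top1_accuracy(scores, labels):
    """Fraction of clips whose highest-scoring class equals the true label.

    scores: list of per-clip score lists (one score per class).
    labels: list of integer class indices, one per clip.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Index of the maximum score is the predicted class.
        predicted = max(range(len(clip_scores)), key=clip_scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy example with 3 classes (made-up numbers, not Skeletics-152 outputs).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.6, 0.3, 0.1]]
toy_labels = [1, 0, 1, 0]
toy_acc = top1_accuracy(toy_scores, toy_labels)
```

For a 152-class model each inner list would hold 152 scores; the computation is otherwise the same.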

Tooling & Ecosystem

  • The QuaterGCN repository (see Access) provides data preparation scripts, a skeleton extraction pipeline, and baseline code.
  • Skeleton extraction relies on OpenPose or HRNet.
  • Compatible with standard GCN-based skeleton action recognition codebases.
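
Many GCN-based skeleton codebases in the ST-GCN family expect each clip as a zero-padded array indexed by channel, frame, joint, and person, usually written (C, T, V, M). The sketch below shows one way per-frame detections could be packed into that layout using nested lists. The helper `pack_clip`, its parameters, and the toy input are assumptions for illustration; the actual preparation scripts in the repository may use a different format.

```python
def pack_clip(frames, num_frames, num_joints, max_persons, channels=3):
    """Pack per-frame detections into a nested list indexed [c][t][v][m].

    frames: list over time; each item is a list of persons; each person is a
            list of joints; each joint is a tuple with `channels` values,
            e.g. (x, y, z) or (x, y, confidence).
    Frames, persons, and joints beyond the requested sizes are dropped;
    missing ones are left as zeros.
    """
    data = [[[[0.0] * max_persons for _ in range(num_joints)]
             for _ in range(num_frames)]
            for _ in range(channels)]
    for t, persons in enumerate(frames[:num_frames]):
        for m, joints in enumerate(persons[:max_persons]):
            for v, coords in enumerate(joints[:num_joints]):
                for c in range(channels):
                    data[c][t][v][m] = float(coords[c])
    return data


# Toy clip: 2 frames, 2 joints; one person in frame 0, two in frame 1.
toy_frames = [
    [[(1, 2, 3), (4, 5, 6)]],
    [[(7, 8, 9), (10, 11, 12)], [(13, 14, 15), (16, 17, 18)]],
]
clip = pack_clip(toy_frames, num_frames=3, num_joints=2, max_persons=2)
```

In practice the nested list would be converted to an array type and stacked with other clips along a leading batch dimension.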

Known Challenges

  • Estimated skeletons are inherently noisier than sensor-captured data (e.g., from Kinect); pose estimation failures on occluded or small subjects are common.
  • Requires downloading the original Kinetics videos, which are subject to YouTube link decay.
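  • A 152-class vocabulary is substantially harder at the skeleton level than the 60 or 120 classes of NTU RGB+D and NTU RGB+D 120; many classes are difficult to distinguish from skeleton data alone, for example when they differ mainly in the objects or scene involved.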
  • Multi-person scenes require person tracking and skeleton assignment as preprocessing.
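
To illustrate the person tracking and skeleton assignment step mentioned above, here is a minimal sketch of greedy nearest-centroid tracking over 2D skeletons. It is a deliberately simple stand-in, not the method used by the dataset authors; the function names, the `max_dist` threshold, and the toy coordinates are all assumptions, and every skeleton is assumed to contain at least one joint.

```python
import math


def centroid(skeleton):
    """Mean (x, y) position of a skeleton given as a list of (x, y) joints."""
    xs = [x for x, _ in skeleton]
    ys = [y for _, y in skeleton]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def greedy_track(frames, max_dist=50.0):
    """Assign a track ID to every skeleton in every frame.

    frames: list over time; each item is a list of skeletons.
    Returns one list of integer track IDs per frame, aligned with the input.
    """
    last_pos = {}   # track ID -> centroid where the track was last seen
    next_id = 0
    assignments = []
    for skeletons in frames:
        centers = [centroid(s) for s in skeletons]
        # All (distance, detection index, track ID) candidates, closest first.
        pairs = sorted(
            (math.dist(c, p), i, tid)
            for i, c in enumerate(centers)
            for tid, p in last_pos.items()
        )
        ids = [None] * len(skeletons)
        used = set()
        for dist, i, tid in pairs:
            if dist > max_dist:
                break  # remaining candidates are even farther away
            if ids[i] is None and tid not in used:
                ids[i] = tid
                used.add(tid)
        for i, c in enumerate(centers):
            if ids[i] is None:  # unmatched detection starts a new track
                ids[i] = next_id
                next_id += 1
            last_pos[ids[i]] = c
        assignments.append(ids)
    return assignments


# Toy scene: two people swap detection order in frame 1; a third appears in frame 2.
toy_scene = [
    [[(0, 0), (2, 0)], [(100, 0), (102, 0)]],
    [[(103, 0), (105, 0)], [(3, 0), (5, 0)]],
    [[(300, 0), (302, 0)]],
]
toy_tracks = greedy_track(toy_scene)
```

A greedy matcher like this breaks down when people cross paths or leave the frame for long stretches, which is one reason crowded in-the-wild clips are harder to preprocess than lab recordings.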

Cite

@article{gupta2021quovadis,
  title   = {Quo Vadis, Skeleton Action Recognition?},
  author  = {Gupta, Pranay and Thatipelli, Anirudh and Aggarwal, Aditya and Maheshwari, Shubh and Trivedi, Neel and Das, Sourav and Sarvadevabhatla, Ravi Kiran},
  journal = {International Journal of Computer Vision},
  year    = {2021}
}