FLAG3D

  • Modality: Multi-view RGB video, 3D skeleton (SMPL), language descriptions
  • Primary Tasks: 3D fitness activity understanding, action recognition, action quality assessment, motion generation
  • Scale: 180,000+ sequences, 60 fitness action categories, 24 camera views per sequence
  • License: Research use only (non-commercial)
  • Access: https://andytang15.github.io/FLAG3D/

Summary

FLAG3D (Fine-grained Language-Aligned Grounded 3D) is a large-scale dataset for 3D fitness activity understanding that combines multi-view RGB videos, SMPL body parameters, 3D skeletons, and natural language descriptions. Captured in a professional motion capture studio with 24 synchronized cameras, it covers 60 categories of fitness actions performed by diverse subjects. Each sequence is annotated with fine-grained language descriptions at both the sequence and segment level, enabling research in text-driven action recognition, motion quality assessment, and language-conditioned motion generation. FLAG3D bridges the gap between vision-based action understanding and language-grounded motion analysis in the fitness domain.
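
To make the annotation layout concrete, one sequence with its two levels of language description can be modelled as a small record. This is only an illustrative sketch: the class names, field names, paths, and example values below are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SegmentDescription:
    """Segment-level description covering frames [start_frame, end_frame)."""
    start_frame: int  # inclusive
    end_frame: int    # exclusive
    text: str


@dataclass
class SequenceRecord:
    """One captured sequence (hypothetical schema, for illustration only)."""
    sequence_id: str
    action_label: int        # index into the 60 action categories
    subject_id: str
    view_paths: List[str]    # one RGB video path per camera view
    description: str         # sequence-level language description
    segments: List[SegmentDescription] = field(default_factory=list)

    def segment_at(self, frame: int) -> Optional[SegmentDescription]:
        """Return the segment description covering a frame, if any."""
        for seg in self.segments:
            if seg.start_frame <= frame < seg.end_frame:
                return seg
        return None


# Made-up example values; real identifiers and texts will differ.
record = SequenceRecord(
    sequence_id="seq_0001",
    action_label=3,
    subject_id="subject_01",
    view_paths=["views/cam00.mp4", "views/cam01.mp4"],
    description="Perform a squat with feet shoulder-width apart.",
    segments=[
        SegmentDescription(0, 30, "Lower the hips until the thighs are parallel to the floor."),
        SegmentDescription(30, 60, "Push through the heels to stand back up."),
    ],
)
print(record.segment_at(45).text)
```

Keeping segment boundaries as half-open frame ranges makes it straightforward to look up the description that applies to any given frame.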

Reference Paper

  • Yansong Tang, Jinpeng Liu, Aoyang Liu, Bin Yang, Wenxun Dai, Yongming Rao, Jiwen Lu, Jie Zhou, Xiu Li. "FLAG3D: A 3D Fitness Activity Dataset with Language Instruction." CVPR, 2023.

Benchmarks & Baselines

  • ST-GCN (skeleton-based action recognition): Top-1 accuracy 86.7% (Tang et al., CVPR 2023).
  • MotionBERT (skeleton-based action recognition): Top-1 accuracy 91.2% (Tang et al., CVPR 2023).
  • CoRe (action quality assessment): Spearman correlation 0.71 (Tang et al., CVPR 2023).
  • T2M-GPT (text-to-motion): FID 0.89, R-Precision Top-3 0.68 (Tang et al., CVPR 2023).
  • Official train/val/test splits are provided; evaluation follows a cross-subject protocol, so test subjects do not appear in the training set.

Tooling & Ecosystem

  • Official code and tools: https://github.com/andytang15/FLAG3D
  • Data includes pre-extracted SMPL parameters, 3D joint positions, and multi-view RGB frames.
  • Compatible with MMAction2 for video-based recognition and PyTorch Geometric for skeleton-based methods.
  • Language annotations formatted for direct use with text-to-motion generation pipelines.
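
For skeleton-based methods, the joint positions are typically paired with a graph over the joints. The sketch below builds a `2 x E` edge list in the same layout that PyTorch Geometric uses for `edge_index`, assuming the standard 24-joint SMPL kinematic tree. Whether the released skeletons use exactly this joint order is an assumption to verify against the official tools; the helper names are hypothetical.

```python
# Parent index of each joint in the standard 24-joint SMPL kinematic tree
# (joint 0 is the pelvis/root and has no parent). Assumed layout, see lead-in.
SMPL_PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8,
                9, 9, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21]


def skeleton_edge_index(parents, undirected=True):
    """Return [sources, targets] with one column per bone (two if undirected)."""
    sources, targets = [], []
    for child, parent in enumerate(parents):
        if parent < 0:  # the root has no incoming bone
            continue
        sources.append(parent)
        targets.append(child)
        if undirected:
            sources.append(child)
            targets.append(parent)
    return [sources, targets]


def bone_vectors(joints, parents):
    """Per-frame bone features: child position minus parent position."""
    bones = []
    for child, parent in enumerate(parents):
        if parent < 0:
            continue
        bones.append(tuple(c - p for c, p in zip(joints[child], joints[parent])))
    return bones


edge_index = skeleton_edge_index(SMPL_PARENTS)
print(len(edge_index[0]))  # number of directed edges

# Dummy frame: joint i placed at (i, 0, 0), only to exercise the helper.
frame = [(float(i), 0.0, 0.0) for i in range(len(SMPL_PARENTS))]
print(len(bone_vectors(frame, SMPL_PARENTS)))
```

The nested list can be converted with `torch.tensor(edge_index, dtype=torch.long)` when a tensor is required.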

Known Challenges

  • Large dataset size: the multi-view RGB data requires significant storage, estimated at over 2 TB at full resolution.
  • Fitness actions can be visually similar (e.g., different squat variations), requiring fine-grained temporal and spatial reasoning.
  • Studio capture environment may limit generalization to in-the-wild fitness videos.
  • Action quality assessment annotations are inherently subjective; inter-annotator agreement varies by action category.
  • SMPL fitting quality varies for fast or complex movements.

Cite

@inproceedings{tang2023flag3d,
  title     = {FLAG3D: A 3D Fitness Activity Dataset with Language Instruction},
  author    = {Tang, Yansong and Liu, Jinpeng and Liu, Aoyang and Yang, Bin and Dai, Wenxun and Rao, Yongming and Lu, Jiwen and Zhou, Jie and Li, Xiu},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}