FLAG3D¶
- Modality: Multi-view RGB video, 3D skeleton (SMPL), language descriptions
- Primary Tasks: 3D fitness activity understanding, action recognition, action quality assessment, motion generation
- Scale: 180,000+ sequences covering 60 fitness action categories, captured with a 24-camera motion-capture setup
- License: Research use only (non-commercial)
- Access: https://andytang15.github.io/FLAG3D/
Summary¶
FLAG3D (a 3D Fitness activity dataset with LAnGuage instruction) is a large-scale dataset for 3D fitness activity understanding that combines multi-view RGB videos, SMPL body parameters, 3D skeletons, and natural language descriptions. Captured in a professional motion-capture studio with 24 synchronized cameras, it covers 60 categories of fitness actions performed by diverse subjects. Each sequence is annotated with fine-grained language descriptions at both the sequence and segment level, enabling research in text-driven action recognition, motion quality assessment, and language-conditioned motion generation. FLAG3D bridges the gap between vision-based action understanding and language-grounded motion analysis in the fitness domain.
Reference Paper¶
- Yansong Tang, Jinpeng Liu, Aoyang Liu, Bin Yang, Wenxun Dai, Yongming Rao, Jiwen Lu, Jie Zhou, Xiu Li. "FLAG3D: A 3D Fitness Activity Dataset with Language Instruction." CVPR, 2023.
Benchmarks & Baselines¶
- ST-GCN - Top-1 Accuracy: 86.7% (skeleton-based action recognition) — Tang et al., CVPR 2023.
- MotionBERT - Top-1 Accuracy: 91.2% (skeleton-based action recognition) — Tang et al., CVPR 2023.
- Action Quality Assessment - Spearman correlation: 0.71 using the CoRe framework — Tang et al., CVPR 2023.
- Text-to-Motion - FID: 0.89, R-Precision Top-3: 0.68 using T2M-GPT — Tang et al., CVPR 2023.
- Official train/val/test splits provided; cross-subject evaluation protocol.
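The cross-subject protocol can be sketched as follows. The official release ships its own split files, so the subject IDs and sequence names here are purely illustrative: the only point is that train and test sets share no subjects.

```python
# Sketch of a cross-subject split, assuming each sequence carries a subject ID.
# Subject IDs and sequence names below are hypothetical; use the official
# FLAG3D split files in practice.

def cross_subject_split(sequences, train_subjects):
    """Partition (subject_id, sequence) pairs so no subject spans both sets."""
    train, test = [], []
    for subject_id, seq in sequences:
        (train if subject_id in train_subjects else test).append(seq)
    return train, test

# Hypothetical toy data: (subject_id, sequence_name) pairs.
data = [("S01", "squat_001"), ("S01", "lunge_002"),
        ("S02", "squat_003"), ("S03", "plank_004")]
train, test = cross_subject_split(data, train_subjects={"S01", "S02"})
print(train)  # ['squat_001', 'lunge_002', 'squat_003']
print(test)   # ['plank_004']
```

Splitting by subject rather than by sequence prevents a model from exploiting subject-specific body shape or style cues that would inflate accuracy.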
Tooling & Ecosystem¶
- Official code and tools: https://github.com/andytang15/FLAG3D
- Data includes pre-extracted SMPL parameters, 3D joint positions, and multi-view RGB frames.
- Compatible with MMAction2 for video-based recognition and PyTorch Geometric for skeleton-based methods.
- Language annotations formatted for direct use with text-to-motion generation pipelines.
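For skeleton-based baselines, the 3D joint positions typically need to be reshaped into the channel-first layout that ST-GCN-style models in MMAction2 consume. The sketch below is an assumption about a typical preprocessing step (fixed-length temporal resampling plus root-centering), not the official tooling; the joint count of 24 matches the SMPL body joints.

```python
import numpy as np

# Sketch: shape a 3D-skeleton sequence into the (C, T, V, M) layout used by
# ST-GCN-style models: channels, frames, joints, persons. The resampling and
# root-centering steps are common-practice assumptions, not FLAG3D's own code.

def to_stgcn_input(joints_tvc: np.ndarray, target_frames: int = 64) -> np.ndarray:
    """joints_tvc: (T, V, 3) joint positions -> (3, target_frames, V, 1) tensor."""
    t = joints_tvc.shape[0]
    # Uniformly resample the time axis to a fixed clip length.
    idx = np.linspace(0, t - 1, target_frames).round().astype(int)
    clip = joints_tvc[idx]                      # (target_frames, V, 3)
    clip = clip - clip[:, :1, :]                # root-center each frame on joint 0
    return clip.transpose(2, 0, 1)[..., None]   # (3, target_frames, V, 1)

seq = np.random.randn(150, 24, 3)               # dummy sequence: 150 frames, 24 joints
x = to_stgcn_input(seq)
print(x.shape)  # (3, 64, 24, 1)
```

The trailing singleton person dimension matters: MMAction2 skeleton pipelines expect it even for single-person data such as fitness sequences.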
Known Challenges¶
- Large dataset size (multi-view RGB data requires significant storage, estimated 2+ TB for full resolution).
- Fitness actions can be visually similar (e.g., different squat variations), requiring fine-grained temporal and spatial reasoning.
- Studio capture environment may limit generalization to in-the-wild fitness videos.
- Action quality assessment annotations are inherently subjective; inter-annotator agreement varies by action category.
- SMPL fitting quality varies for fast or complex movements.
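Because quality labels are subjective, action quality assessment is scored with Spearman rank correlation between predicted and annotated scores (the metric reported in the benchmarks above), which compares rankings rather than absolute values. A stdlib-only sketch, with hypothetical score lists; in practice `scipy.stats.spearmanr` is the usual choice:

```python
# Sketch of Spearman rank correlation for AQA evaluation: rank both score
# lists, then take the Pearson correlation of the ranks. The score values
# below are hypothetical.

def ranks(values):
    """1-based ranks, averaging within tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var = (sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return cov / var

pred = [7.2, 5.1, 9.0, 6.3]   # hypothetical predicted quality scores
gt   = [7.0, 4.8, 9.5, 6.1]   # hypothetical judge scores
print(round(spearman(pred, gt), 3))  # 1.0 (identical ranking)
```

Rank correlation is robust to the annotator-scale differences noted above: two judges who agree on which repetitions are better but use different numeric ranges still yield a correlation of 1.0.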
Cite¶
@inproceedings{tang2023flag3d,
title = {FLAG3D: A 3D Fitness Activity Dataset with Language Instruction},
author = {Tang, Yansong and Liu, Jinpeng and Liu, Aoyang and Yang, Bin and Dai, Wenxun and Rao, Yongming and Lu, Jiwen and Zhou, Jie and Li, Xiu},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}