HumanML3D¶
- Modality: 3D human motion (SMPL joint positions and rotations) + natural language text descriptions
- Primary Tasks: Text-to-motion generation, motion-to-text retrieval, motion captioning
- Scale: 14,616 motion sequences, 44,970 text descriptions, ~28.6 hours of motion data
- License: Research use only
- Access: https://github.com/EricGuo5513/HumanML3D
Summary¶
HumanML3D is the primary benchmark for text-conditioned human motion generation. It combines motion capture data from HumanAct12 and AMASS, re-processed into a unified representation, and paired with 44,970 natural language descriptions (approximately 3 descriptions per motion). The text annotations describe what the person is doing, how they move, and the direction/speed of movement. HumanML3D has become the standard evaluation dataset for text-to-motion models (MDM, MotionDiffuse, T2M-GPT, MLD), establishing metrics like FID, R-Precision, and Diversity that are now used across the field.
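The repo's processed output stores each motion as a NumPy feature array alongside a `#`-delimited caption file. A minimal loading sketch, assuming the repo's `new_joint_vecs/` and `texts/` directory layout (`load_sample` is a hypothetical helper; verify paths against your own checkout):

```python
from pathlib import Path

import numpy as np

def load_sample(root, sample_id):
    """Load one processed sample: a (T, 263) motion feature array and
    its raw text captions.

    Directory names follow the HumanML3D repo's processed output
    (new_joint_vecs/, texts/); adjust for your local checkout.
    """
    root = Path(root)
    motion = np.load(root / "new_joint_vecs" / f"{sample_id}.npy")
    # Each line of a caption file is '#'-delimited: the raw caption,
    # a POS-tagged copy, and start/end times. Keep the first field.
    lines = (root / "texts" / f"{sample_id}.txt").read_text().splitlines()
    captions = [line.split("#")[0] for line in lines if line.strip()]
    return motion, captions
```

Since each motion carries roughly three captions, training loaders typically sample one caption per motion per epoch.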
Reference Paper¶
- Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Tianyu Ji, Xingyu Li, Li Cheng. "Generating Diverse and Natural 3D Human Motions from Text." CVPR, 2022.
Benchmarks & Baselines¶
- MDM - FID: 0.544, R-Precision (Top-3): 0.611 — Tevet et al., ICLR 2023.
- T2M-GPT - FID: 0.116, R-Precision (Top-3): 0.775 — Zhang et al., CVPR 2023.
- MLD - FID: 0.473, R-Precision (Top-3): 0.772 — Chen et al., CVPR 2023.
- MotionDiffuse - FID: 0.630, R-Precision (Top-3): 0.782 — Zhang et al., 2022.
- Standard evaluation uses FID, R-Precision (Top 1/2/3), Diversity, and MultiModality metrics with the official test split.
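R-Precision in this protocol is computed over batches of 32 motion-text pairs: each motion's true description is ranked against the other 31 in the batch by distance in a learned joint embedding space, and a hit is counted when the true pair lands in the top-k. A toy sketch of that ranking step, with plain Euclidean distance standing in for the official learned feature extractor (`r_precision` is an illustrative helper, not the official evaluation script):

```python
import numpy as np

def r_precision(motion_emb, text_emb, top_k=3, batch=32, seed=0):
    """Toy batched R-Precision over paired motion/text embeddings.

    For each motion in a shuffled batch of `batch` pairs, rank all
    `batch` text embeddings by Euclidean distance and count a hit
    when the motion's own text is among the `top_k` nearest.
    """
    rng = np.random.default_rng(seed)
    n = len(motion_emb)
    idx = rng.permutation(n)
    hits, total = 0, 0
    for start in range(0, n - batch + 1, batch):
        b = idx[start:start + batch]
        m, t = motion_emb[b], text_emb[b]
        # Pairwise motion-to-text distance matrix, shape (batch, batch).
        d = np.linalg.norm(m[:, None, :] - t[None, :, :], axis=-1)
        ranks = np.argsort(d, axis=1)
        # Row i's true text sits at column i of the distance matrix.
        hits += sum(i in ranks[i, :top_k] for i in range(batch))
        total += batch
    return hits / total
```

The official numbers depend on the pretrained text/motion encoders shipped with the benchmark, which is why the feature extractor's quality matters (see Known Challenges below).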
Tooling & Ecosystem¶
- HumanML3D repo provides data processing, feature extraction, and evaluation scripts.
- T2M-GPT, MDM, MLD all use HumanML3D as the primary benchmark.
- Motion representation uses 263-dimensional feature vectors (joint positions, velocities, rotations, foot contact).
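The 263 dimensions decompose into per-field blocks for the 22-joint SMPL skeleton. A sketch of that composition, with offsets reconstructed from the paper's description (double-check exact slice boundaries against the repo's processing scripts before indexing real data):

```python
# Field sizes of the 263-d HumanML3D feature vector for a
# 22-joint skeleton; the sizes below are an assumption derived
# from the paper, not copied from the repo.
NUM_JOINTS = 22
sections = {
    "root_rot_velocity": 1,                        # angular velocity about the up axis
    "root_linear_velocity": 2,                     # velocity on the ground plane
    "root_height": 1,
    "ric_data": (NUM_JOINTS - 1) * 3,              # root-relative joint positions
    "rot_data": (NUM_JOINTS - 1) * 6,              # continuous 6D joint rotations
    "local_velocity": NUM_JOINTS * 3,              # per-joint velocities
    "foot_contact": 4,                             # binary heel/toe contact labels
}
assert sum(sections.values()) == 263

# Build a slice per field for indexing a (T, 263) motion array.
offsets, start = {}, 0
for name, size in sections.items():
    offsets[name] = slice(start, start + size)
    start += size
```

For example, `motion[:, offsets["foot_contact"]]` would select the four contact labels at every frame.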
Known Challenges¶
- Motion data originates from AMASS and HumanAct12 mocap recordings, which are lab-captured and may not represent diverse real-world movements.
- Text descriptions are crowd-sourced and vary in specificity and quality.
- Standard metrics (FID, R-Precision) rely on a learned feature extractor whose quality directly affects evaluation reliability.
- Limited action diversity: most motions are locomotion, gestures, and basic interactions; complex or domain-specific motions are underrepresented.
- Evaluation protocol is sensitive to random seeds; variance across runs should be reported.
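Given that seed sensitivity, the common practice is to repeat the full evaluation several times and report mean and standard deviation. A generic sketch (`eval_fn` is a placeholder for any metric computation that accepts a seed):

```python
import statistics

def report_metric(eval_fn, seeds=range(5)):
    """Run a seed-dependent evaluation several times and summarize it.

    Returns (mean, sample standard deviation) over the runs, so results
    can be reported as mean +/- std rather than a single-seed number.
    """
    values = [eval_fn(seed) for seed in seeds]
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std
```

Published tables for this benchmark typically use 20 repetitions with a 95% confidence interval; the five-seed default here is only for illustration.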
Cite¶
@inproceedings{guo2022generating,
title = {Generating Diverse and Natural 3D Human Motions from Text},
author = {Guo, Chuan and Zou, Shihao and Zuo, Xinxin and Wang, Sen and Ji, Tianyu and Li, Xingyu and Cheng, Li},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}