HumanML3D

  • Modality: 3D human motion (SMPL joint positions and rotations) + natural language text descriptions
  • Primary Tasks: Text-to-motion generation, motion-to-text retrieval, motion captioning
  • Scale: 14,616 motion sequences, 44,970 text descriptions, ~28.6 hours of motion data
  • License: Research use only
  • Access: https://github.com/EricGuo5513/HumanML3D

Summary

HumanML3D is the most widely used benchmark for text-conditioned human motion generation. It combines motion capture data from HumanAct12 and AMASS, re-processed into a unified representation, and pairs it with 44,970 natural language descriptions (approximately 3 per motion). The annotations describe what the person is doing, how they move, and the direction and speed of movement. HumanML3D has become the standard evaluation dataset for text-to-motion models (MDM, MotionDiffuse, T2M-GPT, MLD), and its evaluation metrics (FID, R-Precision, Diversity) are now used across the field.

Reference Paper

  • Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, Li Cheng. "Generating Diverse and Natural 3D Human Motions from Text." CVPR, 2022.

Benchmarks & Baselines

  • MDM - FID: 0.544, R-Precision (Top-3): 0.611 — Tevet et al., ICLR 2023.
  • T2M-GPT - FID: 0.116, R-Precision (Top-3): 0.775 — Zhang et al., CVPR 2023.
  • MLD - FID: 0.473, R-Precision (Top-3): 0.772 — Chen et al., CVPR 2023.
  • MotionDiffuse - FID: 0.630, R-Precision (Top-3): 0.782 — Zhang et al., 2022.
  • Standard evaluation uses FID (lower is better), R-Precision at Top-1/2/3 (higher is better), Diversity, and MultiModality, all computed on the official test split.
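
As a rough illustration of how R-Precision is computed, the sketch below follows the protocol described in the reference paper: each motion is ranked against a pool of 32 text embeddings (its own description plus 31 mismatched ones) by Euclidean distance, and Top-k accuracy is the fraction of motions whose true description lands in the first k. The function name `r_precision`, its parameters, and the plain-list embeddings are assumptions made for illustration; the official scripts take embeddings from a pretrained feature extractor and may differ in detail.

```python
import math
import random


def r_precision(text_emb, motion_emb, top_k=3, pool_size=32, seed=0):
    """Return [Top-1, ..., Top-k] retrieval accuracy.

    text_emb[i] and motion_emb[i] are assumed to be a ground-truth pair,
    each given as a sequence of floats of equal length.
    """
    if len(text_emb) != len(motion_emb):
        raise ValueError("text and motion embeddings must be paired")
    if len(text_emb) < pool_size:
        raise ValueError("need at least one full pool of samples")

    rng = random.Random(seed)
    indices = list(range(len(text_emb)))
    rng.shuffle(indices)

    hits = [0] * top_k
    total = 0
    # Use full pools only; leftover samples are dropped.
    for start in range(0, len(indices) - pool_size + 1, pool_size):
        pool = indices[start:start + pool_size]
        for i in pool:
            # Rank every text in the pool by its distance to motion i.
            ranked = sorted(
                pool, key=lambda j: math.dist(motion_emb[i], text_emb[j])
            )
            rank = ranked.index(i)  # position of the true description
            for k in range(top_k):
                if rank <= k:
                    hits[k] += 1
            total += 1
    return [h / total for h in hits]
```

With embeddings from a real feature extractor, `r_precision(text_emb, motion_emb)[2]` would play the role of the Top-3 figure quoted in the list above.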

Tooling & Ecosystem

  • HumanML3D repo provides data processing, feature extraction, and evaluation scripts.
  • T2M-GPT, MDM, and MLD all use HumanML3D as their primary benchmark.
  • Each frame is represented as a 263-dimensional feature vector (root velocities and height, joint positions, joint rotations, joint velocities, and foot contacts).
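
To make the 263-dimensional representation more concrete, the sketch below splits one frame into named components. The layout (ordering and widths, with a 22-joint skeleton and 6D rotations for the 21 non-root joints) reflects the commonly described HumanML3D format and should be treated as an assumption to verify against the repository's processing scripts. The names `LAYOUT` and `split_frame` are hypothetical.

```python
NUM_JOINTS = 22  # assumed SMPL-style skeleton size

# (component name, width) in the assumed order; widths sum to 263.
LAYOUT = [
    ("root_angular_velocity", 1),
    ("root_linear_velocity", 2),
    ("root_height", 1),
    ("local_joint_positions", (NUM_JOINTS - 1) * 3),     # 63
    ("local_joint_rotations_6d", (NUM_JOINTS - 1) * 6),  # 126
    ("joint_velocities", NUM_JOINTS * 3),                # 66
    ("foot_contacts", 4),
]

FEATURE_DIM = sum(width for _, width in LAYOUT)


def split_frame(frame):
    """Split one per-frame feature vector into a dict of named slices."""
    if len(frame) != FEATURE_DIM:
        raise ValueError(f"expected {FEATURE_DIM} values, got {len(frame)}")
    parts = {}
    offset = 0
    for name, width in LAYOUT:
        parts[name] = frame[offset:offset + width]
        offset += width
    return parts
```

Calling `split_frame(frame)["foot_contacts"]` on a frame would return its last four values under this assumed layout.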

Known Challenges

  • Motion data originates from the AMASS and HumanAct12 collections, which are largely lab-captured and may not represent diverse real-world movements.
  • Text descriptions are crowd-sourced and vary in specificity and quality.
  • Standard metrics (FID, R-Precision) rely on a learned feature extractor whose quality directly affects evaluation reliability.
  • Limited action diversity: most motions are locomotion, gestures, and basic interactions; complex or domain-specific motions are underrepresented.
  • Evaluation protocol is sensitive to random seeds; variance across runs should be reported.
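
One way to report variance across runs is a mean with a 95% confidence interval over repeated evaluations. The sketch below assumes a normal approximation (z = 1.96) and the sample standard deviation; the helper name `summarize_runs` is hypothetical, and the numbers in the usage line are placeholders, not measured results.

```python
import math
import statistics


def summarize_runs(values, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(values)
    # Standard error of the mean, scaled by the normal quantile z.
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


# Placeholder FID values from repeated evaluations with different seeds.
mean, half_width = summarize_runs([0.51, 0.55, 0.53, 0.56, 0.52])
print(f"FID: {mean:.3f} ± {half_width:.3f}")
```

Reporting the interval alongside the mean makes it easier to tell whether a gap between two models exceeds seed-to-seed noise.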

Cite

@inproceedings{guo2022generating,
  title     = {Generating Diverse and Natural 3D Human Motions from Text},
  author    = {Guo, Chuan and Zou, Shihao and Zuo, Xinxin and Wang, Sen and Ji, Wei and Li, Xingyu and Cheng, Li},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022}
}