Motion-X++¶
- Modality: Whole-body motion capture (SMPL-X), text descriptions, facial expressions, hand poses
- Primary Tasks: Whole-body motion generation, text-to-motion synthesis, facial expression generation
- Scale: 120,900+ motion sequences, multi-granularity text labels (sequence-level and segment-level), covering diverse action categories
- License: Research use only (non-commercial); requires SMPL-X license
- Access: https://motion-x-dataset.github.io/
Summary¶
Motion-X++ is a substantial extension of the Motion-X dataset, designed to advance whole-body motion generation including facial expressions and hand gestures. It unifies motion data from multiple sources into SMPL-X format and pairs each sequence with multi-granularity text annotations — from coarse action labels to fine-grained natural language descriptions of body, hand, and face movements. The dataset supports text-driven generation of expressive full-body motions, filling a gap left by earlier datasets that focused only on body pose without facial or hand detail. Motion-X++ enables research in controllable motion synthesis, motion-language alignment, and expressive avatar animation.
Reference Paper¶
- Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang. "Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset." NeurIPS, 2024.
PDF
Benchmarks & Baselines¶
- T2M-GPT - FID: 0.116 on Motion-X test split — Lin et al., 2024.
- MotionGPT - FID: 0.232, R-Precision Top-3: 0.782 — Jiang et al., 2024.
- Evaluation follows text-to-motion generation protocol: FID, R-Precision, Diversity, and Multi-modality on the official test split.
- Separate evaluation tracks for body-only and whole-body (body + hands + face) generation.
Tooling & Ecosystem¶
- Official toolkit: https://github.com/IDEA-Research/Motion-X — includes data processing, visualization, and baseline training scripts.
- Requires SMPL-X body model for interpreting motion parameters.
- Compatible with HumanML3D evaluation pipeline for body-only benchmarks.
- Visualization tools support rendering in Blender and PyTorch3D.
Known Challenges¶
- Large storage requirements (hundreds of GB for the full dataset including all modalities).
- Facial expression and hand pose annotations have higher noise levels than body pose due to capture limitations.
- Multi-granularity text alignment is non-trivial: segment-level descriptions may not perfectly match temporal boundaries.
- Requires separate SMPL-X license agreement from MPI, adding friction to data access.
Cite¶
@inproceedings{lin2024motionx,
title = {Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset},
author = {Lin, Jing and Zeng, Ailing and Lu, Shunlin and Cai, Yuanhao and Zhang, Ruimao and Wang, Haoqian and Zhang, Lei},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2024}
}