Skip to content

NTU RGB+D Mutual Actions

  • Modality: RGB video, depth maps, 3D skeleton (25 joints per person), infrared
  • Primary Tasks: Two-person interaction recognition, mutual action recognition
  • Scale: 26 mutual action classes, ~14,000 interaction samples (subset of NTU RGB+D 120)
  • License: Research use only (requires signed agreement via NTU)
  • Access: https://rose1.ntu.edu.sg/dataset/actionRecognition/

Summary

NTU RGB+D Mutual Actions refers to the subset of mutual/interaction action classes from the NTU RGB+D 60 and NTU RGB+D 120 datasets. These samples involve two-person interactions such as handshaking, hugging, pushing, kicking, and pointing. The original NTU RGB+D 60 includes 11 mutual action classes (A50-A60), and NTU RGB+D 120 extends this to 26 mutual classes (adding A106-A120). Each sample provides synchronized skeleton data for both persons, making it the primary benchmark for skeleton-based interaction recognition. The two-person setting introduces unique challenges including person ordering ambiguity, interaction modeling, and relational reasoning between skeletons.

Reference Paper

  • Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, Alex C. Kot. "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding." IEEE TPAMI, 2020. PDF

Benchmarks & Baselines

  • ST-GCN (mutual subset, CS) - Top-1: ~82% — adapted from single-person ST-GCN.
  • 2s-AGCN (mutual subset) - Top-1: ~88% — Shi et al., CVPR 2019.
  • IGFormer - Top-1: ~93% — Li et al., ECCV 2022 (interaction-specific transformer).
  • Evaluation follows NTU RGB+D Cross-Subject and Cross-Setup protocols, filtered to mutual action classes only.

Tooling & Ecosystem

  • Skeleton data is accessed through the NTU RGB+D 60/120 download; mutual action classes can be filtered by action ID.
  • PYSKL supports multi-person skeleton action recognition.
  • Several interaction-specific GCN architectures have open-source implementations on GitHub.

Known Challenges

  • Person ordering: which skeleton is "person 1" vs "person 2" is arbitrary in Kinect capture, requiring order-invariant models.
  • Interaction semantics: actions like "pushing" and "patting" differ subtly and require modeling the relationship between two skeletons.
  • Some mutual actions are asymmetric (e.g., "pointing at someone"), but standard models treat both persons equivalently.
  • Missing or noisy second-person skeleton data when the Kinect fails to track both subjects simultaneously.

Cite

@article{liu2020ntu120,
  title   = {NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding},
  author  = {Liu, Jun and Shahroudy, Amir and Perez, Mauricio and Wang, Gang and Duan, Ling-Yu and Kot, Alex C.},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume  = {42},
  number  = {10},
  pages   = {2684--2701},
  year    = {2020}
}