Benchmark Snapshots¶

High-level view of representative baselines and state-of-the-art (SOTA) models for key datasets. Use this to understand reasonable performance bands before planning experiments. Last updated: 2026-Q1.

Vision Datasets¶

Dataset	Metric	Baseline (paper)	SOTA (2025+)	Notes
Kinetics-700	Top-1	SlowFast R101 (65.7)	InternVideo2-6B (85.4)	Foundation model pretraining now dominant; pure RGB.
UCF-101	Top-1	Two-Stream (88.0)	VideoMAE V2-g (99.6)	Near-saturated; mainly used for pretraining validation.
HMDB-51	Top-1	IDT+FV (61.7)	VideoMAE V2-g (88.1)	Harder than UCF-101 due to viewpoint/scene variation.
ActivityNet	mAP	TSN (89.0)	InternVideo2 (93.2)	Temporal proposals + classification; untrimmed videos.
AVA	mAP @0.5	SlowFast R101 (24.5)	VideoMAEv2 + MViTv2 (42.6)	Spatio-temporal; person-level detection required.
NTU RGB+D 120	CS Top-1	Shift-GCN (85.9)	InfoGCN (93.0)	Cross-setup SOTA trails by ~2 points; skeleton methods lead.
Something-Something V2	Top-1	TSM (63.4)	InternVideo2 (77.1)	Temporal reasoning critical; longer clips help.
FineGym	Event Top-1	TSM hierarchy (86.2)	UniFormerV2-L (90.4)	Hierarchical loss required; pose priors help on sub-actions.
Moments in Time	Top-1	TSN (25.3)	InternVideo2 (48.8)	Extreme class diversity; multi-label noise present.
Diving48	Top-1	TSN (35.1)	VideoMAE V2 (88.7)	Fine-grained temporal structure; no appearance shortcut.
Toyota Smarthome	CS Top-1	I3D (54.8)	MS-G3D (73.2)	Cross-view generalization is the hard evaluation.

Skeleton & Mocap¶

Dataset	Metric	Baseline	SOTA (2025+)	Notes
NTU RGB+D 60	CS Top-1	ST-GCN (81.5)	InfoGCN (93.0)	Foundation dataset for skeleton-based action recognition.
AMASS	MPJPE	VPoser prior (70 mm)	MotionDiffuse (42 mm)	Generative diffusion models dominate long sequence synthesis.
Human3.6M	MPJPE	Martinez et al. (67.5 mm)	MotionBERT (37.2 mm)	Report protocol matters; leaderboard splits differ by joints.
BABEL	Seg. mIOU	Transformer baseline (77.1)	MotionBERT (84.3)	Text-aligned supervision improves segmentation + retrieval.
TotalCapture	MPJPE	TotalCapture baseline (19 mm)	PoseFormerV3D (13.6 mm)	Multi-view fusion with transformer aggregation leads.
PKU-MMD	mAP	ST-GCN (93.7)	FR-Head (96.2)	Phase II (cross-subject) is the harder split.

Wearable Sensors¶

Dataset	Metric	Baseline	SOTA (2025+)	Notes
UCI-HAR	Accuracy	SVM (96.0)	UniHAR (97.5)	Near-saturated; 6 basic activities, smartphone only.
PAMAP2	Accuracy	DeepConvLSTM (94.2)	HARFormer (96.8)	Domain augmentation (SpecAugment, jitter) helps >1 point.
WISDM	Accuracy	Random Forest (91.7)	SelfHAR (96.1)	Subject splits matter; report LOSO + random for comparability.
HAPT	Accuracy	SVM (96.3)	MetaSenseNet (97.5)	Postural transitions remain hardest; model F1 in addition to accuracy.
RealWorld HAR	F1	Position-aware (86.7)	AdaHAR (92.4)	Evaluate across device placements; cross-location generalization critical.
OPPORTUNITY	F1 (Gestures)	DeepConvLSTM (70.1)	CPM-Net (77.8)	Data imbalance; class-weighted losses still recommended.

Multimodal & Egocentric¶

Dataset	Metric	Baseline	SOTA (2025+)	Notes
EPIC-Kitchens-100	Action mAP	TSN RGB+Flow (38.9)	InternVideo2 (52.6)	Multi-modal (audio/text) pretraining narrows gap.
Ego4D	Recall@5 (Episodic)	CLIP retrieval (25.4)	EgoVLP v2 (42.3)	Foundation models fine-tuned on ego data lead.
Charades	mAP	Two-Stream I3D (32.9)	InternVideo2 (52.8)	Consider multi-label calibration for threshold selection.
Ego-Exo4D	Top-1 (Cross-view)	Baseline (46.5)	CrossFormer++ (57.4)	Joint ego-exo transformers are emerging; numbers moving fast.

Using This Table¶

Cite the relevant paper when referencing SOTA metrics; numbers change quickly.
Reproduce baselines with provided configs in tools/ (planned) to validate reproducibility.
Open an issue tagged type:benchmark when updating numbers, including validation logs or leaderboard links.