Benchmark Snapshots
High-level view of representative baselines and state-of-the-art (SOTA) models for key datasets. Use this to understand reasonable performance bands before planning experiments. Last updated: 2026-Q1.
Vision Datasets
| Dataset | Metric | Baseline (paper) | SOTA (2025+) | Notes |
| --- | --- | --- | --- | --- |
| Kinetics-700 | Top-1 | SlowFast R101 (65.7) | InternVideo2-6B (85.4) | Foundation model pretraining now dominant; pure RGB. |
| UCF-101 | Top-1 | Two-Stream (88.0) | VideoMAE V2-g (99.6) | Near-saturated; mainly used for pretraining validation. |
| HMDB-51 | Top-1 | IDT+FV (61.7) | VideoMAE V2-g (88.1) | Harder than UCF-101 due to viewpoint/scene variation. |
| ActivityNet | mAP | TSN (89.0) | InternVideo2 (93.2) | Temporal proposals + classification; untrimmed videos. |
| AVA | mAP@0.5 | SlowFast R101 (24.5) | VideoMAE V2 + MViTv2 (42.6) | Spatio-temporal; person-level detection required. |
| NTU RGB+D 120 | CS Top-1 | Shift-GCN (85.9) | InfoGCN (93.0) | Cross-setup SOTA trails by ~2 points; skeleton methods lead. |
| Something-Something V2 | Top-1 | TSM (63.4) | InternVideo2 (77.1) | Temporal reasoning critical; longer clips help. |
| FineGym | Event Top-1 | TSM hierarchy (86.2) | UniFormerV2-L (90.4) | Hierarchical loss required; pose priors help on sub-actions. |
| Moments in Time | Top-1 | TSN (25.3) | InternVideo2 (48.8) | Extreme class diversity; multi-label noise present. |
| Diving48 | Top-1 | TSN (35.1) | VideoMAE V2 (88.7) | Fine-grained temporal structure; no appearance shortcut. |
| Toyota Smarthome | CS Top-1 | I3D (54.8) | MS-G3D (73.2) | Cross-view generalization is the hard evaluation. |
Skeleton & Mocap
| Dataset | Metric | Baseline | SOTA (2025+) | Notes |
| --- | --- | --- | --- | --- |
| NTU RGB+D 60 | CS Top-1 | ST-GCN (81.5) | InfoGCN (93.0) | Foundation dataset for skeleton-based action recognition. |
| AMASS | MPJPE | VPoser prior (70 mm) | MotionDiffuse (42 mm) | Generative diffusion models dominate long sequence synthesis. |
| Human3.6M | MPJPE | Martinez et al. (67.5 mm) | MotionBERT (37.2 mm) | State the evaluation protocol; leaderboards differ in splits and joint sets. |
| BABEL | Seg. mIoU | Transformer baseline (77.1) | MotionBERT (84.3) | Text-aligned supervision improves segmentation + retrieval. |
| TotalCapture | MPJPE | TotalCapture baseline (19 mm) | PoseFormerV3D (13.6 mm) | Multi-view fusion with transformer aggregation leads. |
| PKU-MMD | mAP | ST-GCN (93.7) | FR-Head (96.2) | Phase II (cross-subject) is the harder split. |
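
Several rows above report MPJPE (mean per-joint position error), which is the average Euclidean distance between predicted and ground-truth joints. The following is a minimal sketch of that definition only; the function name `mpjpe` and the nested-list input layout are illustrative assumptions, and the sketch does no root-centering or rigid alignment, which individual evaluation protocols may require before the error is computed.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error, in the unit of the inputs (e.g. mm).

    pred, target: sequences of frames; each frame is a sequence of
    (x, y, z) joint positions. The layout is an assumption for this sketch.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for frame_pred, frame_target in zip(pred, target):
        if len(frame_pred) != len(frame_target):
            raise ValueError("frames must have the same number of joints")
        for joint_pred, joint_target in zip(frame_pred, frame_target):
            # Euclidean distance between one predicted and one true joint.
            total += math.dist(joint_pred, joint_target)
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


# One frame, two joints: errors of 0 and 5 average to 2.5.
example_pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
example_target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
example_error = mpjpe(example_pred, example_target)
```

Because alignment and joint selection change the result, a number computed this way is only comparable to a table entry when both follow the same protocol.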
Wearable Sensors
| Dataset | Metric | Baseline | SOTA (2025+) | Notes |
| --- | --- | --- | --- | --- |
| UCI-HAR | Accuracy | SVM (96.0) | UniHAR (97.5) | Near-saturated; 6 basic activities, smartphone only. |
| PAMAP2 | Accuracy | DeepConvLSTM (94.2) | HARFormer (96.8) | Domain augmentation (SpecAugment, jitter) helps >1 point. |
| WISDM | Accuracy | Random Forest (91.7) | SelfHAR (96.1) | Subject splits matter; report both LOSO and random splits for comparability. |
| HAPT | Accuracy | SVM (96.3) | MetaSenseNet (97.5) | Postural transitions remain hardest; report F1 in addition to accuracy. |
| RealWorld HAR | F1 | Position-aware (86.7) | AdaHAR (92.4) | Evaluate across device placements; cross-location generalization critical. |
| OPPORTUNITY | F1 (Gestures) | DeepConvLSTM (70.1) | CPM-Net (77.8) | Data imbalance; class-weighted losses still recommended. |
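
The WISDM note recommends leave-one-subject-out (LOSO) evaluation alongside random splits. As a rough sketch of what LOSO means in practice, the helper below builds one fold per subject, holding out all of that subject's samples for testing. The name `loso_splits` and the toy subject IDs are hypothetical; scikit-learn's `LeaveOneGroupOut` provides the same behavior if that library is already in use.

```python
from collections import defaultdict


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for held_out in sorted(by_subject):
        test_indices = by_subject[held_out]
        # Every sample from every other subject goes into the training set.
        train_indices = sorted(
            index
            for subject, indices in by_subject.items()
            if subject != held_out
            for index in indices
        )
        yield held_out, train_indices, test_indices


# Toy example: five samples recorded from three subjects.
subjects = ["s1", "s1", "s2", "s3", "s2"]
folds = list(loso_splits(subjects))
for held_out, train_indices, test_indices in folds:
    print(held_out, train_indices, test_indices)
```

No subject ever appears on both sides of a fold, which is why LOSO scores are usually lower than random-split scores on the same data.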
Multimodal & Egocentric
| Dataset | Metric | Baseline | SOTA (2025+) | Notes |
| --- | --- | --- | --- | --- |
| EPIC-Kitchens-100 | Action mAP | TSN RGB+Flow (38.9) | InternVideo2 (52.6) | Multi-modal (audio/text) pretraining narrows gap. |
| Ego4D | Recall@5 (Episodic) | CLIP retrieval (25.4) | EgoVLP v2 (42.3) | Foundation models fine-tuned on ego data lead. |
| Charades | mAP | Two-Stream I3D (32.9) | InternVideo2 (52.8) | Consider multi-label calibration for threshold selection. |
| Ego-Exo4D | Top-1 (Cross-view) | Baseline (46.5) | CrossFormer++ (57.4) | Joint ego-exo transformers are emerging; numbers moving fast. |
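
The Charades note mentions threshold selection for multi-label outputs. mAP itself is threshold-free, so thresholds only matter once hard per-class predictions are needed. One simple reading, sketched below, is to pick a separate threshold per class that maximizes F1 on a held-out validation set. The function names and the toy scores are assumptions for illustration, and this is not a substitute for probability calibration methods such as temperature scaling.

```python
def f1_at_threshold(scores, labels, threshold):
    """Binary F1 when predicting positive for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def pick_thresholds(val_scores, val_labels):
    """Return one threshold per class, chosen to maximize validation F1.

    val_scores, val_labels: one row per sample, one column per class.
    Ties are resolved in favor of the lowest candidate threshold.
    """
    num_classes = len(val_scores[0])
    thresholds = []
    for class_index in range(num_classes):
        scores = [row[class_index] for row in val_scores]
        labels = [row[class_index] for row in val_labels]
        # Only observed scores can change the set of predicted positives.
        candidates = sorted(set(scores))
        best = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
        thresholds.append(best)
    return thresholds


# Toy validation set: four samples, two classes.
val_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.1]]
val_labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
chosen = pick_thresholds(val_scores, val_labels)
print(chosen)  # [0.8, 0.6]
```

Thresholds chosen this way should be fixed on validation data and then applied unchanged to the test set.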
Using This Table
- Cite the relevant paper when referencing SOTA metrics; numbers change quickly.
- Reproduce baselines with the configs planned for `tools/` to confirm the reported numbers.
- Open an issue tagged `type:benchmark` when updating numbers, including validation logs or leaderboard links.