Datasets for Robotics: Navigation, Manipulation, and Mobile Manipulation

§ 00 Executive summary

If you only read one section.

Manipulation has gone cross-embodiment.

The frontier moved from “130k Google episodes” (RT-1, 2022) to a million+ trajectories pooled across 20+ robots (Open X-Embodiment, AgiBot World, RoboMIND v2). The bet: one policy across every arm beats a per-robot specialist.

Mobile manipulation finally has data.

Mobile ALOHA, AgiBot A2-D, Galaxea R1, AIRBOT MMK2, TienKung. Whole-body bimanual teleop with both joint and dexterous-hand action spaces is now collected at scale.

Egocentric video is the real bet.

The bottleneck is robot teleop. The escape valve is human first-person video: Ego4D, EPIC-KITCHENS, EgoDex (829 hours on Apple Vision Pro), EgoVerse, Xperience. Together they will dwarf all robot data by 2027.

Navigation collapsed into VLA.

R2R, RxR, VLN-CE used to need dedicated waypoint models. Unified VLAs now treat navigation as an 8-waypoint chunk prediction — same loss, same decoder.

The loss has converged.

For continuous control, the field has converged on flow matching with action chunking. RL with PPO/GAE is layered on top to close the gap between imitation and closed-loop success.

Synthetic is no longer a hack.

InternData-A1, GR00T-X-Embodiment-Sim, and in-house pipelines now contribute 5–10% of mixtures. Critically, vision-free synthetic data turns out to teach the action prior better than vision-conditioned synthetic data.

§ 01 A taxonomy

Five families of supervision.

Before listing datasets, fix the categories. A modern VLA is trained on a deliberate mixture of five data families. They differ in what signal they carry, not in what hardware shot them: a robot trajectory provides direct action labels, a YouTube clip does not. The mixture is the model.
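
One way to make "the mixture is the model" concrete is a weighted sampler over the five families. The sketch below is illustrative only: the family keys follow this section, but the weights and the names `MIXTURE` and `sample_families` are placeholders, not a published recipe.

```python
import random
from collections import Counter

# Hypothetical mixture weights (placeholders, not taken from any paper).
MIXTURE = {
    "real_teleop": 0.70,
    "simulation": 0.08,
    "egocentric_human": 0.08,
    "navigation": 0.07,
    "auxiliary_vl": 0.07,
}

def sample_families(n, mixture=MIXTURE, seed=0):
    """Draw n family names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# Each training batch would then pull its examples from the sampled families.
counts = Counter(sample_families(10_000))
```

Changing a single weight changes what the policy sees most often, which is why mixture ratios are treated as a first-class design decision.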

Fig. 01 — Five data families collapsing into one shared output space.

Real teleop

Trajectories captured by a human teleoperating a real robot. Direct, action-labelled, expensive (≈$50–$300 per hour of usable data). Defines the bar for physical realism.

Simulation

Trajectories produced by a planner or RL agent in a renderer. Cheap, infinitely re-rolled, but always carries a sim-to-real gap that needs randomization and domain-adaptation work.

Egocentric human

First-person video of a human doing things. No native action labels; recovered via inverse dynamics, hand tracking, or latent action models. The only signal that scales like the internet.

Navigation

Long-horizon panoramic video paired with instructions or goal categories. Carries 3-DoF planar trajectories and weak semantic supervision.

Auxiliary VL

Anything that keeps the backbone literate: VQA, spatial grounding, action captions, driving VQA, dense embodied descriptions. Small slice of the mixture, outsized effect on instruction following.

§ 02 Sample reel

What the data actually looks like.

Before the catalogue, the raw thing. Below: verified author-published teaser videos for the most-cited datasets in each family, a real cover sheet from Build AI's Egocentric-10K, and direct links to interactive sample browsers on every official project page. Open one and scrub through episodes before you trust the prose.

RT-1 · Robotics Transformer

Real manipulation

Official supplementary video from Google Robotics. Shows the policy executing 700+ tasks in kitchens and offices, with the corresponding teleop demonstrations beside each rollout.

Project page ↗

DROID · 76k Franka demos

Real manipulation

Author-narrated tour through the DROID corpus: 564 scenes, 86 tasks, three synchronized cameras, and the same Franka Panda platform replicated across 13 institutions.

Project page ↗

Mobile ALOHA · Stanford

Mobile manipulation

The video that recalibrated everyone's priors on whole-body teleop. Cooking shrimp, watering plants, riding elevators: all from ≈50 demos per task, co-trained with static ALOHA data.

Project page ↗

Ego4D · Meta AI

Egocentric

Meta's official Ego4D launch reel: 3,670 hours of unscripted first-person video collected across 74 cities. The pretraining substrate behind almost every modern embodied vision encoder.

Project page ↗

Egocentric-10K · Build AI

Egocentric

Official cover sheet for the 10,000-hour factory egocentric corpus — the highest hand-visibility, highest active-manipulation density open ego dataset, captured on Build AI Gen 1 headsets at 1080p / 30 fps.

Hugging Face dataset card ↗

Sample sheet · open source ↗

BridgeData V2

Real manipulation

Browseable per-task sample grid — every one of the 60k WidowX trajectories has third-person + wrist-cam video plus a language label.

3rd-person RGB · Wrist cam · ΔEEF actions · Lang instruction

Open sample browser ↗

Sample sheet · open source ↗

Open X-Embodiment Explorer

Real manipulation

Interactive viewer across 22 embodiments. The fastest way to feel just how heterogeneous the cross-embodiment pool actually is.

22 embodiments · 60 datasets · Unified RLDS schema

Open sample browser ↗

Sample sheet · open source ↗

AgiBot World Colosseo

Mobile manipulation

Five fully-replicated home, retail and office environments captured with a 100-robot farm. Sample videos and per-task statistics on the OpenDriveLab page.

Bimanual humanoid · Dexterous hand · 100+ scenarios

Open sample browser ↗

Sample sheet · open source ↗

EPIC-KITCHENS-100

Egocentric

100 hours of densely-narrated first-person cooking. The interactive visualizer lets you scrub verb/noun annotations frame-by-frame.

First-person RGB · Verb / noun · 90k action segments

Open sample browser ↗

Sample sheet · open source ↗

EgoExo4D

Egocentric

Same activity captured from one egocentric and four exocentric cameras simultaneously, paired with expert commentary. The bridge dataset for ego↔exo viewpoint transfer.

Ego + 4 exo · Gaze + IMU · Expert narration

Open sample browser ↗

Sample sheet · open source ↗

Matterport3D / HM3D viewer

Navigation

Walk through the 1,000 building-scale scans that back R2R, RxR, ObjectNav and HM3D-Semantics in your browser.

Panoramic RGB-D · Mesh · Room semantics

Open sample browser ↗

Sample sheet · open source ↗

Room-Across-Room (RxR)

Navigation

Multilingual VLN: paths annotated with time-aligned spoken instructions in English, Hindi, and Telugu. The official site lets you replay any episode.

Pano RGB · EN / HI / TE · Pose timing

Open sample browser ↗

Sample sheet · open source ↗

LIBERO benchmark suite

Simulation

Standardized lifelong-manipulation benchmark — Spatial / Object / Goal / Long suites with downloadable rollouts.

Franka sim · 130 tasks · Wrist + 3rd-person

Open sample browser ↗

Sample sheet · open source ↗

RoboCasa kitchens

Simulation

Procedurally generated photorealistic kitchens for the GR-1 humanoid. Sample task videos live on the project page.

Procedural scenes · 100+ tasks · Bimanual humanoid

Open sample browser ↗

§ 03 Scale, side-by-side

Where the field is actually investing collection effort.

A log-scale comparison of the headline volumes for every dataset surveyed in this guide. Units are not interchangeable (a teleop hour and a YouTube hour differ in cost by roughly three orders of magnitude and carry different supervision), but the chart maps where the field is currently putting its scaling pressure.

Order-of-magnitude scale (log₁₀ scale; units differ per family)

Legend: Manipulation · Mobile manip · Egocentric · Navigation · Simulation · Auxiliary

Dataset	Headline volume
LAION-5B	5.0 B pairs
AgiBot World	1 M traj
Open X-Embodiment	1 M+ traj
RDT-1B	≈1 M ep
RT-1	130 k ep
RxR	126 k instr
RH20T	110 k clips
RoboMIND v2	107 k traj
DROID	76 k demos
BridgeData V2	60 k traj
BEHAVIOR-1K	1k tasks · 50 scenes
BC-Z	26 k traj
R2R	21,567 instr
Egocentric-10K	10,000 h
Ego4D	3,670 h
EgoExo4D	1,286 h
HM3D	1,000 scans
EgoDex	829 h
Assembly101	513 h
Galaxea Open World	≈500 h
Mobile ALOHA	≈350 demos
HoloAssist	166 h
LIBERO	130 tasks
EPIC-KITCHENS-100	100 h
GNM / ViNT corpora	100+ h
RoboCasa	100+ tasks

Hours and trajectories are not the same currency — ten teleop hours can encode the same skill repertoire as a thousand passive video hours. Read the chart as a map of where the field is investing collection effort, not as a ranking of per-sample value.

§ 04 Modality coverage

What signal each family natively ships.

A policy is only as multimodal as its data. This matrix shows, family by family, which sensor and label streams are typically present out-of-the-box. Read it as a checklist when designing a mixture: a gap here is a gap in your final policy unless another family fills it.

Modality coverage by family: what each family natively ships

Legend: Core supervision · Commonly shipped · Occasional

Columns (modalities): RGB video · Multi-view · Wrist cam · Depth / RGB-D · LiDAR / mesh · Joint angles · EEF pose / Δ · Gripper / DH · Force-torque · Audio · Hand pose · Eye gaze / IMU · Panoramic RGB · Language · Waypoints / traj

Rows (families): Real teleop · Mobile teleop · Egocentric · Navigation · Simulation · Auxiliary VL

§ 05 Manipulation · real robots

Teleoperation is still the gold standard.

Robot manipulation trajectories form ≈74% of the Qwen-VLA pretraining mixture and a similar share of every serious VLA. The supervision is the cleanest you can buy: synchronized multi-view RGB, a language instruction, and a chunk of future actions in the robot's native control convention. The price is the catch: a single high-quality teleop hour costs more than a thousand hours of YouTube.

RT-1

Google · 2022

Real manipulation

First demonstration that a single Transformer policy could absorb hundreds of skills across kitchens and offices when fed enough teleoperated data.

Scale

130k episodes / 17 months

Embodiment

Mobile manipulator (Everyday Robots)

RGB · Lang · ΔEEF · Gripper

BridgeData V2

UC Berkeley · 2024

Real manipulation

Open, low-cost teleop corpus designed for cross-task generalization on a cheap arm. The de-facto benchmark for low-budget VLA research.

Scale

60k trajectories · 13 skills · 24 environments

Embodiment

WidowX 250 (single-arm)

RGB · Wrist cam · Lang · ΔEEF

DROID

Stanford / 13 institutions · 2024

Real manipulation

Distributed-collection consortium dataset capturing the same robot platform across 13 institutions, in dorms, kitchens, and labs. Designed explicitly for scene diversity.

Scale

76k demonstrations · 564 scenes · 86 tasks

Embodiment

Franka Panda (single-arm)

3× RGB · Depth · Lang · ΔEEF / Joint

RH20T

SJTU · 2023

Real manipulation

One-shot generalist data with synchronized multi-view RGB-D, contact, audio, and language — built so policies can imitate from a single human-shown demo.

Scale

≈110k clips · 147 tasks · 7 robot types

Embodiment

Multi-arm

RGB-D · Audio · Force-torque · Lang

BC-Z

Google / Stanford · 2022

Real manipulation

Showed that zero-shot task generalization is possible when teleoperation data is paired with language and human video task descriptions.

Scale

≈26k trajectories · 100 tasks

Embodiment

7-DoF arm

RGB · Language or video task spec

RoboMIND v1/v2

Beijing Academy of AI · 2025

Real manipulation

Multi-embodiment benchmark dataset emphasizing standardized task taxonomies and failure-case annotations across heterogeneous platforms.

Scale

v2: 107k+ trajectories · 479 tasks · 96 object classes

Embodiment

Franka, AgileX, TienKung, UR5e

Multi-view RGB · Depth · Lang · Joint · EEF

AgiBot World

AgiBot · 2025

Real manipulation

Largest open bimanual humanoid corpus to date — five fully replicated home/retail environments collected with a 100-robot farm.

Scale

1M+ trajectories · 217 tasks · 100+ scenarios

Embodiment

AgiBot A2-D (bimanual humanoid)

RGB · Depth · Joint · Dexterous hand · Lang

Galaxea Open World

Galaxea AI · 2025

Real manipulation

Bimanual mobile teleoperation in homes and offices, distributed in a unified observation-action schema for cross-embodiment training.

Scale

≈500 hours

Embodiment

Galaxea R1 (bimanual mobile)

RGB · Joint · Lang · Gripper

RDT-1B

Tsinghua · 2025

Real manipulation

Curated cross-embodiment bimanual aggregate behind the RDT-1B diffusion policy — emphasizes physically aligned bimanual coordination.

Scale

≈1M episodes (aggregated)

Embodiment

Bimanual

RGB · Lang · Joint

RoboCOIN

Multi-institution · 2025

Real manipulation

An open consortium release pooling teleop and simulation across labs into a single, license-clean training pool.

Scale

Multi-platform composite

Embodiment

Heterogeneous

RGB · Lang · Action

RobotSet

CMU / Berkeley · 2023

Real manipulation

Early generalist dataset designed for skill transfer studies across morphologies; an ancestor of Open X-Embodiment style pooling.

Scale

Aggregated

Embodiment

Multiple

RGB · Lang

Open X-Embodiment

DeepMind + 21 institutions · 2024

Real manipulation

The unifying schema that made cross-embodiment training tractable. Most modern VLA pretraining recipes draw a large share of their tokens from this pool.

Scale

1M+ trajectories · 22 embodiments · 60 datasets

Embodiment

Cross-embodiment

RGB · Lang · Various action spaces

§ 06 Mobile manipulation

The whole body is the action space.

Mobile manipulation data is fundamentally different from tabletop teleop because the base is now part of the action. The policy has to coordinate a 3-DoF chassis with two arms (and often two dexterous hands) over minutes, not seconds. Datasets here are smaller, harder to collect, and the most strategically important — they are what general-purpose home robots will be trained on.

Mobile ALOHA

Stanford · 2024

Mobile manipulation

Low-cost wheeled bimanual teleop platform that proved a few dozen demos plus co-training with static ALOHA data unlock cooking, laundry, and elevator-riding.

Scale

≈50 demos × 7 long-horizon tasks

Embodiment

Wheeled bimanual

RGB · Joint · Whole-body teleop

AgiBot A2-D Trajectories

AgiBot · 2025

Mobile manipulation

Whole-body bimanual data with a wheeled base; supports both absolute joint and dexterous-hand action spaces.

Scale

Subset of AgiBot World

Embodiment

Bimanual humanoid + chassis

RGB · Joint · Dexterous hand

Galaxea R1

Galaxea AI · 2025

Mobile manipulation

A commercial bimanual mobile platform whose dataset prioritizes home and storefront tasks with reproducible layouts.

Scale

Hundreds of hours

Embodiment

Bimanual mobile

RGB · Lang · Joint

AIRBOT MMK2

Discover Robotics · 2025

Mobile manipulation

Cheap mobile bimanual with dexterous hands — a popular choice for academic mobile-manipulation studies.

Scale

Open release

Embodiment

Bimanual mobile + dexterous hands

RGB · Joint · DH

TienKung

Beijing X-Humanoid · 2025

Mobile manipulation

Full-body humanoid trajectories with both gripper and dexterous-hand action spaces, useful for embodiment-aware co-training.

Scale

Open subset

Embodiment

Humanoid (bipedal)

RGB · Joint · DH

BEHAVIOR-1K

Stanford · 2024

Mobile manipulation

Simulation-side mobile manipulation: a thousand annotated household activities, useful when paired with sim-to-real navigation pretraining.

Scale

1,000 activities · 50 scenes

Embodiment

Mobile manipulator (sim)

RGB-D · Lang · Whole-body

§ 07 Egocentric human data

The data source that scales like the web.

A single Vision Pro headset can log more hours of dexterous manipulation in a weekend than many teleop labs collect in a year. Egocentric human video carries no native action labels, so the field has invested heavily in pipelines that recover frame-wise 3D hand and finger pose, which feeds the inverse-dynamics or latent action models that produce pseudo-action supervision. Recent VLAs allocate 6–10% of their mixture here, and the number is climbing.
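
The simplest form of pseudo-action recovery is a finite difference over the tracked hand pose. The sketch below assumes a `(T, 3)` array of wrist positions already expressed in a fixed world frame; the function name `pseudo_actions` and the choice of the wrist as the tracked point are illustrative, and a learned inverse-dynamics model would replace this step in a real pipeline.

```python
import numpy as np

def pseudo_actions(wrist_xyz, fps):
    """Turn a tracked wrist trajectory into per-frame pseudo-actions.

    wrist_xyz: (T, 3) wrist positions in metres, in a fixed frame.
    fps: capture rate of the video in frames per second.
    Returns (deltas, velocities), each of shape (T-1, 3):
      deltas     -- position change between consecutive frames (a delta-EEF-style label)
      velocities -- the same change expressed in metres per second
    """
    wrist_xyz = np.asarray(wrist_xyz, dtype=float)
    deltas = np.diff(wrist_xyz, axis=0)
    velocities = deltas * fps
    return deltas, velocities

# Example: a hand moving 1 cm per frame along x at 30 fps.
track = np.stack([0.01 * np.arange(10), np.zeros(10), np.zeros(10)], axis=1)
deltas, velocities = pseudo_actions(track, fps=30)
```

Raw tracking is noisy, so real pipelines typically smooth the trajectory before differencing; that step is omitted here.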

Ego4D

FAIR + 14 universities · 2022

Egocentric human

The foundational egocentric-video pretraining set. Used to seed almost every modern visual encoder for embodiment (R3M, MVP, VC-1).

Scale

3,670 hours · 931 participants · 74 cities

Egocentric RGB · Audio · 3D scan · Narration

EPIC-KITCHENS-100

Bristol / Toronto / Catania · 2022

Egocentric human

Dense first-person kitchen activity with verb/noun grounding; the canonical benchmark for fine-grained manipulation understanding.

Scale

100 hours · 700 sessions · 90k action segments

Egocentric RGB · Narration · Action verbs

EgoDex

Apple · 2025

Egocentric human

Largest dexterous egocentric set captured on Vision Pro — paired 3D hand-and-finger tracking finally makes human video usable as direct action supervision.

Scale

829 hours · 194 tabletop tasks

Egocentric RGB · Apple Vision Pro hand tracking · Finger pose

EgoVerse

Open consortium · 2026

Egocentric human

A collaborative ego platform built to ship in one unified format so labs can pool data without per-dataset wrappers.

Scale

1,300+ hours · 1,965 tasks · 240 scenes

Egocentric RGB · Hand pose · Standardized schema

Xperience

Ropedia · 2026

Egocentric human

Synchronized first-person multimodal recordings with hierarchical instruction annotations — designed as the next-generation Ego4D successor.

Scale

Large-scale multimodal

RGB · Depth · Hand & body mocap · Hierarchical lang

Egocentric-10K

Build AI · 2025

Egocentric human

The largest egocentric corpus to date and the first collected exclusively inside real factories. State-of-the-art in hand visibility and active-manipulation density — built specifically as pretraining fuel for industrial VLA and dexterous policies.

Scale

10,000 hours · 1.08B frames · 192.9k clips · 16.4 TB

Embodiment

Head-mounted (Build AI Gen 1)

1080p RGB @ 30 fps · 128° FoV · Camera intrinsics

Egocentric-10K-Evaluation

Build AI · 2025

Egocentric human

Held-out evaluation slice of Egocentric-10K with dense annotations — used to score hand detection, contact, and active-object grounding in factory settings.

Scale

30,000 annotated frames

RGB · Hand / contact annotations

HoloAssist

Microsoft · 2023

Egocentric human

Mixed-reality egocentric capture of real assembly/repair tasks with synchronized instructor speech — the canonical procedural-assistance dataset for embodied LLMs.

Scale

166 hours · 222 participants (350 instructor–performer pairs) · 20 object procedures

HoloLens 2 RGB · Depth · IMU · Eye gaze · Hand pose · Dialogue

Assembly101

Meta / TUM / Singapore · 2022

Egocentric human

Multi-view (ego + exo) procedural assembly with fine-grained mistake annotations; widely used for action segmentation and error detection.

Scale

513 hours · 4,321 videos · 101 toy assemblies

8× fixed RGB · 4× egocentric RGB · 3D hand pose

EgoExo4D

FAIR + 15 universities · 2024

Egocentric human

Paired egocentric + exocentric capture of skilled activities (cooking, sports, repair) with expert narration — the bridge dataset for ego↔exo viewpoint transfer.

Scale

1,286 hours · 740 participants · 13 cities

Ego RGB · 4–5 exo RGB · Gaze · IMU · Expert commentary

§ 08 Navigation

Long horizons, sparse instructions.

Navigation episodes are long-horizon by nature: the robot has 3 DoF (planar translation and heading) and must execute commands that mix motion primitives with text landmarks. Unified VLAs cast this as predicting an 8-waypoint chunk per step, supervised by the same flow-matching loss used for manipulation. Qwen-VLA splits its ≈7.5% navigation slice three ways: instruction following (4.3%), object searching (2.3%), and target tracking (1.0%).
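
To make the 8-waypoint target concrete, the sketch below builds one training label from a logged planar trajectory. It assumes poses are stored as world-frame `(x, y, yaw)` rows and that waypoints are expressed relative to the robot's current pose; the function name `waypoint_chunk`, the robot-frame convention, and the pad-with-last-pose rule are illustrative choices, not a documented format.

```python
import numpy as np

def waypoint_chunk(poses, t, horizon=8):
    """Return the next `horizon` poses relative to the pose at step t.

    poses: (T, 3) array of world-frame (x, y, yaw) in metres / radians.
    Output: (horizon, 3) array of (forward, left, delta_yaw) in the robot frame.
    Near the end of an episode the final pose is repeated as padding.
    """
    poses = np.asarray(poses, dtype=float)
    idx = np.minimum(np.arange(t + 1, t + 1 + horizon), len(poses) - 1)
    x0, y0, yaw0 = poses[t]
    dx = poses[idx, 0] - x0
    dy = poses[idx, 1] - y0
    c, s = np.cos(yaw0), np.sin(yaw0)
    # Rotate world-frame offsets by -yaw0 so that +x points along the heading.
    forward = c * dx + s * dy
    left = -s * dx + c * dy
    # Wrap the heading change into [-pi, pi).
    dyaw = (poses[idx, 2] - yaw0 + np.pi) % (2 * np.pi) - np.pi
    return np.stack([forward, left, dyaw], axis=1)

# Example: a robot facing +y and advancing 1 m per step.
demo_poses = np.array([[0.0, float(i), np.pi / 2] for i in range(20)])
chunk = waypoint_chunk(demo_poses, t=0)
```

Because the label is just a short `(8, 3)` chunk, it can be fed to the same chunked action decoder as a manipulation target with a different action dimension.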

R2R (Room-to-Room)

ANU / Adelaide · 2018

Navigation

The original Vision-and-Language Navigation benchmark in Matterport3D. Still the standard OSR/SR benchmark for instruction following.

Scale

21,567 instructions · 7,189 paths

Panoramic RGB · English instructions

RxR (Room-Across-Room)

Google · 2020

Navigation

Multilingual, dense, time-aligned VLN with longer paths — the harder cousin of R2R, used to test long-horizon instruction grounding.

Scale

126k instructions · 16.5k paths · 3 languages

Panoramic RGB · EN/HI/TE instructions · Pose timing

VLN-CE

Georgia Tech · 2020

Navigation

Same instructions, but in continuous Habitat environments. Forces models to learn low-level locomotion, not just discrete graph hops.

Scale

R2R/RxR ported to continuous control

RGB-D · Lang · Low-level actions

HM3D / Habitat-Matterport

Meta + Matterport · 2021

Navigation

Highest-fidelity 3D scan set for indoor navigation; the substrate behind most modern object-goal and exploration agents.

Scale

1,000 building-scale scans

RGB-D · Mesh · Semantics

ObjectNav (Habitat Challenge)

Meta · 2022–

Navigation

Find an instance of an object class in an unseen home. The standard semantic-exploration benchmark.

Scale

HM3D-Semantics episodes

RGB-D · Object category target

ScanNet / ScanNet++

Stanford / TUM · 2017 / 2023

Navigation

Reconstructed indoor scans used for navigation pretraining, scene understanding, and as the basis for downstream VLN evaluation.

Scale

1,500+ indoor scans (++: 460 hi-fi)

RGB-D · Mesh · Semantic + instance

GNM / ViNT / NoMaD corpora

UC Berkeley · 2023–24

Navigation

Cross-embodiment driving footage from many wheeled platforms — the data behind general navigation policies that transfer between robots.

Scale

100+ hours of multi-robot driving

Front RGB · Odometry

§ 09 Simulation & synthetic

The infinite, imperfect oracle.

Simulation is no longer a backup. It now serves three distinct roles: a benchmark (LIBERO, SimplerEnv, RoboCasa, RoboTwin), a source of pretraining trajectories (InternData-A1, GR00T-X-Embodiment-Sim), and the rollout environment for reinforcement learning on top of an imitation-pretrained policy. The Qwen-VLA T2A ablation contains a striking result: ≈20% synthetic mixed with 80% real, with vision suppressed, beats every other ratio by ~10 points downstream.

LIBERO

UT Austin · 2023

Simulated manipulation

Lifelong learning benchmark with four axes: Spatial, Object, Goal, Long. The standard simulation report card for VLAs.

Scale

4 suites · 130 tasks

Embodiment

Single Franka

RGB · Wrist cam · Lang · ΔEEF

SimplerEnv

Stanford / Google · 2024

Simulated manipulation

A real-to-sim suite engineered so simulator success correlates with real-robot success — used widely as a cheap proxy for hardware evaluation.

Scale

Aligned to real WidowX / Google Robot

Embodiment

WidowX, Google Robot

RGB · Lang · ΔEEF

RoboCasa

UT Austin / NVIDIA · 2024

Simulated manipulation

Procedurally generated kitchens with photorealistic assets, designed for everyday household manipulation evaluation.

Scale

100+ kitchen tasks · 120 scenes

Embodiment

Mobile + bimanual humanoid (GR-1)

RGB · Lang · Joint

RoboTwin 2.0

Open community · 2025

Simulated manipulation

A dual-arm benchmark with a careful difficulty split that exposes failure modes in long-horizon bimanual coordination.

Scale

50 bimanual tasks · Easy & Hard tiers

Embodiment

Dual-arm

RGB · Lang · Joint

InternData-A1

Shanghai AI Lab · 2025

Simulated manipulation

Simulation trajectories generated by motion planners in diverse virtual scenes — used to widen long-tail object and layout coverage.

Scale

Large-scale planner trajectories

Embodiment

Multiple

RGB · Lang · Action

GR00T-X-Embodiment-Sim

NVIDIA · 2025

Simulated manipulation

Synthetic counterpart to GR00T's training stack: procedurally varied scenes rendered across many embodiments to seed a universal policy.

Scale

Cross-embodiment synthetic

Embodiment

Multiple

RGB · Joint · Lang

DOMINO

Open community · 2026

Simulated manipulation

Zero-shot evaluation of dynamic skills (pouring, sliding, throwing) where contact dynamics dominate — the hardest known generalization probe.

Scale

Dynamic manipulation suite

Embodiment

Single-arm

RGB · Lang

§ 10 Auxiliary vision-language

What keeps the backbone literate.

Auxiliary VL data — driving VQA, 2D spatial grounding, fine-grained action captions, general image-text — is a small fraction of a typical mixture (≈8.5% in Qwen-VLA) but does outsized work: it stops the action-loss gradient from quietly destroying the VLM's language understanding, and it is the only place the model learns the dense vocabulary needed for fine-grained instructions like “rotate clockwise, then slide left.”

nuScenes / Waymo Open / Argoverse 2

Various · 2019–23

Auxiliary

Autonomous-driving VQA and motion-forecasting corpora feed trajectory-centric supervision into general embodied models.

Scale

Thousands of driving scenes

RGB · LiDAR · HD maps · Trajectories

RefCOCO / RefCOCOg / Visual Genome

Various · 2014–17

Auxiliary

2D spatial-grounding data that keeps a VLA backbone literate in “the red mug on the left.”

Scale

Millions of region-phrase pairs

RGB · Bounding boxes · Lang

LAION / DataComp / OBELICS

LAION / community · 2022–24

Auxiliary

The general vision-language substrate used during continual pretraining so the action model does not forget how to read the world.

Scale

Billions of image-text pairs

RGB · Lang · Interleaved

Fine-grained Embodied Captions

Curated · 2025–26

Auxiliary

Dense action-level captions (“rotate clockwise, then slide left”) that disambiguate cases where one coarse label would otherwise cover two different motions.

Scale

≈0.2% of mixtures — small but critical

RGB clip · Dense action description

§ 11 Learning techniques

One table to match data to method.

Every dataset above implies a learning technique. Real teleop wants behavior cloning or flow matching. Ego video wants inverse dynamics or a latent action model. Simulation rollouts want PPO. Below: the techniques that matter in 2026, what they consume, and what each is uniquely good at.

Technique	Data it consumes	Loss	Why it matters
Behavior Cloning (BC). Supervised mimicry. Best for: Plenty of clean teleop, single embodiment	Real teleop trajectories (RT-1, BridgeData V2, DROID)	MSE / cross-entropy on actions	The starting point for almost every manipulation policy. Cheap, but compounds errors out of distribution. Used as the warm-up for everything below.
Action Chunking + Transformer (ACT). Predict an action chunk, not one step. Best for: Bimanual & contact-rich tasks (ALOHA, Mobile ALOHA)	ALOHA & RoboMIND-style synchronized bimanual	L1 over chunked actions + VAE prior	Predicts `H` actions at once and re-plans every `k` steps. The single most important architectural trick for high-frequency bimanual policies.
Diffusion Policy. Denoise the next action chunk. Best for: Multimodal action distributions, dexterous tasks	Mixed teleop with multiple valid solutions	DDPM / DDIM noise prediction	Treats the chunk of future actions as an image to denoise. Captures the multi-modal nature of human teleoperation that L2 losses average away.
Flow Matching for Actions. Continuous-time denoising decoder. Best for: VLA action experts (π₀, Qwen-VLA, GR00T)	Cross-embodiment continuous control	Velocity-field regression with Beta / Sigmoid-Normal timestep priors	The new default. Cheaper than diffusion at inference, and lets a vision-language backbone attach a small DiT action expert that consumes language tokens directly.
Vision-Language-Action Pretraining. VLM + action head, one model. Best for: Cross-task, cross-embodiment generalization	Open X-Embodiment + sim + ego video	Next-token LM loss + flow-matching action loss	RT-2, OpenVLA, π₀, Qwen-VLA. The big idea: keep a pretrained VLM literate, tack on an action expert, supervise with both losses simultaneously.
Text-to-Action (T2A) Pretraining. Learn actions from language alone, no images. Best for: Establishing an action prior before visual grounding	Synthetic + real trajectories with images dropped	Flow matching with Sigmoid-Normal τ-schedule	Qwen-VLA shows that pretraining the action decoder on language-conditioned trajectories (vision suppressed) beats no-T2A by +10.2 pp downstream. Forces the decoder to ground in language, not visual shortcuts.
Embodiment-Aware Prompt Conditioning. Tell the model which robot it is. Best for: Multi-robot, multi-action-space training	Any cross-embodiment mixture	n/a	Prepend a textual description of the platform, arm count, control frequency, and action space. Removes the need for embodiment-specific heads and can enable zero-shot transfer to new robots.
Per-Dataset Quantile Normalization. Scale-free action targets. Best for: Pooling heterogeneous teleop sources	Mixed real-robot trajectories	n/a	Each dataset's action dimensions are mapped to [-1, 1] using the 1st/99th quantiles per source. Removes scale differences across embodiments without losing relative motion structure.
Inverse Dynamics + Pseudo-Actions. Recover actions from video. Best for: Egocentric human video, action-less footage	Ego4D, EPIC-KITCHENS, YouTube	Finite differences on proprioception or learned IDM	Most ego data ships without explicit actions. Frame-wise hand pose plus an inverse-dynamics network produces pseudo-action labels that train policies as if a human were the teleoperator.
Latent Action Models (LAMs). Discover an action space from video. Best for: Web video at scale	Unlabeled internet video (Genie-style)	Reconstruction of next frame from latent	Compress “what changed between two frames” into a small vector, then learn a policy that emits those latents. Used by Genie 2/3 and several robot world-model pipelines to unify human video with robot actions.
World-Model Imitation. Train policies inside a learned simulator. Best for: Sample-efficient RL, policy evaluation	Video + action pairs (DreamGen, UniSim, DreamerV3)	RL or imitation inside imagined rollouts	Pretrain on human video, fine-tune on a small robot dataset, then practice in imagined rollouts. The current frontier for closing the data gap between teleop and reality.
RL with PPO / GAE on Sparse Rewards. Optimize for closed-loop success. Best for: Pushing SFT checkpoints past imitation ceilings	On-policy rollouts in simulation	PPO clipped surrogate + value head	Likelihood-based SFT teaches the policy to imitate; RL teaches it to succeed. Qwen-VLA and π₀-RL use PPO/GAE on simulator success signals; HIL-SERL pursues the same goal with off-policy RL and human interventions on real hardware.
Vision-and-Language Navigation Imitation. Predict waypoints from instruction + history. Best for: R2R, RxR, VLN-CE	Pano video + instruction transcripts	Cross-entropy on discrete actions or flow matching on waypoints	Modern unified models treat VLN as just another action-and-trajectory prediction problem: 8 future waypoints per chunk, supervised the same way as manipulation.
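
The PPO / GAE row above depends on Generalized Advantage Estimation, which is easy to get subtly wrong with sparse rewards. Below is a minimal NumPy sketch for a single rollout segment. It assumes no episode boundary falls inside the segment and that `last_value` is 0 when the segment ends in a terminal state; the `gamma` and `lam` defaults are common choices, not values reported for any model named in the table.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout segment.

    rewards:    (T,) per-step rewards (often all zero except a final success bit).
    values:     (T,) value-head predictions V(s_t).
    last_value: bootstrap value V(s_T); use 0.0 if the episode terminated.
    Returns (advantages, returns), each of shape (T,).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    # One-step TD residuals.
    deltas = rewards + gamma * next_values - values
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Accumulate exponentially weighted residuals backwards in time.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages, advantages + values

# Example: a three-step episode that succeeds on the final step.
adv, ret = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
```

With `gamma = lam = 1` and a zero value head, every step receives the full terminal reward, which is the Monte Carlo limit of the estimator.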

§ 12 Ground-truth collection

From sensor stream to gradient step.

A dataset is the output of a pipeline. For each data family below: how raw signal is actually captured, the hardware behind it, the exact tensor schema it is stored in, and the loss the policy computes against it. Read this section if you intend to build a collection rig, an annotation pipeline, or a new VLA loss.

Sampling rate: 30–1000 Hz
Clock-drift bar: < 5 ms
Episodes kept after QA: 65–85%
Action chunk length H: 16–32

Real teleop · manipulation

Human-in-the-loop joint and end-effector capture

A trained operator drives the robot through the task while every sensor stream is logged at a fixed rate. The labels are not annotated after the fact — they are the operator's commands.

Collection pipeline

01 · Scene reset: A scripted reset places objects within calibrated bounds (tracked via fiducials or an overhead camera). Every episode must be replayable; randomized initial poses are recorded.
02 · Operator interface: Operator wears VR (Quest, Vision Pro) or holds a leader arm (ALOHA, GELLO). Leader joints stream at 50–1000 Hz into an inverse-kinematics or direct-joint mapper.
03 · Synchronized recording: A shared clock (PTP) timestamps RGB(-D), wrist cams, joint encoders, gripper width, F/T sensors, and the operator's command; streams are then aligned by timestamp (for example with ROS 2 message_filters). Drift < 5 ms is the standard bar.
04 · Action logging: Two streams are logged: target (commanded) and achieved (measured). Policies train on commanded actions; achieved is for diagnostics and inverse-dynamics fallbacks.
05 · Language pairing: An instruction is spoken or typed once per episode (or per sub-segment). Whisper / hand-transcription produces the final string; templated re-phrasings (10–30×) are generated by an LLM for robustness.
06 · QA & curation: Episodes are auto-filtered on success (force spikes, gripper closure timing, end-effector pose vs. target). A second pass scores instruction faithfulness; ≈15–35% of raw episodes are discarded.
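
The 5 ms bar in step 03 can be checked offline once timestamps are logged. The sketch below pairs each camera frame with the nearest sample from a faster stream and flags pairs whose skew exceeds the tolerance. The function name `align_nearest` and the nearest-neighbour pairing rule are assumptions for illustration; it also assumes `stream_ts` is sorted and has at least two samples.

```python
import numpy as np

def align_nearest(ref_ts, stream_ts, max_skew_s=0.005):
    """Match each reference timestamp to the nearest sample of another stream.

    ref_ts:    (N,) timestamps in seconds, e.g. camera frames.
    stream_ts: (M,) sorted timestamps in seconds, e.g. joint encoders (M >= 2).
    Returns (idx, ok): idx[i] is the index of the nearest stream sample, and
    ok[i] is True when the residual skew is within max_skew_s.
    """
    ref_ts = np.asarray(ref_ts, dtype=float)
    stream_ts = np.asarray(stream_ts, dtype=float)
    # Candidate neighbours on either side of each reference timestamp.
    right = np.clip(np.searchsorted(stream_ts, ref_ts), 1, len(stream_ts) - 1)
    left = right - 1
    pick_right = (stream_ts[right] - ref_ts) < (ref_ts - stream_ts[left])
    idx = np.where(pick_right, right, left)
    ok = np.abs(stream_ts[idx] - ref_ts) <= max_skew_s
    return idx, ok

# Example: 30 Hz camera frames matched against a 1 kHz encoder stream.
camera_ts = np.array([0.0, 1 / 30, 2 / 30])
encoder_ts = np.arange(0.0, 0.1, 0.001)
idx, ok = align_nearest(camera_ts, encoder_ts)
```

Frames where `ok` is False are candidates for dropping the step or the whole episode during QA, depending on how strict the curation policy is.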

Hardware

Arm: Franka Panda, UR5e, WidowX 250, ARX-5, AgileX Cobot Magic.
Cameras: 2–4 × RealSense D435/D455, ZED 2i, or Logitech BRIO at 30 Hz, plus 1–2 wrist cams.
Teleop: leader arm (ALOHA), VR headset (Quest 3 controllers, Vision Pro hand tracking), or 3D SpaceMouse.
Compute: Jetson Orin or a tethered workstation (RTX 4090) running the ROS / LCM bus.
Cost envelope: $8k–$50k per station; ≈$50–$300 per hour of usable data after QA.

Data representation

Each step t stores { o_t, a_t, ℓ } where o_t = (I_t^cam, q_t, g_t, F_t) bundles RGB(-D) tensors, joint positions, gripper width, and (optionally) force. Actions are stored in the robot's native space — Δ-EEF in SE(3), absolute joint, or joint velocity — never silently converted. Per-dataset 1st/99th-percentile quantile normalization maps each action dimension to [-1, 1].
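The quantile normalization described above can be sketched in a few lines. This is a minimal illustration, assuming actions are stacked as an (N, D) array and that values outside the 1st/99th percentiles are clipped (the clipping is an assumption; the text only specifies the mapping to [-1, 1]). Function names are hypothetical.

```python
import numpy as np

def fit_quantile_stats(actions, lo=1.0, hi=99.0):
    """Per-dimension 1st/99th percentiles over a dataset of shape (N, D)."""
    q_lo = np.percentile(actions, lo, axis=0)
    q_hi = np.percentile(actions, hi, axis=0)
    return q_lo, q_hi

def normalize(actions, q_lo, q_hi, eps=1e-8):
    """Map [q_lo, q_hi] to [-1, 1] per dimension; outliers are clipped."""
    scaled = 2.0 * (actions - q_lo) / (q_hi - q_lo + eps) - 1.0
    return np.clip(scaled, -1.0, 1.0)

def denormalize(normed, q_lo, q_hi, eps=1e-8):
    """Inverse map, used at rollout to recover native-space actions."""
    return (normed + 1.0) / 2.0 * (q_hi - q_lo + eps) + q_lo

rng = np.random.default_rng(0)
# Synthetic 7-D actions with very different scales per dimension.
raw = rng.normal(size=(10_000, 7)) * np.array([0.02, 0.02, 0.02, 0.1, 0.1, 0.1, 1.0])
q_lo, q_hi = fit_quantile_stats(raw)
normed = normalize(raw, q_lo, q_hi)
```

Statistics are fit once per dataset and stored alongside it, so that each embodiment's actions can be mapped back to its native space at deployment.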

Loss · Flow matching with action chunking

How the gradient is computed

L = E_{τ, a_0, a_1}  ‖ v_θ( a_τ , o_t , ℓ , τ ) − ( a_1 − a_0 ) ‖²

The policy predicts a chunk A_t = (a_t, …, a_{t+H-1}) of H=16–32 future actions. In the loss, a_1 is the demonstrated chunk and a_0 is a noise sample of the same shape (typically Gaussian). The interpolant a_τ = (1−τ)a_0 + τa_1 is formed with τ = sigmoid(z), z ∼ Normal(μ=−0.4, σ=1) (a logit-normal schedule), and the network regresses the constant velocity a_1 − a_0 along that path. At rollout, 5–10 Euler steps from τ=0→1 reconstruct the chunk from noise; the first k=4–8 actions are executed before re-planning. Plain BC simplifies this to L = ‖ π_θ(o,ℓ) − a* ‖²; ACT replaces it with chunked L1 plus a CVAE KL term; Diffusion Policy uses the closely related DDPM noise-prediction loss.
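The loss and the rollout loop can be sketched with NumPy. This is a toy illustration, not a training implementation: `v_theta` is any callable standing in for the network, the noise is assumed Gaussian, and the "oracle" field below exists only so the example runs without a learned model.

```python
import numpy as np

def sample_tau(rng, batch, mu=-0.4, sigma=1.0):
    """Sigmoid of a normal sample, matching the τ schedule quoted above."""
    return 1.0 / (1.0 + np.exp(-(mu + sigma * rng.normal(size=batch))))

def flow_matching_loss(v_theta, a1, obs, rng):
    """a1: demonstrated chunks, shape (B, H, D). Returns the scalar MSE loss."""
    a0 = rng.normal(size=a1.shape)                 # noise endpoint (τ = 0)
    tau = sample_tau(rng, a1.shape[0])[:, None, None]
    a_tau = (1.0 - tau) * a0 + tau * a1            # linear interpolant
    target = a1 - a0                               # constant velocity along the path
    pred = v_theta(a_tau, obs, tau)
    return np.mean((pred - target) ** 2)

def euler_rollout(v_theta, obs, shape, rng, n_steps=10):
    """Integrate da/dτ = v_θ from τ=0 (noise) to τ=1 (action chunk)."""
    a = rng.normal(size=shape)
    for k in range(n_steps):
        tau = np.full((shape[0], 1, 1), k / n_steps)
        a = a + v_theta(a, obs, tau) / n_steps
    return a

# Toy check: an oracle field that always points at one fixed chunk (H=16, D=7).
target_chunk = np.linspace(-1, 1, 16 * 7).reshape(1, 16, 7)
oracle = lambda a, obs, tau: (target_chunk - a) / (1.0 - tau)
chunk = euler_rollout(oracle, obs=None, shape=(1, 16, 7), rng=np.random.default_rng(0))
execute_now = chunk[:, :8]      # run the first k=8 actions, then re-plan
```

With a learned `v_theta` the rollout is identical; only the callable changes.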

Mobile manipulation

Whole-body teleop with base + arms in one frame

Same idea as tabletop teleop, but the chassis is now part of the action. The harder problem is keeping the base, arms, and head referenced in a single coordinate frame as the robot moves through a building.

Collection pipeline

01 · Body-frame calibration: Extrinsics between base, torso, arms, head, and external cameras are calibrated once per session against an AprilTag rig. Base odometry drift is bounded by SLAM or motion capture.
02 · Whole-body teleop: The operator is tethered to the base and backdrives it while holding the leader arms (Mobile ALOHA), or wears a haptic exo-suit; base velocity, torso pitch, and two-arm joints are streamed together.
03 · Multi-clock fusion: Base odometry (≈50 Hz), arm encoders (≈500 Hz), and head camera (30 Hz) are PTP-synced and resampled to a 50 Hz canonical rate before storage.
04 · Long-horizon segmentation: Episodes are minutes long. They are split at language sub-instruction boundaries (“go to the fridge”, “open the door”, “grab the milk”) so the policy sees both atomic and composite chunks.
05 · Co-training rebalance: Static-arm episodes (for Mobile ALOHA, the existing static ALOHA datasets) are mixed with mobile episodes at a controlled ratio of 1:1–4:1. Mobile ALOHA reports that co-training with static data improves mobile success rates; the ratio is controlled so that the more frequent static behavior does not dominate the policy.
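Step 03 above, resampling streams of different rates onto one 50 Hz timeline, can be sketched as follows. This is a minimal illustration assuming timestamps are already on a shared clock; the function names and the synthetic signals are hypothetical. Linear interpolation suits joint angles and velocities, while camera frames are paired by zero-order hold.

```python
import numpy as np

def resample_linear(ts, values, target_ts):
    """Linearly interpolate each column of values (N, D) onto target_ts."""
    values = np.asarray(values, dtype=float)
    return np.stack([np.interp(target_ts, ts, values[:, d])
                     for d in range(values.shape[1])], axis=1)

def resample_hold(ts, target_ts):
    """Index of the latest sample at or before each target time."""
    idx = np.searchsorted(ts, target_ts, side="right") - 1
    return np.clip(idx, 0, len(ts) - 1)

canon_ts = np.arange(100) / 50.0                     # 2 s at the 50 Hz canonical rate
arm_ts = np.arange(1000) / 500.0                     # 500 Hz encoders
arm_q = np.sin(arm_ts)[:, None] * np.ones((1, 14))   # placeholder two-arm joints
cam_ts = np.arange(60) / 30.0                        # 30 Hz head camera

arm_50 = resample_linear(arm_ts, arm_q, canon_ts)    # shape (100, 14)
frame_idx = resample_hold(cam_ts, canon_ts)          # camera frame paired with each step
```

Quantities that wrap or live on a manifold (continuous-rotation joints, quaternions) would need an interpolation scheme suited to them rather than `np.interp`.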

Hardware

Platforms: Mobile ALOHA, AgiBot A2-D, Galaxea R1, AIRBOT MMK2, TienKung Pro, Unitree H1/G1. Sensors: head RGB-D (ZED 2i or RealSense D455), 2 × wrist cams, base IMU, wheel encoders or leg joint encoders, optional 2D/3D LiDAR. Compute: on-board Jetson Orin + workstation for recording. Cost envelope: $30k–$200k per platform.

Data representation

Action is the concatenation a_t = [v_base, ω_base, q_torso, q_arm^L, q_arm^R, q_hand^L, q_hand^R] — typically 22–56 DoF. Observations include a head RGB(-D), two wrist RGBs, joint state, base velocity, and a 2-second history. The instruction is hierarchical: a top-level command plus the current sub-instruction.

Loss · Chunked flow matching over the whole body

How the gradient is computed

L_wb = E_{τ}  ‖ v_θ( A_τ , o_{t-K:t} , ℓ_high , ℓ_sub , τ ) − ( A_1 − A_0 ) ‖²

Same flow-matching skeleton, but A_t ∈ ℝ^{H×D_wb} with D_wb up to 56. The base velocity dimensions are weighted ≈0.3× in the loss because their dynamic range is large and would otherwise dominate the gradient. ACT-style training uses chunked L1 with an embodiment-aware prefix token.
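The per-dimension weighting can be sketched as a weight vector broadcast over the chunk. The layout below is a hypothetical 31-DoF example (slot sizes vary by platform), and the 0.3 base weight is the figure quoted above.

```python
import numpy as np

# Hypothetical whole-body layout: (slot name, number of dimensions).
LAYOUT = [("v_base", 2), ("w_base", 1), ("q_torso", 2),
          ("q_arm_L", 7), ("q_arm_R", 7), ("q_hand_L", 6), ("q_hand_R", 6)]

def slot_slices(layout):
    """Map each slot name to its slice in the concatenated action vector."""
    out, start = {}, 0
    for name, dim in layout:
        out[name] = slice(start, start + dim)
        start += dim
    return out, start

def loss_weights(layout, base_weight=0.3):
    """Weight 1.0 everywhere except the base velocity dimensions."""
    slices, total = slot_slices(layout)
    w = np.ones(total)
    for name in ("v_base", "w_base"):
        w[slices[name]] = base_weight
    return w

def weighted_velocity_loss(pred, target, w):
    """pred, target: (B, H, D_wb); w: (D_wb,) broadcast over batch and horizon."""
    return np.mean(w * (pred - target) ** 2)

w = loss_weights(LAYOUT)
loss = weighted_velocity_loss(np.zeros((2, 16, 31)), np.ones((2, 16, 31)), w)
```

Keeping the layout as named slots also makes it straightforward to mask slots that a given embodiment lacks.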

Egocentric human video

Hand pose recovery and inverse-dynamics labelling

Egocentric video carries no native action labels. The collection pipeline is mostly a label-recovery pipeline: turn observed wrist and finger motion into pseudo-actions a policy can train on.

Collection pipeline

01 · Capture: Vision Pro, Aria, GoPro Hero on a chest harness, or Quest 3. Vision Pro and Aria stream native 6-DoF wrist pose and per-finger joints; commodity cams need offline pose reconstruction.
02 · Hand & body pose: HaMeR / WiLoR for monocular hand mesh, MANO for parametric fingers, SLAHMR / TRAM for body. Vision Pro's on-device tracker is the current gold standard for fingers.
03 · Scene anchoring: Camera ego-pose from VIO or SLAM (ARKit, Project Aria SLAM) or from offline structure-from-motion (COLMAP) anchors hand pose in a world frame. Without this step, hand trajectories are only camera-relative and unusable as world-frame actions.
04 · Retargeting: The recovered 21-keypoint hand is retargeted onto a target gripper or 6-DoF hand via an optimization solver (dex-retargeting) that minimizes fingertip and palm error subject to joint limits.
05 · Pseudo-action extraction: Pseudo-actions are computed as finite differences on the retargeted wrist pose and joint angles, or via a learned inverse-dynamics model trained on a small paired (video, action) corpus.
06 · Narration & verbs: Free-form narrations are aligned to clips (Ego4D / EPIC); LLM passes convert them into instruction-style imperatives matching the robot data style.
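The finite-difference route in step 05 can be sketched on SE(3) wrist poses. This is a minimal illustration assuming poses arrive as 4×4 homogeneous matrices in a world frame (after scene anchoring) and that deltas are expressed in the wrist frame at step t; both the frame convention and the function names are assumptions.

```python
import numpy as np

def relative_transforms(T_world_wrist):
    """T_world_wrist: (N, 4, 4) world-frame wrist poses.
    Returns (N-1, 4, 4) deltas T_t^{-1} T_{t+1}, expressed in the wrist frame at t."""
    T_inv = np.linalg.inv(T_world_wrist[:-1])
    return T_inv @ T_world_wrist[1:]

def pseudo_actions(T_world_wrist, q_finger):
    """Concatenate wrist translation deltas with finger-joint deltas.
    Rotation deltas stay as matrices in dT for a downstream converter."""
    dT = relative_transforms(T_world_wrist)
    d_pos = dT[:, :3, 3]
    d_q = np.diff(q_finger, axis=0)
    return dT, np.concatenate([d_pos, d_q], axis=1)

# Toy trajectory: wrist rotated 90° about z, sliding 1 cm per frame along world x.
Rz90 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
poses = np.tile(np.eye(4), (5, 1, 1))
poses[:, :3, :3] = Rz90
poses[:, 0, 3] = 0.01 * np.arange(5)
fingers = np.zeros((5, 15))
dT, actions = pseudo_actions(poses, fingers)
```

In the wrist frame the same motion appears as a step along the negative y axis, which is why the anchoring step matters before differencing.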

Hardware

Capture rigs: Apple Vision Pro (best fingers), Meta Project Aria (best multimodal), Quest 3, GoPro Hero 12 + chest harness, Insta360. Pipelines: ARKit hand tracking, HaMeR, WiLoR, MANO, SLAHMR, dex-retargeting. Cost envelope: a $3.5k Vision Pro can collect 20+ hours of dexterous data per day — three orders of magnitude cheaper per hour than teleop.

Data representation

Per frame: { I_t, T_wrist^{L,R}∈SE(3), q_finger^{L,R}∈ℝ^{15}, T_head∈SE(3), narration }. Pseudo-actions ã_t are produced either by finite-differencing the wrist pose or by a learned inverse-dynamics network, ã_t = IDM_φ(o_{t-K:t+H}).

Loss · Latent-action or IDM-supervised imitation

How the gradient is computed

L_ego = L_IDM( IDM_φ(o_{t-K:t+H}) , a_t ) + L_BC( π_θ(o_t, ℓ) , ã_t )

Two coupled losses. L_IDM trains the inverse-dynamics network on the small paired corpus where true actions a_t exist. L_BC trains the policy on the much larger ego corpus using IDM-produced pseudo-actions ã_t. Latent-action variants (Genie-style) replace ã_t with a discrete code z_t = VQ(f(o_t, o_{t+1})) and predict that code instead, decoupling the policy from any single robot embodiment.
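The two-stage structure (fit an inverse-dynamics model on paired data, then pseudo-label the larger corpus) can be illustrated with a deliberately simple linear model. Everything here is a toy assumption: real IDMs are deep networks over image windows, and the 12-D features stand in for whatever encoding of the observation window is used.

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.normal(size=(12, 7))      # unknown feature-to-action map (toy)

# Small paired corpus: observation-window features with true actions.
X_paired = rng.normal(size=(200, 12))
A_paired = X_paired @ W_true

# Stage 1 (the L_IDM term): fit the inverse-dynamics model on paired data.
W_idm, *_ = np.linalg.lstsq(X_paired, A_paired, rcond=None)

# Stage 2: pseudo-label the much larger unlabelled ego corpus.
X_ego = rng.normal(size=(5000, 12))
A_pseudo = X_ego @ W_idm               # targets for the L_BC term
```

The policy is then trained against `A_pseudo` exactly as it would be against teleop labels, which is what lets the small paired corpus unlock the large unlabelled one.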

Navigation

Panoramic capture, expert paths, and instruction crowdsourcing

Navigation ground truth is two pieces: a 3D substrate (a scanned building) and a corpus of (instruction, path) pairs that humans have authored against that substrate.

Collection pipeline

01 · 3D scan the world: Matterport / Faro / iPhone-LiDAR captures of homes and offices produce textured meshes (Matterport3D, HM3D, Gibson, ScanNet++). Pano viewpoints are sampled on a navigability graph.
02 · Expert path generation: A shortest-path planner (for example A*) produces a reference trajectory between two viewpoints, plus continuous waypoints for VLN-CE-style settings.
03 · Instruction authoring: Crowd workers walk the path in a viewer and write a natural-language description. RxR additionally records timing so the instruction is aligned to motion segments.
04 · Multilingual & re-phrasing: RxR collects English / Hindi / Telugu in parallel; modern pipelines also synthesize 5–20 paraphrases per instruction with an LLM, filtered for path entailment.
05 · Real-robot capture (driving): For GNM / ViNT / NoMaD, the substrate is replaced with hours of front-camera + odometry recordings across many wheeled platforms.

Hardware

Scanning: Matterport Pro2/3, Faro Focus, iPhone Pro LiDAR + Polycam. Simulators: Habitat 3.0, AI2-THOR, iGibson, ManiSkill-Habitat. Real driving rigs: Jackal, LoCoBot, TurtleBot, Spot, custom golf carts with a single front RGB and wheel odometry.

Data representation

Episodes are { M, ℓ, (v_0…v_T) } where M is the scene mesh, ℓ is the instruction, and viewpoints carry pano RGB(-D) plus a heading. Continuous variants store SE(2) waypoints at ≈4 Hz. Unified VLAs reduce this to an 8-waypoint chunk W_t = (Δx, Δy, Δθ)_{1:8} per decision step.
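Building the 8-waypoint chunk from stored SE(2) poses can be sketched as below. The text does not specify whether the deltas are relative to the current pose or to the previous waypoint; this sketch assumes each waypoint is expressed in the robot frame at the decision step, and `waypoint_chunk` is a hypothetical name.

```python
import numpy as np

def wrap_angle(theta):
    """Wrap angles to the interval [-π, π)."""
    return (theta + np.pi) % (2 * np.pi) - np.pi

def waypoint_chunk(poses, t, horizon=8):
    """poses: (N, 3) world-frame (x, y, θ). Returns (horizon, 3) waypoints
    (Δx, Δy, Δθ) expressed in the robot frame at step t."""
    x0, y0, th0 = poses[t]
    future = poses[t + 1 : t + 1 + horizon]
    dx, dy = future[:, 0] - x0, future[:, 1] - y0
    c, s = np.cos(th0), np.sin(th0)
    local_x = c * dx + s * dy          # forward displacement
    local_y = -s * dx + c * dy         # leftward displacement
    return np.stack([local_x, local_y, wrap_angle(future[:, 2] - th0)], axis=1)

# Robot facing +y in the world, advancing 0.25 m per step.
poses = np.stack([np.zeros(10), 0.25 * np.arange(10), np.full(10, np.pi / 2)], axis=1)
chunk = waypoint_chunk(poses, t=0)
```

Expressing waypoints in the robot frame keeps the target independent of where in the building the episode happens to start.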

Loss · Discrete-action XE or waypoint flow matching

How the gradient is computed

L_nav = − Σ_t  log p_θ( a_t | o_t, ℓ )
L_wp = E_{τ}  ‖ v_θ(W_τ, o_t, ℓ, τ) − (W_1 − W_0) ‖²

Classic VLN models output a discrete choice over the navigability graph (cross-entropy). Continuous and unified VLA settings predict the 8-waypoint chunk under the same flow-matching loss used for manipulation. Auxiliary terms include a progress monitor L_prog = ‖ p̂_t − p_t^* ‖ and a stop-classifier head.

Simulation & synthetic

Procedural scenes, motion planners, and rendering at scale

Synthetic ground truth is generated, not collected. The pipeline replaces the human operator with a planner or an RL agent and replaces the camera with a renderer.

Collection pipeline

01 · Asset & scene gen: Procedural scene authors (RoboCasa, BEHAVIOR-1K, ProcTHOR) place objects from PartNet / Objaverse-XL into physics-valid layouts with randomized lighting, textures, and clutter.
02 · Task specification: Each task is defined by an initial state distribution, a goal predicate (e.g. on(cup, tray)), and a success function. PDDL-style task graphs cover long-horizon settings.
03 · Trajectory generation: An OMPL / cuRobo motion planner or an RL agent (PPO with dense reward) solves the task; only successful rollouts are kept. Domain randomization perturbs textures, lighting, friction, and object scale.
04 · Rendering: Isaac Sim, MuJoCo MJX, SAPIEN, or PyBullet renders RGB(-D) from the robot's camera viewpoints. High-end variants ray-trace via Omniverse RTX or Gaussian-splat real scenes for photoreal evaluation.
05 · Vision-suppressed T2A: For text-to-action pretraining, the image is intentionally dropped, leaving (instruction, action chunk) pairs. The Qwen-VLA T2A ablation shows this beats vision-conditioned synthetic data by 10.2 pp.
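The goal predicate in step 02 and the success filter in step 03 can be sketched with axis-aligned bounding boxes. This is a simplified illustration: real simulators expose contact and pose queries, and the `Box` type, the `on` geometry test, and the rollout dictionary layout are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in world coordinates (metres)."""
    min_xyz: tuple
    max_xyz: tuple

def on(top: Box, support: Box, z_tol: float = 0.01) -> bool:
    """True if `top` rests on `support`: its centre lies over the support
    footprint and its bottom face is within z_tol of the support's top face."""
    cx = 0.5 * (top.min_xyz[0] + top.max_xyz[0])
    cy = 0.5 * (top.min_xyz[1] + top.max_xyz[1])
    over = (support.min_xyz[0] <= cx <= support.max_xyz[0]
            and support.min_xyz[1] <= cy <= support.max_xyz[1])
    touching = abs(top.min_xyz[2] - support.max_xyz[2]) <= z_tol
    return over and touching

def keep_successful(rollouts, success_fn):
    """Filter rollouts by a success function applied to the final state."""
    return [r for r in rollouts if success_fn(r["final_state"])]

tray = Box((0.0, 0.0, 0.0), (0.4, 0.3, 0.02))
cup_on = Box((0.10, 0.10, 0.02), (0.18, 0.18, 0.12))
cup_off = Box((0.60, 0.10, 0.00), (0.68, 0.18, 0.10))
rollouts = [{"final_state": cup_on}, {"final_state": cup_off}]
kept = keep_successful(rollouts, lambda cup: on(cup, tray))
```

Only the first rollout survives the filter, mirroring the rule that unsuccessful planner or RL rollouts never reach the dataset.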

Hardware

Pure software stack — but the bottleneck is GPUs: Isaac Sim on RTX 4090 / A100; MuJoCo MJX on TPU or H100 for massively parallel rollouts (10k+ envs). A single H100 can generate ≈100k trajectories per day for tabletop tasks.

Data representation

Identical schema to teleop: { o_t, a_t, ℓ, success, randomization_params }. The extra randomization metadata is what makes sim-to-real domain adaptation tractable. For RL, the buffer also stores (r_t, v_t, log π_old).

Loss · PPO with GAE on sparse success reward

How the gradient is computed

L_PPO = E[ min( ρ_t Â_t , clip(ρ_t, 1−ε, 1+ε) Â_t ) ] − c_v ‖ V_θ(o_t) − R̂_t ‖² + c_H H[π_θ]

Here ρ_t = π_θ(a_t|o_t) / π_old(a_t|o_t) is the probability ratio and Â_t = Σ_k (γλ)^k δ_{t+k} is the GAE advantage, with δ_t = r_t + γV(o_{t+1}) − V(o_t) and r_t the reward. The objective is maximized. The reward is typically binary on success, optionally with shaping terms. PPO is warm-started from imitation-pretrained checkpoints and then closes the gap between “mimics the demo” and “actually succeeds”; Qwen-VLA and π₀-RL both follow this recipe.
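The advantage estimate and the clipped surrogate can be sketched in NumPy. This is a minimal single-trajectory illustration; the value and entropy terms are omitted, and the hyperparameter defaults are common choices rather than values taken from the source.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: (T,), values: (T+1,) with a bootstrap value at the end
    (zero if the episode terminated). Returns advantages of shape (T,)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate (to be maximized), using the probability ratio."""
    ratio = np.exp(logp_new - logp_old)
    return np.mean(np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv))

# Sparse success reward: zero until the final step of a 4-step episode.
rewards = np.array([0.0, 0.0, 0.0, 1.0])
values = np.zeros(5)
adv = gae(rewards, values, gamma=1.0, lam=1.0)
objective = ppo_clip_objective(np.log([2.0]), np.zeros(1), np.ones(1))
```

With γ = λ = 1 and a zero value function, every step inherits the final success reward, which is the undiscounted Monte Carlo return; the clipped objective caps a ratio of 2 at 1 + ε.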

Auxiliary VL data

Box-and-caption annotation that keeps the backbone literate

Auxiliary VL data is collected with classical crowdsourcing — bounding boxes, referring expressions, dense captions, driving QA — and exists to stop the action loss from quietly destroying the VLM's language ability.

Collection pipeline

01 · Source images: Mined from COCO, OpenImages, Visual Genome, LAION, driving logs (nuScenes, Waymo), or robot-camera frames sampled from teleop runs.
02 · Annotation: Mechanical-Turk-style pipelines collect boxes, referring expressions (RefCOCO), region captions, or VQA pairs. Driving sets add HD-map and trajectory ground truth from offline auto-labeling.
03 · Fine-grained action captions: The newest and most surgical slice. A 1–3 s robot clip is densely captioned (“rotate clockwise 30°, then slide left 4 cm”) by trained annotators, producing the supervision that disambiguates coarse labels.
04 · Quality filtering: Image-text pairs are CLIP-scored, deduplicated, and LLM-rewritten for stylistic consistency with downstream instruction formats.

Hardware

No special hardware — the cost is purely human-hours. Annotation tools: CVAT, Label Studio, Scale, Surge.

Data representation

Standard VL pairs: { I, text }, optionally with bounding boxes b ∈ ℝ^4 serialized into the text as <box>x0 y0 x1 y1</box>. For driving: (I, ℓ, future-trajectory).
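Serializing a box into the text stream, and parsing it back, can be sketched as follows. The `<box>x0 y0 x1 y1</box>` tag format is from the text; quantizing coordinates to 1000 integer bins relative to image size is an assumption made for this example.

```python
import re

def box_to_text(box, width, height, bins=1000):
    """Serialize a pixel box (x0, y0, x1, y1) as integer bins in a <box> tag."""
    x0, y0, x1, y1 = box
    q = lambda v, size: min(bins - 1, max(0, int(v * bins / size)))
    return f"<box>{q(x0, width)} {q(y0, height)} {q(x1, width)} {q(y1, height)}</box>"

def text_to_boxes(text):
    """Extract every serialized box from a string as a tuple of ints."""
    pattern = r"<box>(\d+) (\d+) (\d+) (\d+)</box>"
    return [tuple(int(g) for g in m.groups()) for m in re.finditer(pattern, text)]

tag = box_to_text((64, 48, 320, 240), width=640, height=480)
caption = "the red mug " + tag + " next to the sink"
parsed = text_to_boxes(caption)
```

Because the box is ordinary text, it is trained with the same next-token loss as the caption around it.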

Loss · Next-token language-modelling loss

How the gradient is computed

L_VL = − Σ_t  log p_θ( y_t | y_{<t}, I )

Plain causal-LM cross-entropy on the textual targets. The auxiliary VL loss is added to the action loss with a small weight (≈0.1–0.3) so the backbone is continually reminded how to read the world while the action head is being trained. Without it, after ≈20k steps the VLM's grounding capabilities visibly degrade.
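The token-level loss and its combination with the action loss can be sketched in NumPy. This is a minimal illustration: it averages over tokens rather than summing (an assumption, since the formula above is written as a sum), and the 0.2 weight is one point inside the ≈0.1–0.3 range quoted.

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """logits: (T, V) for the positions that predict targets (T,).
    Returns the mean negative log-likelihood per token."""
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def total_loss(action_loss, vl_loss, vl_weight=0.2):
    """Action loss plus the down-weighted auxiliary VL loss."""
    return action_loss + vl_weight * vl_loss

logits = np.zeros((5, 4))                 # uniform prediction over a 4-token vocabulary
targets = np.array([0, 1, 2, 3, 0])
vl = causal_lm_loss(logits, targets)
combined = total_loss(action_loss=1.0, vl_loss=2.0)
```

A uniform prediction over four tokens gives a loss of log 4, a quick sanity check for any implementation.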

§ 13A worked example

How one frontier VLA spends its tokens.

The Qwen-VLA pretraining mixture is a reasonable proxy for what a modern unified VLA looks like in 2026. Three quarters of the budget goes to manipulation trajectories. The remaining quarter is the strategically interesting part — every slice is there for a measurable reason.

Qwen-VLA pretraining mixture

Source: arXiv:2605.30280 · Table 1

100.0%

Slice	Share of mixture
Robot manipulation trajectories	74.2%
Navigation trajectories	7.5%
Egocentric human trajectories	6.0%
Synthetic simulation (ours)	3.7%
General vision-language data	3.4%
Spatial grounding (2D)	2.5%
Autonomous-driving VQA	2.4%
Fine-grained action captions	0.2%
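A mixture like this is typically realized as per-sample source probabilities in the data loader. The sketch below is one simple way to do that, assuming sampling with replacement by source; the dictionary keys are hypothetical names, and the listed shares sum to 99.9% after rounding, so they are renormalized.

```python
import numpy as np

MIXTURE = {
    "robot_manipulation": 74.2,
    "navigation": 7.5,
    "egocentric_human": 6.0,
    "synthetic_sim": 3.7,
    "general_vl": 3.4,
    "spatial_grounding_2d": 2.5,
    "driving_vqa": 2.4,
    "action_captions": 0.2,
}

names = list(MIXTURE)
probs = np.array([MIXTURE[n] for n in names])
probs = probs / probs.sum()          # renormalize the rounded percentages

def sample_sources(n, rng):
    """Draw the data source for each of n training samples."""
    return rng.choice(names, size=n, p=probs)

rng = np.random.default_rng(0)
batch_sources = sample_sources(100_000, rng)
manip_fraction = np.mean(batch_sources == "robot_manipulation")
```

Sampling by source rather than concatenating datasets keeps the mixture stable even when the underlying corpora differ in size by orders of magnitude.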

Why egocentric is only 6%.

Not because it isn't valuable, but because dexterous ego-video supervision still needs an inverse-dynamics or hand-pose pipeline to convert into action labels. As EgoDex / EgoVerse standardize, expect this slice to triple within a year.

Why action captions are 0.2%.

They are surgical, not bulk. Their job is to disambiguate the cases where the same coarse label (“pick up the bowl”) maps to two valid motions. A small slice forces the model to ground action sequences in dense, ordered language.

Why driving VQA at all.

Driving datasets are the only mature source of long-horizon trajectory-centric supervision: ego pose, lane-relative position, future waypoints. They make the same flow-matching head work for autonomous driving with no architectural change.

Why ≈74% manipulation.

The bar for usable physical realism is still set by teleop. Every other family is being added to extend the policy beyond what teleop can cover — never to replace it. That ratio is unlikely to flip before 2028.

§ 14Benchmarks

How the field reports a number.

These are the suites a 2026 unified VLA is expected to publish on. Numbers below are reported results for Qwen-VLA-Instruct (arXiv 2605.30280). They are useful less as a leaderboard and more as a map of what the field currently measures — and where it still does not.

Suite	Embodiment	What it measures	Reported score	Note
LIBERO	Single Franka	4 task suites (Spatial / Object / Goal / Long)	97.9%	Effectively saturated by leading VLAs.
Simpler-WidowX	WidowX	Real-to-sim, aligned with real WidowX	73.7%	The honest sim benchmark — correlates with hardware.
RoboCasa-GR1	Bimanual humanoid (GR-1)	24 atomic kitchen tasks	—	Best probe of household generalization.
RoboTwin 2.0 (Easy/Hard)	Dual-arm	50 bimanual tasks	86.1 / 87.2%	Hard tier still exposes long-horizon coordination failures.
R2R (OSR)	Mobile (Matterport3D)	Vision-and-Language Navigation	69.0%	Discrete-graph instruction following.
RxR (SR)	Mobile	Multilingual VLN, longer paths	59.6%	Dense, time-aligned instructions in 3 languages.
ALOHA real-world OOD	Bimanual ALOHA	Out-of-distribution real-world	76.9%	Best honest measure of real-world generalization.
DOMINO (zero-shot)	Single-arm	Dynamic manipulation, zero-shot	26.6%	Frontier is still very far from solved.

§ 15Further reading

Primary sources.

If you build one thing after reading this, build a pretraining mixture. These are the papers and dataset pages to read first.

2026
Qwen-VLA: Unifying Vision-Language-Action Modeling
Qwen Team
The unified action-and-trajectory framework this guide is built around.
2024
Open X-Embodiment
DeepMind + 21 institutions
The schema that made cross-embodiment pretraining tractable.
2024
DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset
Khazatsky et al.
76k Franka demos collected across 13 institutions.
2024
Mobile ALOHA
Fu, Zhao, Finn
The wheeled-bimanual platform that proved 50 demos can be enough.
2025
EgoDex (Apple)
Hoque et al.
829 hours of Vision Pro dexterous egocentric data.
2023
Diffusion Policy
Chi et al.
The denoising-as-policy paper that reset the field.
2024–25
π₀ / π₀.7
Physical Intelligence
Flow-matching VLA recipe that most modern systems echo.
2025–26
DreamGen / DreamDojo / DreamZero
NVIDIA
Where video world models meet robot policies.
2025
RLinf
Yu et al.
The PPO framework Qwen-VLA uses for closed-loop RL on simulator success.
2018–20
R2R / RxR / VLN-CE
Anderson et al.; Ku et al.; Krantz et al.
The foundational VLN benchmark trio.

The datasets that teach robots to move, grasp, and find their way home.

RT-1 · Robotics Transformer

DROID · 76k Franka demos

Mobile ALOHA · Stanford

Ego4D · Meta AI

Egocentric-10K · Build AI

BridgeData V2

Open X-Embodiment Explorer

AgiBot World Colosseo

EPIC-KITCHENS-100

EgoExo4D

Matterport3D / HM3D viewer

Room-Across-Room (RxR)

LIBERO benchmark suite

RoboCasa kitchens