Field Notes on Embodied Data
A field guide · Qwen-VLA · 2026

The datasets that teach robots to move, grasp, and find their way home.

Modern robot policies are not really models. They are data products. This is a complete catalog of the datasets that go into them — split by what they actually contain (teleoperation, ego-video, synthetic rollouts, panoramic navigation) — paired with the specific learning techniques each kind of supervision unlocks. It is built around the data recipe behind Qwen-VLA (arXiv 2605.30280), one of the first unified models to train manipulation, mobile manipulation, and navigation in a single action-and-trajectory space.

5
Data families surveyed
40+
Datasets catalogued
13
Learning techniques
10k+
Hours behind a modern VLA
§ 00 Executive summary

If you only read one section.

Manipulation has gone cross-embodiment.

The frontier moved from “130k Google episodes” (RT-1, 2022) to a million+ trajectories pooled across 20+ robots (Open X-Embodiment, AgiBot World, RoboMIND v2). The bet: one policy across every arm beats a per-robot specialist.

Mobile manipulation finally has data.

Mobile ALOHA, AgiBot A2-D, Galaxea R1, AIRBOT MMK2, TienKung. Whole-body bimanual teleop with both joint and dexterous-hand action spaces is now collected at scale.

Egocentric video is the real bet.

The bottleneck is robot teleop. The escape valve is human first-person video: Ego4D, EPIC-KITCHENS, EgoDex (829 hours on Apple Vision Pro), EgoVerse, Xperience. Together they will dwarf all robot data by 2027.

Navigation collapsed into VLA.

R2R, RxR, VLN-CE used to need dedicated waypoint models. Unified VLAs now treat navigation as an 8-waypoint chunk prediction — same loss, same decoder.

The loss has converged.

For continuous control, the field has converged on flow matching with action chunking. RL with PPO/GAE is layered on top to close the gap between imitation and closed-loop success.

Synthetic is no longer a hack.

InternData-A1, GR00T-X-Embodiment-Sim, and in-house pipelines now contribute 5–10% of mixtures. Critically, vision-free synthetic data turns out to teach the action prior better than vision-conditioned synthetic data.

§ 01 A taxonomy

Five families of supervision.

Before listing datasets, fix the categories. A modern VLA is trained on a deliberate mixture of five data families. They differ in what signal they carry, not in what hardware shot them: a robot trajectory provides direct action labels, a YouTube clip does not. The mixture is the model.

[Figure: "Embodied data: what a VLA eats". Five families feed a unified action-and-trajectory space. Real teleop: RT-1, BridgeV2, DROID, RH20T, AgiBot, Galaxea. Simulation: LIBERO, Simpler, RoboCasa, RoboTwin, InternData-A1. Egocentric human: Ego4D, EPIC, EgoDex (Vision Pro), EgoVerse, Xperience. Navigation: R2R, RxR, VLN-CE, HM3D, ObjectNav, GNM / ViNT corpora. Auxiliary VL: AV (nuScenes, Waymo), grounding (RefCOCO), dense action captions.]
Fig. 01 — Five data families collapsing into one shared output space.
Real teleop

Trajectories captured by a human teleoperating a real robot. Direct, action-labelled, expensive (≈$50–$300 per hour of usable data). Defines the bar for physical realism.

Simulation

Trajectories produced by a planner or RL agent in a renderer. Cheap, infinitely re-rolled, but always carries a sim-to-real gap that needs randomization and domain-adaptation work.

Egocentric human

First-person video of a human doing things. No native action labels; recovered via inverse dynamics, hand tracking, or latent action models. The only signal that scales like the internet.

Navigation

Long-horizon panoramic video paired with instructions or goal categories. Carries 3-DoF planar trajectories and weak semantic supervision.

Auxiliary VL

Anything that keeps the backbone literate: VQA, spatial grounding, action captions, driving VQA, dense embodied descriptions. Small slice of the mixture, outsized effect on instruction following.
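
As a rough illustration of "the mixture is the model", the sketch below draws training samples from the five families according to fixed weights. The family names follow the taxonomy above; the weights are hypothetical placeholders chosen for illustration, not a published recipe.

```python
import random
from collections import Counter

# Hypothetical mixture weights (illustrative only, they sum to 1.0).
MIXTURE = {
    "real_teleop": 0.70,
    "simulation": 0.08,
    "egocentric_human": 0.08,
    "navigation": 0.06,
    "auxiliary_vl": 0.08,
}

def sample_families(n, mixture=MIXTURE, seed=0):
    """Draw n family labels with probability proportional to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# Count how often each family is drawn over 10,000 samples.
counts = Counter(sample_families(10_000))
```

In a real pipeline the label would select which dataset loader supplies the next batch element; changing the weights changes what the policy learns without touching the model.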

§ 02 Sample reel

What the data actually looks like.

Before the catalogue, the raw thing. Below: verified author-published teaser videos for the most-cited datasets in each family, a real cover sheet from Build AI's Egocentric-10K, and direct links to interactive sample browsers on every official project page. Open one and scrub through episodes before you trust the prose.

RT-1 · Robotics Transformer

Real manipulation

Official supplementary video from Google Robotics. Shows the policy executing 700+ tasks in kitchens and offices, with the corresponding teleop demonstrations beside each rollout.

Project page ↗

DROID · 76k Franka demos

Real manipulation

Author-narrated tour through the DROID corpus: 564 scenes, 86 tasks, three synchronized cameras, the same Franka Panda across 18 institutions.

Project page ↗

Mobile ALOHA · Stanford

Mobile manipulation

The video that recalibrated everyone's priors on whole-body teleop. Cooking shrimp, watering plants, riding elevators — all from ≈50 demos co-trained with static ALOHA data.

Project page ↗

Ego4D · Meta AI

Egocentric

Meta's official Ego4D launch reel: 3,670 hours of unscripted first-person video collected across 74 cities. The pretraining substrate behind almost every modern embodied vision encoder.

Project page ↗
Egocentric-10K · Build AI — official sample frames

Egocentric-10K · Build AI

Egocentric

Official cover sheet for the 10,000-hour factory egocentric corpus — the highest hand-visibility, highest active-manipulation density open ego dataset, captured on Build AI Gen 1 headsets at 1080p / 30 fps.

Hugging Face dataset card ↗
Sample sheet · open source ↗

BridgeData V2

Real manipulation

Browseable per-task sample grid — every one of the 60k WidowX trajectories has third-person + wrist-cam video plus a language label.

3rd-person RGB · Wrist cam · ΔEEF actions · Lang instruction
Open sample browser ↗
Sample sheet · open source ↗

Open X-Embodiment Explorer

Real manipulation

Interactive viewer across 22 embodiments. The fastest way to feel just how heterogeneous the cross-embodiment pool actually is.

22 embodiments · 60 datasets · Unified RLDS schema
Open sample browser ↗
Sample sheet · open source ↗

AgiBot World Colosseo

Mobile manipulation

Five fully-replicated home, retail and office environments captured with a 100-robot farm. Sample videos and per-task statistics on the OpenDriveLab page.

Bimanual humanoid · Dexterous hand · 100+ scenarios
Open sample browser ↗
Sample sheet · open source ↗

EPIC-KITCHENS-100

Egocentric

100 hours of densely-narrated first-person cooking. The interactive visualizer lets you scrub verb/noun annotations frame-by-frame.

First-person RGB · Verb/noun · 90k action segments
Open sample browser ↗
Sample sheet · open source ↗

EgoExo4D

Egocentric

Same activity captured from one egocentric and four exocentric cameras simultaneously, paired with expert commentary. The bridge dataset for ego↔exo viewpoint transfer.

Ego + 4 exo · Gaze + IMU · Expert narration
Open sample browser ↗
Sample sheet · open source ↗

Matterport3D / HM3D viewer

Navigation

Walk through the 1,000 building-scale scans that back R2R, RxR, ObjectNav and HM3D-Semantics in your browser.

Panoramic RGB-D · Mesh · Room semantics
Open sample browser ↗
Sample sheet · open source ↗

Room-Across-Room (RxR)

Navigation

Multilingual VLN: paths annotated with time-aligned spoken instructions in English, Hindi, and Telugu. The official site lets you replay any episode.

Pano RGB · EN / HI / TE · Pose timing
Open sample browser ↗
Sample sheet · open source ↗

LIBERO benchmark suite

Simulation

Standardized lifelong-manipulation benchmark — Spatial / Object / Goal / Long suites with downloadable rollouts.

Franka sim · 130 tasks · Wrist + 3rd-person
Open sample browser ↗
Sample sheet · open source ↗

RoboCasa kitchens

Simulation

Procedurally generated photorealistic kitchens for the GR-1 humanoid. Sample task videos live on the project page.

Procedural scenes · 100+ tasks · Bimanual humanoid
Open sample browser ↗
§ 03 Scale, side-by-side

Where the field is actually investing collection effort.

A log-scale comparison of the headline volumes for every dataset surveyed in this guide. Units are not interchangeable (a teleop hour and a YouTube hour differ in cost by three orders of magnitude and carry different supervision), but the chart maps where the field is currently putting its scaling pressure.

Order-of-magnitude scale (log10 scale; units differ per family)
Families: Manipulation · Mobile manip · Egocentric · Navigation · Simulation · Auxiliary

LAION-5B: 5.0 B pairs
AgiBot World: 1 M traj
Open X-Embodiment: 1 M+ traj
RDT-1B: ≈1 M ep
RT-1: 130 k ep
RxR: 126 k instr
RH20T: 110 k clips
RoboMIND v2: 107 k traj
DROID: 76 k demos
BridgeData V2: 60 k traj
BEHAVIOR-1K: 1k tasks · 50 scenes
BC-Z: 26 k traj
R2R: 21,567 instr
Egocentric-10K: 10,000 h
Ego4D: 3,670 h
EgoExo4D: 1,286 h
HM3D: 1,000 scans
EgoDex: 829 h
Assembly101: 513 h
Galaxea Open World: ≈500 h
Mobile ALOHA: ≈350 demos
HoloAssist: 166 h
LIBERO: 130 tasks
EPIC-KITCHENS-100: 100 h
GNM / ViNT corpora: 100+ h
RoboCasa: 100+ tasks

Hours and trajectories are not the same currency — ten teleop hours can encode the same skill repertoire as a thousand passive video hours. Read the chart as a map of where the field is investing collection effort, not as a ranking of per-sample value.

§ 04 Modality coverage

What signal each family natively ships.

A policy is only as multimodal as its data. This matrix shows, family by family, which sensor and label streams are typically present out-of-the-box. Read it as a checklist when designing a mixture: a gap here is a gap in your final policy unless another family fills it.

[Matrix: Modality coverage by family (what each family natively ships). Legend: core supervision · commonly shipped · occasional. Columns: RGB video, Multi-view, Wrist cam, Depth / RGB-D, LiDAR / mesh, Joint angles, EEF pose / Δ, Gripper / DH, Force-torque, Audio, Hand pose, Eye gaze / IMU, Panoramic RGB, Language, Waypoints / traj. Rows: Real teleop, Mobile teleop, Egocentric, Navigation, Simulation, Auxiliary VL.]
§ 05 Manipulation · real robots

Teleoperation is still the gold standard.

Robot manipulation trajectories form ≈74% of the Qwen-VLA pretraining mixture and a similar share of every serious VLA. The supervision is the cleanest you can buy: synchronized multi-view RGB, a language instruction, and a chunk of future actions in the robot's native control convention. The price is the catch: a single high-quality teleop hour costs more than a thousand hours of YouTube.

RT-1

Google · 2022

Real manipulation

First demonstration that a single Transformer policy could absorb hundreds of skills across kitchens and offices when fed enough teleoperated data.

Scale
130k episodes / 17 months
Embodiment
Mobile manipulator (Everyday Robots)
RGB · Lang · ΔEEF · Gripper

BridgeData V2

UC Berkeley · 2024

Real manipulation

Open, low-cost teleop corpus designed for cross-task generalization on a cheap arm. The de-facto benchmark for low-budget VLA research.

Scale
60k trajectories · 13 skills · 24 environments
Embodiment
WidowX 250 (single-arm)
RGB · Wrist cam · Lang · ΔEEF

DROID

Stanford / 18 labs · 2024

Real manipulation

Distributed-collection consortium dataset capturing the same robot across 18 institutions, in dorms, kitchens, and labs. Designed explicitly for scene diversity.

Scale
76k demonstrations · 564 scenes · 86 tasks
Embodiment
Franka Panda (single-arm)
3× RGB · Depth · Lang · ΔEEF / Joint

RH20T

SJTU · 2023

Real manipulation

One-shot generalist data with synchronized multi-view RGB-D, contact, audio, and language — built so policies can imitate from a single human-shown demo.

Scale
≈110k clips · 147 tasks · 7 robot types
Embodiment
Multi-arm
RGB-D · Audio · Force-torque · Lang

BC-Z

Google / Stanford · 2022

Real manipulation

Showed that zero-shot task generalization is possible when teleoperation data is paired with language and human video task descriptions.

Scale
≈26k trajectories · 100 tasks
Embodiment
7-DoF arm
RGB · Language or video task spec

RoboMIND v1/v2

Beijing Academy of AI · 2025

Real manipulation

Multi-embodiment benchmark dataset emphasizing standardized task taxonomies and failure-case annotations across heterogeneous platforms.

Scale
v2: 107k+ trajectories · 479 tasks · 96 object classes
Embodiment
Franka, AgileX, Tien Kung, UR-5e
Multi-view RGB · Depth · Lang · Joint · EEF

AgiBot World

AgiBot · 2025

Real manipulation

Largest open bimanual humanoid corpus to date — five fully replicated home/retail environments collected with a 100-robot farm.

Scale
1M+ trajectories · 217 tasks · 100+ scenarios
Embodiment
AgiBot A2-D (bimanual humanoid)
RGB · Depth · Joint · Dexterous hand · Lang

Galaxea Open World

Galaxea AI · 2025

Real manipulation

Bimanual mobile teleoperation in homes and offices, distributed in a unified observation-action schema for cross-embodiment training.

Scale
≈500 hours
Embodiment
Galaxea R1 (bimanual mobile)
RGB · Joint · Lang · Gripper

RDT-1B

Tsinghua · 2025

Real manipulation

Curated cross-embodiment bimanual aggregate behind the RDT-1B diffusion policy — emphasizes physically aligned bimanual coordination.

Scale
≈1M episodes (aggregated)
Embodiment
Bimanual
RGB · Lang · Joint

RoboCOIN

Multi-institution · 2025

Real manipulation

An open consortium release pooling teleop and simulation across labs into a single, license-clean training pool.

Scale
Multi-platform composite
Embodiment
Heterogeneous
RGB · Lang · Action

RobotSet

CMU / Berkeley · 2023

Real manipulation

Early generalist dataset designed for skill transfer studies across morphologies; an ancestor of Open X-Embodiment style pooling.

Scale
Aggregated
Embodiment
Multiple
RGB · Lang

Open X-Embodiment

DeepMind + 21 institutions · 2024

Real manipulation

The unifying schema that made cross-embodiment training tractable. Most modern VLA pretraining recipes draw a large share of their tokens from this pool.

Scale
1M+ trajectories · 22 embodiments · 60 datasets
Embodiment
Cross-embodiment
RGB · Lang · Various action spaces
§ 06 Mobile manipulation

The whole body is the action space.

Mobile manipulation data is fundamentally different from tabletop teleop because the base is now part of the action. The policy has to coordinate a 3-DoF chassis with two arms (and often two dexterous hands) over minutes, not seconds. Datasets here are smaller, harder to collect, and the most strategically important — they are what general-purpose home robots will be trained on.

Mobile ALOHA

Stanford · 2024

Mobile manipulation

Low-cost wheeled bimanual teleop platform that proved a few dozen demos plus co-training with static ALOHA data unlock cooking, laundry, and elevator-riding.

Scale
≈50 demos × 7 long-horizon tasks
Embodiment
Wheeled bimanual
RGB · Joint · Whole-body teleop

AgiBot A2-D Trajectories

AgiBot · 2025

Mobile manipulation

Whole-body bimanual data with a wheeled base; supports both absolute joint and dexterous-hand action spaces.

Scale
Subset of AgiBot World
Embodiment
Bimanual humanoid + chassis
RGB · Joint · Dexterous hand

Galaxea R1

Galaxea AI · 2025

Mobile manipulation

A commercial bimanual mobile platform whose dataset prioritizes home and storefront tasks with reproducible layouts.

Scale
Hundreds of hours
Embodiment
Bimanual mobile
RGB · Lang · Joint

AIRBOT MMK2

Discover Robotics · 2025

Mobile manipulation

Cheap mobile bimanual with dexterous hands — a popular choice for academic mobile-manipulation studies.

Scale
Open release
Embodiment
Bimanual mobile + dexterous hands
RGB · Joint · DH

TienKung

Beijing X-Humanoid · 2025

Mobile manipulation

Full-body humanoid trajectories with both gripper and dexterous-hand action spaces, useful for embodiment-aware co-training.

Scale
Open subset
Embodiment
Humanoid (bipedal)
RGB · Joint · DH

BEHAVIOR-1K

Stanford · 2024

Mobile manipulation

Simulation-side mobile manipulation: a thousand annotated household activities, useful when paired with sim-to-real navigation pretraining.

Scale
1,000 activities · 50 scenes
Embodiment
Mobile manipulator (sim)
RGB-D · Lang · Whole-body
§ 07 Egocentric human data

The data source that scales like the web.

A single Vision Pro session can collect more dexterous hours in a weekend than an entire teleop lab does in a year. Egocentric human video carries no native action labels, so the field has invested heavily in pipelines that recover frame-wise 3D hand and finger pose — the input to inverse-dynamics models or latent action models that produce pseudo-action supervision. Recent VLAs allocate 6–10% of their mixture here, and the number is climbing.
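
The simplest form of pseudo-action recovery is a finite difference over a tracked hand pose. The sketch below assumes a wrist trajectory already expressed in a fixed world frame; the function name and the toy track are illustrative, and a learned inverse-dynamics model would replace the differencing step in practice.

```python
import numpy as np

def pseudo_actions_from_wrist(positions, fps):
    """Finite-difference pseudo-actions from a tracked 3D wrist trajectory.

    positions: (T, 3) wrist positions in metres (assumed to be in a fixed world frame).
    fps:       capture rate in Hz.
    Returns (T-1, 3) per-step displacements and (T-1, 3) velocities in m/s.
    """
    positions = np.asarray(positions, dtype=np.float64)
    deltas = np.diff(positions, axis=0)   # x_{t+1} - x_t
    velocities = deltas * fps             # displacement per second
    return deltas, velocities

# Toy track: the wrist moves 1 cm per frame along x at 30 fps.
track = np.stack([np.arange(5) * 0.01, np.zeros(5), np.zeros(5)], axis=1)
deltas, velocities = pseudo_actions_from_wrist(track, fps=30.0)
```

The displacements play the role of Δ-EEF labels; mapping them onto a specific robot's action space is a separate retargeting problem that this sketch does not address.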

Ego4D

FAIR + 14 universities · 2022

Egocentric human

The foundational egocentric-video pretraining set. Used to seed almost every modern visual encoder for embodiment (R3M, MVP, VC-1).

Scale
3,670 hours · 855 participants · 74 cities
Egocentric RGB · Audio · 3D scan · Narration

EPIC-KITCHENS-100

Bristol / Toronto / Catania · 2022

Egocentric human

Dense first-person kitchen activity with verb/noun grounding; the canonical benchmark for fine-grained manipulation understanding.

Scale
100 hours · 700 sessions · 90k action segments
Egocentric RGB · Narration · Action verbs

EgoDex

Apple · 2026

Egocentric human

Largest dexterous egocentric set captured on Vision Pro — paired 3D hand-and-finger tracking finally makes human video usable as direct action supervision.

Scale
829 hours · 194 tabletop tasks
Egocentric RGB · Apple Vision Pro hand tracking · Finger pose

EgoVerse

Open consortium · 2026

Egocentric human

A collaborative ego platform built to ship in one unified format so labs can pool data without per-dataset wrappers.

Scale
1,300+ hours · 1,965 tasks · 240 scenes
Egocentric RGB · Hand pose · Standardized schema

Xperience

Ropedia · 2026

Egocentric human

Synchronized first-person multimodal recordings with hierarchical instruction annotations — designed as the next-generation Ego4D successor.

Scale
Large-scale multimodal
RGB · Depth · Hand & body mocap · Hierarchical lang

Egocentric-10K

Build AI · 2025

Egocentric human

The largest egocentric corpus to date and the first collected exclusively inside real factories. State-of-the-art in hand visibility and active-manipulation density — built specifically as pretraining fuel for industrial VLA and dexterous policies.

Scale
10,000 hours · 1.08B frames · 192.9k clips · 16.4 TB
Embodiment
Head-mounted (Build AI Gen 1)
1080p RGB @ 30 fps · 128° FoV · Camera intrinsics

Egocentric-10K-Evaluation

Build AI · 2025

Egocentric human

Held-out evaluation slice of Egocentric-10K with dense annotations — used to score hand detection, contact, and active-object grounding in factory settings.

Scale
30,000 annotated frames
RGB · Hand / contact annotations

HoloAssist

Microsoft · 2023

Egocentric human

Mixed-reality egocentric capture of real assembly/repair tasks with synchronized instructor speech — the canonical procedural-assistance dataset for embodied LLMs.

Scale
166 hours · 222 instructor–performer pairs · 20 object procedures
HoloLens 2 RGB · Depth · IMU · Eye gaze · Hand pose · Dialogue

Assembly101

Meta / TUM / Singapore · 2022

Egocentric human

Multi-view (ego + exo) procedural assembly with fine-grained mistake annotations; widely used for action segmentation and error detection.

Scale
513 hours · 4,321 videos · 101 toy assemblies
8× fixed RGB · 4× egocentric RGB · 3D hand pose

EgoExo4D

FAIR + 15 universities · 2024

Egocentric human

Paired egocentric + exocentric capture of skilled activities (cooking, sports, repair) with expert narration — the bridge dataset for ego↔exo viewpoint transfer.

Scale
1,286 hours · 740 participants · 13 cities
Ego RGB · 4–5 exo RGB · Gaze · IMU · Expert commentary
§ 09 Simulation & synthetic

The infinite, imperfect oracle.

Simulation is no longer a backup. It now serves three distinct roles: a benchmark (LIBERO, Simpler, RoboCasa, RoboTwin), a source of pretraining trajectories (InternData-A1, GR00T-X-Embodiment-Sim), and the rollout environment for reinforcement learning on top of an imitation-pretrained policy. The Qwen-VLA T2A ablation contains a striking result: ≈20% synthetic mixed with 80% real, with vision suppressed, beats every other ratio by ~10 points downstream.

LIBERO

UT Austin · 2023

Simulated manipulation

Lifelong learning benchmark with four axes: Spatial, Object, Goal, Long. The standard simulation report card for VLAs.

Scale
4 suites · 130 long-horizon tasks
Embodiment
Single Franka
RGB · Wrist cam · Lang · ΔEEF

SimplerEnv

Stanford / Google · 2024

Simulated manipulation

A real-to-sim suite engineered so simulator success correlates with real-robot success — used widely as a cheap proxy for hardware evaluation.

Scale
Aligned to real WidowX / Google Robot
Embodiment
WidowX, Google Robot
RGB · Lang · ΔEEF

RoboCasa

UT Austin / NVIDIA · 2024

Simulated manipulation

Procedurally generated kitchens with photorealistic assets, designed for everyday household manipulation evaluation.

Scale
100+ kitchen tasks · 120 scenes
Embodiment
Mobile + bimanual humanoid (GR-1)
RGB · Lang · Joint

RoboTwin 2.0

Open community · 2025

Simulated manipulation

A dual-arm benchmark with a careful difficulty split that exposes failure modes in long-horizon bimanual coordination.

Scale
50 bimanual tasks · Easy & Hard tiers
Embodiment
Dual-arm
RGB · Lang · Joint

InternData-A1

Shanghai AI Lab · 2025

Simulated manipulation

Simulation trajectories generated by motion planners in diverse virtual scenes — used to widen long-tail object and layout coverage.

Scale
Large-scale planner trajectories
Embodiment
Multiple
RGB · Lang · Action

GR00T-X-Embodiment-Sim

NVIDIA · 2025

Simulated manipulation

Synthetic counterpart to GR00T's training stack: procedurally varied scenes rendered across many embodiments to seed a universal policy.

Scale
Cross-embodiment synthetic
Embodiment
Multiple
RGB · Joint · Lang

DOMINO

Open community · 2026

Simulated manipulation

Zero-shot evaluation of dynamic skills (pouring, sliding, throwing) where contact dynamics dominate — the hardest known generalization probe.

Scale
Dynamic manipulation suite
Embodiment
Single-arm
RGB · Lang
§ 10 Auxiliary vision-language

What keeps the backbone literate.

Auxiliary VL data — driving VQA, 2D spatial grounding, fine-grained action captions, general image-text — is a small fraction of a typical mixture (≈8.5% in Qwen-VLA) but does outsized work: it stops the action-loss gradient from quietly destroying the VLM's language understanding, and it is the only place the model learns the dense vocabulary needed for fine-grained instructions like “rotate clockwise, then slide left.”

nuScenes / Waymo Open / Argoverse 2

Various · 2019–23

Auxiliary

Autonomous-driving VQA and motion-forecasting corpora feed trajectory-centric supervision into general embodied models.

Scale
Thousands of driving scenes
RGB · LiDAR · HD maps · Trajectories

RefCOCO / RefCOCOg / Visual Genome

Various · 2014–17

Auxiliary

2D spatial-grounding data that keeps a VLA backbone literate in “the red mug on the left.”

Scale
Millions of region-phrase pairs
RGB · Bounding boxes · Lang

LAION / DataComp / OBELICS

LAION / community · 2022–24

Auxiliary

The general vision-language substrate used during continual pretraining so the action model does not forget how to read the world.

Scale
Billions of image-text pairs
RGB · Lang · Interleaved

Fine-grained Embodied Captions

Curated · 2025–26

Auxiliary

Dense action-level captions (“rotate clockwise, then slide left”) that disambiguate cases where the same coarse label would otherwise cover two different motions.

Scale
≈0.2% of mixtures — small but critical
RGB clip · Dense action description
§ 11 Learning techniques

One table to match data to method.

Every dataset above implies a learning technique. Real teleop wants behavior cloning or flow matching. Ego video wants inverse dynamics or a latent action model. Simulation rollouts want PPO. Below: the techniques that matter in 2026, what they consume, and what each is uniquely good at.

Technique · Data it consumes · Loss · Why it matters

Behavior Cloning (BC)
Supervised mimicry
Best for: Plenty of clean teleop, single embodiment
Data it consumes: Real teleop trajectories (RT-1, BridgeData V2, DROID)
Loss: MSE / cross-entropy on actions
Why it matters: The starting point for almost every manipulation policy. Cheap, but compounds errors out of distribution. Used as the warm-up for everything below.

Action Chunking + Transformer (ACT)
Predict an action chunk, not one step
Best for: Bimanual & contact-rich tasks (ALOHA, Mobile ALOHA)
Data it consumes: ALOHA & RoboMIND-style synchronized bimanual
Loss: L1 over chunked actions + VAE prior
Why it matters: Predicts H actions at once and re-plans every k steps. The single most important architectural trick for high-frequency bimanual policies.

Diffusion Policy
Denoise the next action chunk
Best for: Multimodal action distributions, dexterous tasks
Data it consumes: Mixed teleop with multiple valid solutions
Loss: DDPM / DDIM noise prediction
Why it matters: Treats the chunk of future actions as an image to denoise. Captures the multi-modal nature of human teleoperation that L2 losses average away.

Flow Matching for Actions
Continuous-time denoising decoder
Best for: VLA action experts (π₀, Qwen-VLA, GR00T)
Data it consumes: Cross-embodiment continuous control
Loss: Velocity-field regression with Beta / Sigmoid-Normal timestep priors
Why it matters: The new default. Cheaper than diffusion at inference, and lets a vision-language backbone attach a small DiT action expert that consumes language tokens directly.

Vision-Language-Action Pretraining
VLM + action head, one model
Best for: Cross-task, cross-embodiment generalization
Data it consumes: Open X-Embodiment + sim + ego video
Loss: Next-token LM loss + flow-matching action loss
Why it matters: RT-2, OpenVLA, π₀, Qwen-VLA. The big idea: keep a pretrained VLM literate, tack on an action expert, supervise with both losses simultaneously.

Text-to-Action (T2A) Pretraining
Learn actions from language alone, no images
Best for: Establishing an action prior before visual grounding
Data it consumes: Synthetic + real trajectories with images dropped
Loss: Flow matching with Sigmoid-Normal τ-schedule
Why it matters: Qwen-VLA shows that pretraining the action decoder on language-conditioned trajectories, with vision suppressed, beats no-T2A by +10.2 pp downstream. Forces the decoder to ground in language, not visual shortcuts.

Embodiment-Aware Prompt Conditioning
Tell the model which robot it is
Best for: Multi-robot, multi-action-space training
Data it consumes: Any cross-embodiment mixture
Why it matters: Prepend a textual description of the platform, arm count, control frequency, and action space. Removes the need for embodiment-specific heads and enables zero-shot transfer to new robots.

Per-Dataset Quantile Normalization
Scale-free action targets
Best for: Pooling heterogeneous teleop sources
Data it consumes: Mixed real-robot trajectories
Why it matters: Each dataset's action dimensions are mapped to [-1, 1] using the 1st/99th quantiles per source. Removes scale differences across embodiments without losing relative motion structure.

Inverse Dynamics + Pseudo-Actions
Recover actions from video
Best for: Egocentric human video, action-less footage
Data it consumes: Ego4D, EPIC-KITCHENS, YouTube
Loss: Finite differences on proprioception or learned IDM
Why it matters: Most ego data ships without explicit actions. Frame-wise hand pose plus an inverse-dynamics network produces pseudo-action labels that train policies as if a human were the teleoperator.

Latent Action Models (LAMs)
Discover an action space from video
Best for: Web video at scale
Data it consumes: Unlabeled internet video (Genie-style)
Loss: Reconstruction of next frame from latent
Why it matters: Compress “what changed between two frames” into a small vector, then learn a policy that emits those latents. Used by Genie 2/3 and several robot world-model pipelines to unify human video with robot actions.

World-Model Imitation
Train policies inside a learned simulator
Best for: Sample-efficient RL, policy evaluation
Data it consumes: Video + action pairs (DreamGen, UniSim, DreamerV3)
Loss: RL or imitation inside imagined rollouts
Why it matters: Pretrain on human video, fine-tune on a small robot dataset, then practice in imagined rollouts. The current frontier for closing the data gap between teleop and reality.

RL with PPO / GAE on Sparse Rewards
Optimize for closed-loop success
Best for: Pushing SFT checkpoints past imitation ceilings
Data it consumes: On-policy rollouts in simulation
Loss: PPO clipped surrogate + value head
Why it matters: Likelihood-based SFT teaches the policy to imitate; RL teaches it to succeed. Qwen-VLA, π₀-RL, and HIL-SERL all use PPO/GAE on simulator success signals.

Vision-and-Language Navigation Imitation
Predict waypoints from instruction + history
Best for: R2R, RxR, VLN-CE
Data it consumes: Pano video + instruction transcripts
Loss: Cross-entropy on discrete actions or flow matching on waypoints
Why it matters: Modern unified models treat VLN as just another action-and-trajectory prediction problem: 8 future waypoints per chunk, supervised the same way as manipulation.
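
To make the "PPO clipped surrogate" and "GAE" entries above concrete, here is a minimal NumPy sketch of both quantities for a single finished episode with a sparse success reward. It is a textbook formulation under simplifying assumptions (one episode, terminal value of zero, no value-head loss or entropy bonus), not the training code of any model named above.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one terminated episode.

    rewards: (T,) rewards; values: (T,) value estimates V(s_t).
    The value after the final step is taken to be 0.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.append(values[1:], 0.0)
    deltas = rewards + gamma * next_values - values   # TD residuals
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running   # discounted sum of residuals
        adv[t] = running
    return adv

def ppo_clip_loss(logp_new, logp_old, adv, clip=0.2):
    """Negative PPO clipped surrogate objective, averaged over samples."""
    adv = np.asarray(adv, dtype=np.float64)
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return -np.mean(np.minimum(unclipped, clipped))

# Sparse reward: the success signal arrives only on the last step.
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
loss = ppo_clip_loss(np.zeros(3), np.zeros(3), adv)
```

With an untrained value function and no discounting, every step of a successful episode receives the same advantage, which is why a learned value head matters for credit assignment on long horizons.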
§ 12 Ground-truth collection

From sensor stream to gradient step.

A dataset is the output of a pipeline. For each data family below: how raw signal is actually captured, the hardware behind it, the exact tensor schema it is stored in, and the loss the policy computes against it. Read this section if you intend to build a collection rig, an annotation pipeline, or a new VLA loss.

30–1000 Hz
Sampling rate
< 5 ms
Clock-drift bar
65–85%
Episodes kept after QA
16–32
Action chunk length H
Real teleop · manipulation

Human-in-the-loop joint and end-effector capture

A trained operator drives the robot through the task while every sensor stream is logged at a fixed rate. The labels are not annotated after the fact — they are the operator's commands.

Collection pipeline
  1. Scene reset: A scripted reset places objects within calibrated bounds (tracked via fiducials or an overhead camera). Every episode must be replayable; randomized initial poses are recorded.
  2. Operator interface: Operator wears VR (Quest, Vision Pro) or holds a leader arm (ALOHA, GELLO). Leader joints stream at 50–1000 Hz into an inverse-kinematics or direct-joint mapper.
  3. Synchronized recording: A central clock (PTP or ROS 2 message_filters) timestamps RGB(-D), wrist cams, joint encoders, gripper width, F/T sensors, and the operator's command. Drift < 5 ms is the standard bar.
  4. Action logging: Two streams are logged: target (commanded) and achieved (measured). Policies train on commanded actions; achieved is for diagnostics and inverse-dynamics fallbacks.
  5. Language pairing: An instruction is spoken or typed once per episode (or per sub-segment). Whisper / hand-transcription produces the final string; templated re-phrasings (10–30×) are generated by an LLM for robustness.
  6. QA & curation: Episodes are auto-filtered on success (force spikes, gripper closure timing, end-effector pose vs. target). A second pass scores instruction faithfulness; ≈15–35% of raw episodes are discarded.
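
The synchronized-recording step can be checked offline with a nearest-timestamp match. The sketch below pairs each camera frame with the closest encoder sample and flags pairs whose offset exceeds the 5 ms bar; the function name, the toy timestamps, and the choice of the camera as reference clock are assumptions for illustration.

```python
import numpy as np

def match_nearest(ref_ts, stream_ts, max_drift_s=0.005):
    """For each reference timestamp, find the nearest sample in a sorted stream.

    stream_ts must be sorted and contain at least two samples.
    Returns (indices, ok); ok[i] is False when the nearest sample is
    further than max_drift_s from the reference time.
    """
    ref_ts = np.asarray(ref_ts, dtype=np.float64)
    stream_ts = np.asarray(stream_ts, dtype=np.float64)
    right = np.searchsorted(stream_ts, ref_ts)        # first index with ts >= ref
    right = np.clip(right, 1, len(stream_ts) - 1)
    left = right - 1
    pick_left = (ref_ts - stream_ts[left]) <= (stream_ts[right] - ref_ts)
    idx = np.where(pick_left, left, right)
    ok = np.abs(stream_ts[idx] - ref_ts) <= max_drift_s
    return idx, ok

camera_ts = np.array([0.000, 0.033, 0.066, 0.100])    # roughly 30 Hz frames
encoder_ts = np.arange(45) * 0.002                    # 500 Hz encoders, 0 to 0.088 s
idx, ok = match_nearest(camera_ts, encoder_ts)
```

The last camera frame has no encoder sample within 5 ms (the encoder log ends at 0.088 s), so it is flagged; an episode with many such flags would be a candidate for rejection in the QA pass.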
Hardware

Arm: Franka Panda, UR5e, WidowX 250, ARX-5, AgileX Cobot Magic. Cameras: 2–4 × RealSense D435/D455, ZED 2i, or Logitech BRIO at 30 Hz, plus 1–2 wrist cams. Teleop: leader arm (ALOHA), VR controllers (Quest 3, Vision Pro), or 3D SpaceMouse. Compute: Jetson Orin or a tethered workstation (RTX 4090) running the ROS / LCM bus. Cost envelope: $8k–$50k per station; $50–$300 per hour of usable data after QA.

Data representation

Each step t stores { o_t, a_t, ℓ } where o_t = (I_t^cam, q_t, g_t, F_t) bundles RGB(-D) tensors, joint positions, gripper width, and (optionally) force. Actions are stored in the robot's native space — Δ-EEF in SE(3), absolute joint, or joint velocity — never silently converted. Per-dataset 1st/99th-percentile quantile normalization maps each action dimension to [-1, 1].
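
A minimal sketch of the per-dataset quantile normalization described above. The statistics are fitted once per source and stored alongside it; clipping values outside the 1st/99th percentiles to [-1, 1] is an assumption of this sketch, since the text does not say how outliers are handled.

```python
import numpy as np

def fit_quantile_stats(actions, lo=1.0, hi=99.0):
    """Per-dimension 1st/99th percentiles for one dataset; actions is (N, D)."""
    q_lo = np.percentile(actions, lo, axis=0)
    q_hi = np.percentile(actions, hi, axis=0)
    return q_lo, q_hi

def normalize(actions, q_lo, q_hi, eps=1e-8):
    """Map [q_lo, q_hi] linearly onto [-1, 1]; outliers are clipped (assumption)."""
    scaled = 2.0 * (actions - q_lo) / np.maximum(q_hi - q_lo, eps) - 1.0
    return np.clip(scaled, -1.0, 1.0)

def denormalize(normed, q_lo, q_hi):
    """Invert normalize() for values that were not clipped."""
    return (normed + 1.0) / 2.0 * (q_hi - q_lo) + q_lo

# Toy dataset: one dimension in centimetre-scale deltas, one in large joint units.
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.0, 5.0], scale=[0.01, 2.0], size=(10_000, 2))
q_lo, q_hi = fit_quantile_stats(raw)
normed = normalize(raw, q_lo, q_hi)
```

Both dimensions end up on the same scale even though their raw ranges differ by two orders of magnitude, which is the property that lets heterogeneous sources share one action decoder.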

Loss · Flow matching with action chunking
How the gradient is computed
L = E_{τ, ε}  ‖ v_θ( a_τ , o_t , ℓ , τ ) − ( a_1 − a_0 ) ‖²

The policy predicts a chunk A_t = (a_t, …, a_{t+H-1}) of H=16–32 future actions. Training forms the interpolant a_τ = (1−τ)a_0 + τa_1 between a Gaussian noise sample a_0 = ε and the ground-truth chunk a_1, with τ ∼ Sigmoid-Normal(μ=−0.4, σ=1); the network regresses the velocity field a_1 − a_0. At rollout, 5–10 Euler steps from τ=0→1 reconstruct the chunk; the first k=4–8 actions are executed before re-planning. Plain BC simplifies this to L = ‖ π_θ(o,ℓ) − a* ‖²; ACT replaces it with chunked L1 plus a VAE prior; Diffusion Policy uses the equivalent DDPM noise-prediction loss.
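
The loss and rollout procedure above can be sketched in NumPy as follows. Here `v_theta` is a stand-in callable for the action expert, and the "oracle" velocity field is a toy that is exact only when every target chunk is identical; both are assumptions made so the example runs without a neural network.

```python
import numpy as np

def sample_tau(rng, n, mu=-0.4, sigma=1.0):
    """Sigmoid-Normal timestep prior: tau = sigmoid(mu + sigma * z)."""
    z = rng.standard_normal(n)
    return 1.0 / (1.0 + np.exp(-(mu + sigma * z)))

def flow_matching_loss(v_theta, chunks, obs, rng):
    """Monte-Carlo estimate of E || v_theta(a_tau, obs, tau) - (a_1 - a_0) ||^2.

    chunks: (B, H, D) ground-truth action chunks (a_1).
    """
    a1 = chunks
    a0 = rng.standard_normal(a1.shape)                 # noise sample
    tau = sample_tau(rng, a1.shape[0])[:, None, None]
    a_tau = (1.0 - tau) * a0 + tau * a1                # linear interpolant
    err = v_theta(a_tau, obs, tau) - (a1 - a0)
    return np.mean(np.sum(err ** 2, axis=(1, 2)))

def euler_sample(v_theta, obs, shape, rng, steps=10):
    """Integrate the velocity field from tau=0 (noise) to tau=1 (actions)."""
    a = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        tau = np.full((shape[0], 1, 1), i * dt)
        a = a + dt * v_theta(a, obs, tau)
    return a

# Toy setting: every demonstration is the same chunk, so the exact
# velocity field is (target - a) / (1 - tau).
rng = np.random.default_rng(0)
target_chunk = np.full((1, 16, 7), 0.5)                # H=16 steps, 7-DoF actions
oracle = lambda a, obs, tau: (target_chunk - a) / (1.0 - tau)
chunk = euler_sample(oracle, obs=None, shape=(1, 16, 7), rng=rng, steps=10)
executed = chunk[:, :4]                                # run k=4 actions, then re-plan
```

In a real policy `v_theta` is a network conditioned on images and language, and the loop around `euler_sample` re-plans after the executed prefix rather than running the whole chunk open-loop.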

Mobile manipulation

Whole-body teleop with base + arms in one frame

Same idea as tabletop teleop, but the chassis is now part of the action. The harder problem is keeping the base, arms, and head referenced in a single coordinate frame as the robot moves through a building.

Collection pipeline
  1. 01 · Body-frame calibration: Extrinsics between base, torso, arms, head, and external cameras are calibrated once per session against an AprilTag rig. Base odometry drift is bounded by SLAM or motion capture.
  2. 02 · Whole-body teleop: The operator is tethered to the base and backdrives it while holding the leader arms (Mobile ALOHA), or wears a haptic exo-suit; base velocity, torso pitch, and two-arm joints are streamed together.
  3. 03 · Multi-clock fusion: Base odometry (≈50 Hz), arm encoders (≈500 Hz), and head camera (30 Hz) are PTP-synced and resampled to a 50 Hz canonical rate before storage.
  4. 04 · Long-horizon segmentation: Episodes are minutes long. They are split at language sub-instruction boundaries (“go to the fridge”, “open the door”, “grab the milk”) so the policy sees both atomic and composite chunks.
  5. 05 · Co-training rebalance: Static-arm episodes from the same robot are mixed in 1:1–4:1 with mobile episodes; without this, the policy collapses to the more frequent static behavior (Mobile ALOHA finding).
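The multi-clock fusion step can be sketched as follows. This assumes every stream already carries timestamps on the shared (PTP-synced) clock, interpolates scalar signals linearly onto the 50 Hz grid, and picks the nearest camera frame instead of interpolating images. The function names are hypothetical.

```python
from bisect import bisect_right


def resample_linear(timestamps, values, rate_hz, t_start, t_end):
    """Linearly interpolate a scalar stream onto a uniform grid at rate_hz.

    timestamps must be strictly increasing and on the shared clock.
    Grid points outside the recorded range hold the nearest sample.
    """
    n = int(round((t_end - t_start) * rate_hz)) + 1
    grid = [t_start + i / rate_hz for i in range(n)]
    out = []
    for t in grid:
        j = bisect_right(timestamps, t)
        if j == 0:
            out.append(values[0])
        elif j == len(timestamps):
            out.append(values[-1])
        else:
            t0, t1 = timestamps[j - 1], timestamps[j]
            w = (t - t0) / (t1 - t0)
            out.append((1.0 - w) * values[j - 1] + w * values[j])
    return grid, out


def nearest_frame_index(frame_times, t):
    """Camera frames are not interpolated; return the index of the closest frame."""
    j = bisect_right(frame_times, t)
    if j == 0:
        return 0
    if j == len(frame_times):
        return len(frame_times) - 1
    return j if frame_times[j] - t < t - frame_times[j - 1] else j - 1
```

A 500 Hz encoder stream would normally be low-pass filtered before being reduced to 50 Hz; that step is left out here to keep the sketch short.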
Hardware

Platforms: Mobile ALOHA, AgiBot A2-D, Galaxea R1, AIRBOT MMK2, Tien Kung Pro, Unitree H1/G1. Sensors: head RGB-D (ZED 2i or RealSense D455), 2 × wrist cams, base IMU, wheel encoders or leg joint encoders, optional 2D/3D LiDAR. Compute: on-board Jetson Orin + workstation for recording. Cost envelope: $30k–$200k per platform.

Data representation

Action is the concatenation a_t = [v_base, ω_base, q_torso, q_arm^L, q_arm^R, q_hand^L, q_hand^R] — typically 22–56 DoF. Observations include a head RGB(-D), two wrist RGBs, joint state, base velocity, and a 2-second history. The instruction is hierarchical: a top-level command plus the current sub-instruction.

Loss · Chunked flow matching over the whole body
How the gradient is computed
L_wb = E_{τ}  ‖ v_θ( A_τ , o_{t-K:t} , ℓ_high , ℓ_sub , τ ) − ( A_1 − A_0 ) ‖²

Same flow-matching skeleton, but A_t ∈ ℝ^{H×D_wb} with D_wb up to 56. The base velocity dimensions are weighted ≈0.3× in the loss because their dynamic range is large and would otherwise dominate the gradient. ACT-style training uses chunked L1 with an embodiment-aware prefix token.
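The per-dimension weighting is easy to get wrong by an index, so a small sketch helps. It assumes the base velocities occupy the first two action dimensions, as in the concatenation [v_base, ω_base, …], and averages over all H×D entries; the layout, the averaging, and the helper names are assumptions for illustration.

```python
def make_dim_weights(d_wb, base_dims=(0, 1), base_weight=0.3):
    """Weight vector of length d_wb: base_weight on base dims, 1.0 elsewhere."""
    return [base_weight if d in base_dims else 1.0 for d in range(d_wb)]


def weighted_chunk_loss(pred, target, dim_weights):
    """Mean weighted squared error over an H x D chunk stored as a list of rows.

    pred is the predicted velocity field, target is A_1 - A_0.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t, w in zip(pred_row, target_row, dim_weights):
            total += w * (p - t) ** 2
            count += 1
    return total / count
```

With D_wb = 4 and a unit error on one base dimension and one arm dimension, the loss is (0.3 + 1.0) / 4, so the base error contributes less than a third of what the arm error does.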

Egocentric human video

Hand pose recovery and inverse-dynamics labelling

Egocentric video carries no native action labels. The collection pipeline is mostly a label-recovery pipeline: turn observed wrist and finger motion into pseudo-actions a policy can train on.

Collection pipeline
  1. 01 · Capture: Vision Pro, Aria, GoPro Hero on a chest harness, or Quest 3. Vision Pro and Aria stream native 6-DoF wrist pose and per-finger joints; commodity cams need offline pose reconstruction.
  2. 02 · Hand & body pose: HaMeR / WiLoR for monocular hand mesh, MANO for parametric fingers, SLAHMR / TRAM for body. Vision Pro's on-device tracker is the current gold standard for fingers.
  3. 03 · Scene anchoring: Camera ego-pose from VIO (ARKit, Project Aria SLAM, or COLMAP) anchors hand pose in a world frame. Without this step, hand trajectories are uselessly camera-relative.
  4. 04 · Retargeting: The recovered 21-keypoint hand is retargeted onto a target gripper or 6-DoF hand via an optimization solver (dex-retargeting) that minimizes fingertip and palm error subject to joint limits.
  5. 05 · Pseudo-action extraction: Pseudo-actions are computed as finite differences on the retargeted wrist pose and joint angles, or via a learned inverse-dynamics model trained on a small paired (video, action) corpus.
  6. 06 · Narration & verbs: Free-form narrations are aligned to clips (Ego4D / EPIC); LLM passes convert them into instruction-style imperatives matching the robot data style.
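The finite-difference route in step 05 can be sketched as below. To stay short it covers only the wrist position and the joint angles; wrist orientation would need a relative-rotation computation that is omitted here. The function names and the velocity-style output layout are assumptions.

```python
import math


def wrap_angle(delta):
    """Wrap an angle difference into (-pi, pi] so joint deltas do not jump by 2*pi."""
    return math.atan2(math.sin(delta), math.cos(delta))


def finite_difference_actions(wrist_positions, joint_angles, dt):
    """Velocity-style pseudo-actions from consecutive retargeted frames.

    wrist_positions: per-frame (x, y, z) in the world frame (after scene anchoring).
    joint_angles:    per-frame joint vectors in radians (after retargeting).
    Returns T-1 pseudo-actions, each [vx, vy, vz, dq_1/dt, ..., dq_n/dt].
    """
    actions = []
    for t in range(len(wrist_positions) - 1):
        dp = [(b - a) / dt
              for a, b in zip(wrist_positions[t], wrist_positions[t + 1])]
        dq = [wrap_angle(b - a) / dt
              for a, b in zip(joint_angles[t], joint_angles[t + 1])]
        actions.append(dp + dq)
    return actions
```

Raw hand tracking is noisy, so in practice the pose stream is usually smoothed before differencing; differencing amplifies jitter roughly in proportion to the frame rate.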
Hardware

Capture rigs: Apple Vision Pro (best fingers), Meta Project Aria (best multimodal), Quest 3, GoPro Hero 12 + chest harness, Insta360. Pipelines: ARKit hand tracking, HaMeR, WiLoR, MANO, SLAHMR, dex-retargeting. Cost envelope: a $3.5k Vision Pro can collect 20+ hours of dexterous data per day — three orders of magnitude cheaper per hour than teleop.

Data representation

Per frame: { I_t, T_wrist^{L,R}∈SE(3), q_finger^{L,R}∈ℝ^{15}, T_head∈SE(3), narration }. Pseudo-actions ã_t = IDM_φ(o_{t-K:t+H}) are produced either by finite differencing wrist pose or by a learned inverse-dynamics network φ.

Loss · Latent-action or IDM-supervised imitation
How the gradient is computed
L_ego = L_IDM( φ(o_{t-K:t+H}) , Δṗ_wrist ) + L_BC( π_θ(o,ℓ) , ã_t )

Two coupled losses. L_IDM trains the inverse-dynamics network on the small paired corpus where true actions exist. L_BC trains the policy on the much larger ego corpus using IDM-produced pseudo-actions. Latent-action variants (Genie-style) replace ã_t with a discrete code z_t = VQ(f(o_t, o_{t+1})) and predict that code instead, decoupling the policy from any single robot embodiment.
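The discrete code in the latent-action variant is a nearest-neighbour lookup in a codebook. The sketch below shows only that lookup: the learned encoder f is replaced by a plain feature difference as a stand-in, and the codebook is given rather than trained, both of which are simplifying assumptions.

```python
def quantize(feature, codebook):
    """Return (index, code vector) of the codebook entry nearest in squared L2."""
    def sq_dist(i):
        return sum((f - c) ** 2 for f, c in zip(feature, codebook[i]))
    best = min(range(len(codebook)), key=sq_dist)
    return best, codebook[best]


def latent_action(feat_t, feat_next, codebook):
    """Stand-in for z_t = VQ(f(o_t, o_{t+1})), with f taken to be the feature difference."""
    delta = [b - a for a, b in zip(feat_t, feat_next)]
    return quantize(delta, codebook)
```

The policy is then trained to predict the returned index with a classification loss, and a small embodiment-specific decoder maps codes back to robot actions.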

Navigation

Panoramic capture, expert paths, and instruction crowdsourcing

Navigation ground truth is two pieces: a 3D substrate (a scanned building) and a corpus of (instruction, path) pairs that humans have authored against that substrate.

Collection pipeline
  1. 01 · 3D scan the world: Matterport / Faro / iPhone-LiDAR captures of homes and offices produce textured meshes (Matterport3D, HM3D, Gibson, ScanNet++). Pano viewpoints are sampled on a navigability graph.
  2. 02 · Expert path generation: An A* or shortest-path planner produces a reference trajectory between two viewpoints, plus continuous waypoints for VLN-CE-style settings.
  3. 03 · Instruction authoring: Crowd workers walk the path in a viewer and write a natural-language description. RxR additionally records timing so the instruction is aligned to motion segments.
  4. 04 · Multilingual & re-phrasing: RxR collects English / Hindi / Telugu in parallel; modern pipelines also synthesize 5–20 paraphrases per instruction with an LLM, filtered for path entailment.
  5. 05 · Real-robot capture (driving): For GNM / ViNT / NoMaD, the substrate is replaced with hours of front-camera + odometry recordings across many wheeled platforms.
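Expert path generation (step 02) amounts to a shortest-path query on the navigability graph. A minimal A* sketch follows, assuming 2D viewpoint positions and Euclidean edge costs; the graph layout and names are illustrative, and real pipelines typically work on 3D viewpoints with precomputed geodesic distances.

```python
import heapq
import math


def astar(positions, edges, start, goal):
    """A* over a navigability graph.

    positions: {node: (x, y)}; edges: {node: [neighbour, ...]}.
    Edge cost and heuristic are both Euclidean distance, so the heuristic
    never overestimates and the returned path is a shortest one.
    Returns the node list from start to goal, or None if goal is unreachable.
    """
    def dist(a, b):
        return math.dist(positions[a], positions[b])

    frontier = [(dist(start, goal), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if cost > best_cost[node]:
            continue  # stale queue entry, a cheaper route was found already
        for nxt in edges.get(node, []):
            new_cost = cost + dist(node, nxt)
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost + dist(nxt, goal), new_cost, nxt))
    return None
```

The returned viewpoint sequence is what annotators later walk through in the viewer when they write the instruction.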
Hardware

Scanning: Matterport Pro2/3, Faro Focus, iPhone Pro LiDAR + Polycam. Simulators: Habitat 3.0, AI2-THOR, iGibson, ManiSkill-Habitat. Real driving rigs: Jackal, LoCoBot, TurtleBot, Spot, custom golf carts with a single front RGB and wheel odometry.

Data representation

Episodes are { M, ℓ, (v_0…v_T) } where M is the scene mesh, ℓ is the instruction, and viewpoints carry pano RGB(-D) plus a heading. Continuous variants store SE(2) waypoints at ≈4 Hz. Unified VLAs reduce this to an 8-waypoint chunk W_t = (Δx, Δy, Δθ)_{1:8} per decision step.
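One way to build the 8-waypoint chunk from stored SE(2) poses is sketched below. It assumes each waypoint is expressed relative to the current robot pose (not to the previous waypoint) and that short tails are padded by repeating the last waypoint; the source does not specify either convention, so both are assumptions.

```python
import math


def to_relative_chunk(pose, future_poses, n=8):
    """Express the next n world-frame SE(2) poses in the frame of the current pose.

    pose and future_poses entries are (x, y, theta) with theta in radians.
    Returns exactly n tuples (dx, dy, dtheta).
    """
    x0, y0, th0 = pose
    c, s = math.cos(th0), math.sin(th0)
    chunk = []
    for x, y, th in future_poses[:n]:
        dx_w, dy_w = x - x0, y - y0
        dx = c * dx_w + s * dy_w        # rotate the world offset into the robot frame
        dy = -s * dx_w + c * dy_w
        dth = math.atan2(math.sin(th - th0), math.cos(th - th0))
        chunk.append((dx, dy, dth))
    pad = chunk[-1] if chunk else (0.0, 0.0, 0.0)  # padding rule is an assumption
    while len(chunk) < n:
        chunk.append(pad)
    return chunk
```

For a robot at (1, 1) facing +y, a waypoint at (1, 2) comes out as roughly (1, 0, 0): one metre straight ahead, no turn.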

Loss · Discrete-action XE or waypoint flow matching
How the gradient is computed
L_nav = − Σ log p_θ( a_t | o_t, ℓ )   /   L_wp = E_{τ}  ‖ v_θ(W_τ, o_t, ℓ, τ) − (W_1 − W_0) ‖²

Classic VLN models output a discrete choice over the navigability graph (cross-entropy). Continuous and unified VLA settings predict the 8-waypoint chunk under the same flow-matching loss used for manipulation. Auxiliary terms include a progress monitor L_prog = ‖ p̂_t − p_t^* ‖ and a stop-classifier head.

Simulation & synthetic

Procedural scenes, motion planners, and rendering at scale

Synthetic ground truth is generated, not collected. The pipeline replaces the human operator with a planner or an RL agent and replaces the camera with a renderer.

Collection pipeline
  1. 01 · Asset & scene gen: Procedural scene authors (RoboCasa, BEHAVIOR-1K, ProcTHOR) place objects from PartNet / Objaverse-XL into physics-valid layouts with randomized lighting, textures, and clutter.
  2. 02 · Task specification: Each task is defined by an initial state distribution, a goal predicate (e.g. on(cup, tray)), and a success function. PDDL-style task graphs cover long-horizon settings.
  3. 03 · Trajectory generation: An OMPL / cuRobo motion planner or an RL agent (PPO with dense reward) solves the task; only successful rollouts are kept. Domain randomization perturbs textures, lighting, friction, and object scale.
  4. 04 · Rendering: Isaac Sim, MuJoCo MJX, SAPIEN, or PyBullet renders RGB(-D) at the robot's eye. High-end variants ray-trace via Omniverse RTX or Gaussian-splat real scenes for photoreal evaluation.
  5. 05 · Vision-suppressed T2A: For text-to-action pretraining, the image is intentionally dropped, leaving (instruction, action chunk) pairs. The Qwen-VLA T2A ablation shows this beats vision-conditioned synthetic by +10.2 pp.
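Steps 02 and 03 reduce to three small pieces: draw randomization parameters, evaluate a goal predicate on the final state, and keep only rollouts that satisfy it. The sketch below is simulator-agnostic; the state layout, tolerance values, and parameter ranges are all hypothetical and stand in for whatever a real simulator exposes.

```python
import random


def sample_randomization(rng):
    """Draw one set of domain-randomization parameters (ranges are illustrative)."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "light_intensity": rng.uniform(0.3, 1.0),
        "object_scale": rng.uniform(0.9, 1.1),
        "texture_id": rng.randrange(100),
    }


def on(state, top, bottom, xy_tol=0.05, z_tol=0.02):
    """Geometric stand-in for on(top, bottom).

    state maps object name -> {"base_pos": (x, y, z), "height": h}, where
    base_pos is the centre of the object's bottom face (assumed layout).
    """
    tx, ty, tz = state[top]["base_pos"]
    bx, by, bz = state[bottom]["base_pos"]
    surface_z = bz + state[bottom]["height"]
    return (abs(tx - bx) <= xy_tol and abs(ty - by) <= xy_tol
            and abs(tz - surface_z) <= z_tol)


def keep_successful(rollouts, goal):
    """Filter rollouts (dicts with a 'final_state' key) by a goal predicate."""
    return [r for r in rollouts if goal(r["final_state"])]
```

A task such as on(cup, tray) would be passed as `lambda s: on(s, "cup", "tray")`, and the sampled randomization parameters would be stored alongside each kept rollout.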
Hardware

Pure software stack — but the bottleneck is GPUs: Isaac Sim on RTX 4090 / A100; MuJoCo MJX on TPU or H100 for massively parallel rollouts (10k+ envs). A single H100 can generate ≈100k trajectories per day for tabletop tasks.

Data representation

Identical schema to teleop: { o_t, a_t, ℓ, success, randomization_params }. The extra randomization metadata is what makes sim-to-real domain adaptation tractable. For RL, the buffer also stores (r_t, v_t, log π_old).

Loss · PPO with GAE on sparse success reward
How the gradient is computed
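L_PPO = E[ min( ρ_t Â_t , clip(ρ_t, 1−ε, 1+ε) Â_t ) ] − c_v ‖ V_θ − R̂ ‖² + c_H H[π_θ]

Here ρ_t = π_θ(a_t|o_t) / π_old(a_t|o_t) is the probability ratio (written ρ_t to keep it distinct from the reward r_t stored in the buffer), and Â_t = Σ_{k≥0} (γλ)^k δ_{t+k} is the GAE advantage with TD residual δ_t = r_t + γ V_θ(o_{t+1}) − V_θ(o_t). As written, the expression is maximized; the loss handed to the optimizer is its negative. The reward is typically binary on success plus shaping. PPO is warm-started from an imitation-pretrained checkpoint and then closes the gap between “mimics the demo” and “actually succeeds”; Qwen-VLA and π₀-RL both follow this recipe.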
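The two moving parts of this objective, the GAE recursion and the clipped surrogate, fit in a short sketch. It covers a single episode, omits the value and entropy terms, and uses default γ, λ, and ε values that are common choices and not taken from the source.

```python
import math


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    values has len(rewards) + 1 entries; the last one is the bootstrap value
    (0.0 if the episode terminated).
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate; this quantity is maximized (negate it for a loss)."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                       # pi_theta / pi_old
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * a, clipped * a)
    return total / len(advantages)
```

With a sparse success reward, every step of a successful episode inherits a discounted share of the final reward through the backward recursion, which is what lets a binary signal shape a long action sequence.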

Auxiliary VL data

Box-and-caption annotation that keeps the backbone literate

Auxiliary VL data is collected with classical crowdsourcing — bounding boxes, referring expressions, dense captions, driving QA — and exists to stop the action loss from quietly destroying the VLM's language ability.

Collection pipeline
  1. 01 · Source images: Mined from COCO, OpenImages, Visual Genome, LAION, driving logs (nuScenes, Waymo), or robot-camera frames sampled from teleop runs.
  2. 02 · Annotation: Mechanical-Turk-style pipelines collect boxes, referring expressions (RefCOCO), region captions, or VQA pairs. Driving sets add HD-map and trajectory ground truth from offline auto-labeling.
  3. 03 · Fine-grained action captions: The newest and most surgical slice: a 1–3 s robot clip is densely captioned (“rotate clockwise 30°, then slide left 4 cm”) by trained annotators, producing the supervision that disambiguates ambiguous coarse labels.
  4. 04 · Quality filtering: Spans are CLIP-scored, deduped, and LLM-rewritten for stylistic consistency with downstream instruction formats.
Hardware

No special hardware — the cost is purely human-hours. Annotation tools: CVAT, Label Studio, Scale, Surge.

Data representation

Standard VL pairs: { I, text }, optionally with bounding boxes b ∈ ℝ^4 serialized into the text as <box>x0 y0 x1 y1</box>. For driving: (I, ℓ, future-trajectory).
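Serializing a box into the text stream can be sketched as below. The `<box>x0 y0 x1 y1</box>` format comes from the text above; quantizing pixel coordinates onto an integer 0–999 grid is an assumption made here so that boxes are resolution-independent, and the helper names are hypothetical.

```python
import re


def quantize_box(box_px, width, height, bins=1000):
    """Map a pixel box (x0, y0, x1, y1) to integer bins in [0, bins - 1] (assumed scheme)."""
    x0, y0, x1, y1 = box_px

    def q(v, size):
        return max(0, min(bins - 1, int(v / size * bins)))

    return (q(x0, width), q(y0, height), q(x1, width), q(y1, height))


def serialize_box(box):
    """Render a box as the inline text span used in the VL pairs."""
    return "<box>{} {} {} {}</box>".format(*box)


_BOX_RE = re.compile(r"<box>(\d+) (\d+) (\d+) (\d+)</box>")


def parse_boxes(text):
    """Recover every serialized box from a text string."""
    return [tuple(int(g) for g in m.groups()) for m in _BOX_RE.finditer(text)]
```

For example, `"the red mug " + serialize_box(quantize_box((0, 0, 320, 240), 640, 480))` yields a referring expression whose box can be recovered with `parse_boxes`.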

Loss · Next-token language-modelling loss
How the gradient is computed
L_VL = − Σ_t  log p_θ( y_t | y_{<t}, I )

Plain causal-LM cross-entropy on the textual targets. The auxiliary VL loss is added to the action loss with a small weight (≈0.1–0.3) so the backbone is continually reminded how to read the world while the action head is being trained. Without it, after ≈20k steps the VLM's grounding capabilities visibly degrade.
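The token-level loss and its weighted combination with the action loss look like this in plain Python. The sketch assumes the per-step logits have already been produced by the model conditioned on the image and the preceding tokens; the default weight of 0.2 is simply a value inside the ≈0.1–0.3 range quoted above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def causal_lm_loss(logits_per_step, target_ids):
    """L_VL = - sum_t log p(y_t | y_<t, I), given the logits at each step t."""
    return -sum(log_softmax(step_logits)[y]
                for step_logits, y in zip(logits_per_step, target_ids))


def total_loss(action_loss, vl_loss, vl_weight=0.2):
    """Action loss plus the down-weighted auxiliary VL loss."""
    return action_loss + vl_weight * vl_loss
```

In a real training loop the two losses usually come from different batches, so the weight also interacts with how often VL batches are sampled.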

§ 13 A worked example

How one frontier VLA spends its tokens.

The Qwen-VLA pretraining mixture is a reasonable proxy for what a modern unified VLA looks like in 2026. Three quarters of the budget goes to manipulation trajectories. The remaining quarter is the strategically interesting part — every slice is there for a measurable reason.

Qwen-VLA pretraining mixture
Source: arXiv:2605.30280 · Table 1
Total: 100.0%
  • Robot manipulation trajectories: 74.2%
  • Navigation trajectories: 7.5%
  • Egocentric human trajectories: 6.0%
  • Synthetic simulation (ours): 3.7%
  • General vision-language data: 3.4%
  • Spatial grounding (2D): 2.5%
  • Autonomous-driving VQA: 2.4%
  • Fine-grained action captions: 0.2%
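A mixture like this is typically consumed by a weighted sampler. The sketch below copies the percentages from the list above (they add up to 99.9, presumably from rounding, so the weights are renormalized) and draws a source name per training example. Sampling per example instead of per token, and the dictionary keys themselves, are assumptions.

```python
import random

# Percentages as listed in the mixture above; keys are illustrative names.
MIXTURE_PCT = {
    "robot_manipulation": 74.2,
    "navigation": 7.5,
    "egocentric_human": 6.0,
    "synthetic_simulation": 3.7,
    "general_vision_language": 3.4,
    "spatial_grounding_2d": 2.5,
    "driving_vqa": 2.4,
    "fine_grained_action_captions": 0.2,
}


def normalized(weights):
    """Rescale weights so they sum to exactly 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n, rng):
    """Draw n source names with probability proportional to their weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)
```

At 0.2%, the action-caption slice appears about twice per thousand examples, which is why small slices are often oversampled or scheduled explicitly instead of being left to chance.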
Why egocentric is only 6%.

Not because it isn't valuable, but because dexterous ego-video supervision still needs an inverse-dynamics or hand-pose pipeline to convert into action labels. As EgoDex / EgoVerse standardize, expect this slice to triple within a year.

Why action captions are 0.2%.

They are surgical, not bulk. Their job is to disambiguate the cases where the same coarse label (“pick up the bowl”) maps to two valid motions. A small slice forces the model to ground action sequences in dense, ordered language.

Why driving VQA at all.

Driving datasets are the only mature source of long-horizon trajectory-centric supervision: ego pose, lane-relative position, future waypoints. They make the same flow-matching head work for autonomous driving with no architectural change.

Why ≈74% manipulation.

The bar for usable physical realism is still set by teleop. Every other family is being added to extend the policy beyond what teleop can cover — never to replace it. That ratio is unlikely to flip before 2028.

§ 14 Benchmarks

How the field reports a number.

These are the suites a 2026 unified VLA is expected to publish on. Numbers below are reported results for Qwen-VLA-Instruct (arXiv 2605.30280). They are useful less as a leaderboard and more as a map of what the field currently measures — and where it still does not.

Suite | Embodiment | What it measures | Reported result | Note
LIBERO | Single Franka | 4 task suites (Spatial / Object / Goal / Long) | 97.9% | Effectively saturated by leading VLAs.
Simpler-WidowX | WidowX | Real-to-sim, aligned with real WidowX | 73.7% | The honest sim benchmark; correlates with hardware.
RoboCasa-GR1 | Bimanual humanoid (GR-1) | 24 atomic kitchen tasks | | Best probe of household generalization.
RoboTwin 2.0 (Easy/Hard) | Dual-arm | 50 bimanual tasks | 86.1 / 87.2% | Hard tier still exposes long-horizon coordination failures.
R2R (OSR) | Mobile (Matterport3D) | Vision-and-Language Navigation | 69.0% | Discrete-graph instruction following.
RxR (SR) | Mobile | Multilingual VLN, longer paths | 59.6% | Dense, time-aligned instructions in 3 languages.
ALOHA real-world OOD | Bimanual ALOHA | Out-of-distribution real-world | 76.9% | Best honest measure of real-world generalization.
DOMINO (zero-shot) | Single-arm | Dynamic manipulation, zero-shot | 26.6% | Frontier is still very far from solved.
§ 15 Further reading

Primary sources.

If you build one thing after reading this, build a pretraining mixture. These are the papers and dataset pages to read first.