If you only read one section.
The frontier moved from “130k Google episodes” (RT-1, 2022) to a million+ trajectories pooled across 20+ robots (Open X-Embodiment, AgiBot World, RoboMIND v2). The bet: one policy across every arm beats a per-robot specialist.
Mobile ALOHA, AgiBot A2-D, Galaxea R1, AIRBOT MMK2, TienKung. Whole-body bimanual teleop with both joint and dexterous-hand action spaces is now collected at scale.
The bottleneck is robot teleop. The escape valve is human first-person video: Ego4D, EPIC-KITCHENS, EgoDex (829 hours on Apple Vision Pro), EgoVerse, Xperience. Together they will dwarf all robot data by 2027.
R2R, RxR, VLN-CE used to need dedicated waypoint models. Unified VLAs now treat navigation as an 8-waypoint chunk prediction — same loss, same decoder.
For continuous control, the field has converged on flow matching with action chunking. RL with PPO/GAE is layered on top to close the gap between imitation and closed-loop success.
InternData-A1, GR00T-X-Embodiment-Sim, and in-house pipelines now contribute 5–10% of mixtures. Critically, vision-free synthetic data turns out to teach the action prior better than vision-conditioned synthetic data.
Five families of supervision.
Before listing datasets, fix the categories. A modern VLA is trained on a deliberate mixture of five data families. They differ in what signal they carry, not in what hardware shot them: a robot trajectory provides direct action labels, a YouTube clip does not. The mixture is the model.
Trajectories captured by a human teleoperating a real robot. Direct, action-labelled, expensive (≈$50–$300 per hour of usable data). Defines the bar for physical realism.
Trajectories produced by a planner or RL agent in a renderer. Cheap, infinitely re-rolled, but always carries a sim-to-real gap that needs randomization and domain-adaptation work.
First-person video of a human doing things. No native action labels; recovered via inverse dynamics, hand tracking, or latent action models. The only signal that scales like the internet.
Long-horizon panoramic video paired with instructions or goal categories. Carries 3-DoF planar trajectories and weak semantic supervision.
Anything that keeps the backbone literate: VQA, spatial grounding, action captions, driving VQA, dense embodied descriptions. Small slice of the mixture, outsized effect on instruction following.
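To make "the mixture is the model" concrete, here is a minimal sketch of drawing training samples by family. The weights are illustrative placeholders chosen for this example (an assumption, not any published recipe), and the names `MIXTURE` and `sample_families` are hypothetical.

```python
import random
from collections import Counter

# Illustrative weights only (assumption): they sum to 1.0 and do not
# reproduce the mixture of any particular model.
MIXTURE = {
    "real_teleop": 0.70,
    "simulation": 0.08,
    "egocentric_video": 0.08,
    "navigation": 0.06,
    "auxiliary_vl": 0.08,
}

def sample_families(n, rng):
    """Draw n data-family names with probability proportional to MIXTURE."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

rng = random.Random(0)
counts = Counter(sample_families(10_000, rng))
for name in MIXTURE:
    # Empirical share of each family in the sampled stream.
    print(f"{name:18s} {counts[name] / 10_000:.3f}")
```

In a real pipeline the weights would usually be tuned by ablation, which is why the sampler is kept separate from the weights table.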
What the data actually looks like.
Before the catalogue, the raw thing. Below: verified author-published teaser videos for the most-cited datasets in each family, a real cover sheet from Build AI's Egocentric-10K, and direct links to interactive sample browsers on every official project page. Open one and scrub through episodes before you trust the prose.
RT-1 · Robotics Transformer
Real manipulation
Official supplementary video from Google Robotics. Shows the policy executing 700+ tasks in kitchens and offices, with the corresponding teleop demonstrations beside each rollout.
Project page ↗
DROID · 76k Franka demos
Real manipulation
Author-narrated tour through the DROID corpus: 564 scenes, 86 tasks, three synchronized cameras, the same Franka Panda across 18 institutions.
Project page ↗
Mobile ALOHA · Stanford
Mobile manipulation
The video that recalibrated everyone's priors on whole-body teleop. Cooking shrimp, watering plants, riding elevators, all from ≈50 demos co-trained with static ALOHA data.
Project page ↗
Ego4D · Meta AI
Egocentric
Meta's official Ego4D launch reel: 3,670 hours of unscripted first-person video collected across 74 cities. The pretraining substrate behind almost every modern embodied vision encoder.
Project page ↗
Egocentric-10K · Build AI
Egocentric
Official cover sheet for the 10,000-hour factory egocentric corpus: the open ego dataset with the highest hand visibility and active-manipulation density, captured on Build AI Gen 1 headsets at 1080p / 30 fps.
Hugging Face dataset card ↗
BridgeData V2
Real manipulation
Browseable per-task sample grid: every one of the 60k WidowX trajectories has third-person + wrist-cam video plus a language label.
Open X-Embodiment Explorer
Real manipulation
Interactive viewer across 22 embodiments. The fastest way to feel just how heterogeneous the cross-embodiment pool actually is.
AgiBot World Colosseo
Mobile manipulation
Five fully-replicated home, retail and office environments captured with a 100-robot farm. Sample videos and per-task statistics on the OpenDriveLab page.
EPIC-KITCHENS-100
Egocentric
100 hours of densely-narrated first-person cooking. The interactive visualizer lets you scrub verb/noun annotations frame-by-frame.
EgoExo4D
Egocentric
Same activity captured from one egocentric and four exocentric cameras simultaneously, paired with expert commentary. The bridge dataset for ego↔exo viewpoint transfer.
Matterport3D / HM3D viewer
Navigation
Walk through the 1,000 building-scale scans that back R2R, RxR, ObjectNav and HM3D-Semantics in your browser.
Room-Across-Room (RxR)
Navigation
Multilingual VLN: paths annotated with time-aligned spoken instructions in English, Hindi, and Telugu. The official site lets you replay any episode.
LIBERO benchmark suite
Simulation
Standardized lifelong-manipulation benchmark: Spatial / Object / Goal / Long suites with downloadable rollouts.
RoboCasa kitchens
Simulation
Procedurally generated photorealistic kitchens for the GR-1 humanoid. Sample task videos live on the project page.
Where the field is actually investing collection effort.
A log-scale comparison of the headline volumes for every dataset surveyed in this guide. Units are not interchangeable — a teleop hour and a YouTube hour cost three orders of magnitude apart and carry different supervision — but the chart maps where the field is currently putting its scaling pressure.
Hours and trajectories are not the same currency — ten teleop hours can encode the same skill repertoire as a thousand passive video hours. Read the chart as a map of where the field is investing collection effort, not as a ranking of per-sample value.
What signal each family natively ships.
A policy is only as multimodal as its data. This matrix shows, family by family, which sensor and label streams are typically present out-of-the-box. Read it as a checklist when designing a mixture: a gap here is a gap in your final policy unless another family fills it.
Teleoperation is still the gold standard.
Robot manipulation trajectories form ≈74% of the Qwen-VLA pretraining mixture and a similar share of every serious VLA. The supervision is the cleanest you can buy: synchronized multi-view RGB, a language instruction, and a chunk of future actions in the robot's native control convention. The price is the catch: a single high-quality teleop hour costs more than a thousand hours of YouTube.
RT-1
Google · 2022
First demonstration that a single Transformer policy could absorb hundreds of skills across kitchens and offices when fed enough teleoperated data.
BridgeData V2
UC Berkeley · 2024
Open, low-cost teleop corpus designed for cross-task generalization on a cheap arm. The de-facto benchmark for low-budget VLA research.
DROID
Stanford / 18 labs · 2024
Distributed-collection consortium dataset capturing the same robot across 18 institutions, dorms, kitchens, labs. Designed explicitly for scene diversity.
RH20T
SJTU · 2023
One-shot generalist data with synchronized multi-view RGB-D, contact, audio, and language — built so policies can imitate from a single human-shown demo.
BC-Z
Google / Stanford · 2022
Showed that zero-shot task generalization is possible when teleoperation data is paired with language and human video task descriptions.
RoboMIND v1/v2
Beijing Academy of AI · 2025
Multi-embodiment benchmark dataset emphasizing standardized task taxonomies and failure-case annotations across heterogeneous platforms.
AgiBot World
AgiBot · 2025
Largest open bimanual humanoid corpus to date — five fully replicated home/retail environments collected with a 100-robot farm.
Galaxea Open World
Galaxea AI · 2025
Bimanual mobile teleoperation in homes and offices, distributed in a unified observation-action schema for cross-embodiment training.
RDT-1B
Tsinghua · 2025
Curated cross-embodiment bimanual aggregate behind the RDT-1B diffusion policy — emphasizes physically aligned bimanual coordination.
RoboCOIN
Multi-institution · 2025
An open consortium release pooling teleop and simulation across labs into a single, license-clean training pool.
RobotSet
CMU / Berkeley · 2023
Early generalist dataset designed for skill transfer studies across morphologies; an ancestor of Open X-Embodiment style pooling.
Open X-Embodiment
DeepMind + 21 institutions · 2024
The unifying schema that made cross-embodiment training tractable. Most modern VLA pretraining recipes draw a large share of their tokens from this pool.
The whole body is the action space.
Mobile manipulation data is fundamentally different from tabletop teleop because the base is now part of the action. The policy has to coordinate a 3-DoF chassis with two arms (and often two dexterous hands) over minutes, not seconds. Datasets here are smaller, harder to collect, and the most strategically important — they are what general-purpose home robots will be trained on.
Mobile ALOHA
Stanford · 2024
Low-cost wheeled bimanual teleop platform that proved a few dozen demos plus co-training with static ALOHA data unlock cooking, laundry, and elevator-riding.
AgiBot A2-D Trajectories
AgiBot · 2025
Whole-body bimanual data with a wheeled base; supports both absolute joint and dexterous-hand action spaces.
Galaxea R1
Galaxea AI · 2025
A commercial bimanual mobile platform whose dataset prioritizes home and storefront tasks with reproducible layouts.
AIRBOT MMK2
Discover Robotics · 2025
Cheap mobile bimanual with dexterous hands — a popular choice for academic mobile-manipulation studies.
TienKung
Beijing X-Humanoid · 2025
Full-body humanoid trajectories with both gripper and dexterous-hand action spaces, useful for embodiment-aware co-training.
BEHAVIOR-1K
Stanford · 2024
Simulation-side mobile manipulation: a thousand annotated household activities, useful when paired with sim-to-real navigation pretraining.
The data source that scales like the web.
A single Vision Pro session can collect more dexterous hours in a weekend than an entire teleop lab does in a year. Egocentric human video carries no native action labels, so the field has invested heavily in pipelines that recover frame-wise 3D hand and finger pose — the input to inverse-dynamics models or latent action models that produce pseudo-action supervision. Recent VLAs allocate 6–10% of their mixture here, and the number is climbing.
Ego4D
FAIR + 14 universities · 2022
The foundational egocentric-video pretraining set. Used to seed almost every modern visual encoder for embodiment (R3M, MVP, VC-1).
EPIC-KITCHENS-100
Bristol / Toronto / Catania · 2022
Dense first-person kitchen activity with verb/noun grounding; the canonical benchmark for fine-grained manipulation understanding.
EgoDex
Apple · 2025
Largest dexterous egocentric set captured on Vision Pro — paired 3D hand-and-finger tracking finally makes human video usable as direct action supervision.
EgoVerse
Open consortium · 2026
A collaborative ego platform built to ship in one unified format so labs can pool data without per-dataset wrappers.
Xperience
Ropedia · 2026
Synchronized first-person multimodal recordings with hierarchical instruction annotations — designed as the next-generation Ego4D successor.
Egocentric-10K
Build AI · 2025
The largest egocentric corpus to date and the first collected exclusively inside real factories. State-of-the-art in hand visibility and active-manipulation density — built specifically as pretraining fuel for industrial VLA and dexterous policies.
Egocentric-10K-Evaluation
Build AI · 2025
Held-out evaluation slice of Egocentric-10K with dense annotations — used to score hand detection, contact, and active-object grounding in factory settings.
HoloAssist
Microsoft · 2023
Mixed-reality egocentric capture of real assembly/repair tasks with synchronized instructor speech — the canonical procedural-assistance dataset for embodied LLMs.
Assembly101
Meta / TUM / Singapore · 2022
Multi-view (ego + exo) procedural assembly with fine-grained mistake annotations; widely used for action segmentation and error detection.
EgoExo4D
FAIR + 15 universities · 2024
Paired egocentric + exocentric capture of skilled activities (cooking, sports, repair) with expert narration — the bridge dataset for ego↔exo viewpoint transfer.
The infinite, imperfect oracle.
Simulation is no longer a backup. It now serves three distinct roles: a benchmark (LIBERO, Simpler, RoboCasa, RoboTwin), a source of pretraining trajectories (InternData-A1, GR00T-X-Embodiment-Sim), and the rollout environment for reinforcement learning on top of an imitation-pretrained policy. The Qwen-VLA T2A ablation contains a striking result: ≈20% synthetic mixed with 80% real, with vision suppressed, beats every other ratio by ~10 points downstream.
LIBERO
UT Austin · 2023
Lifelong learning benchmark with four axes: Spatial, Object, Goal, Long. The standard simulation report card for VLAs.
SimplerEnv
Stanford / Google · 2024
A real-to-sim suite engineered so simulator success correlates with real-robot success — used widely as a cheap proxy for hardware evaluation.
RoboCasa
UT Austin / NVIDIA · 2024
Procedurally generated kitchens with photorealistic assets, designed for everyday household manipulation evaluation.
RoboTwin 2.0
Open community · 2025
A dual-arm benchmark with a careful difficulty split that exposes failure modes in long-horizon bimanual coordination.
InternData-A1
Shanghai AI Lab · 2025
Simulation trajectories generated by motion planners in diverse virtual scenes — used to widen long-tail object and layout coverage.
GR00T-X-Embodiment-Sim
NVIDIA · 2025
Synthetic counterpart to GR00T's training stack: procedurally varied scenes rendered across many embodiments to seed a universal policy.
DOMINO
Open community · 2026
Zero-shot evaluation of dynamic skills (pouring, sliding, throwing) where contact dynamics dominate — the hardest known generalization probe.
What keeps the backbone literate.
Auxiliary VL data — driving VQA, 2D spatial grounding, fine-grained action captions, general image-text — is a small fraction of a typical mixture (≈8.5% in Qwen-VLA) but does outsized work: it stops the action-loss gradient from quietly destroying the VLM's language understanding, and it is the only place the model learns the dense vocabulary needed for fine-grained instructions like “rotate clockwise, then slide left.”
nuScenes / Waymo Open / Argoverse 2
Various · 2019–23
Autonomous-driving VQA and motion-forecasting corpora feed trajectory-centric supervision into general embodied models.
RefCOCO / RefCOCOg / Visual Genome
Various · 2014–17
2D spatial-grounding data that keeps a VLA backbone literate in “the red mug on the left.”
LAION / DataComp / OBELICS
LAION / community · 2022–24
The general vision-language substrate used during continual pretraining so the action model does not forget how to read the world.
Fine-grained Embodied Captions
Curated · 2025–26
Dense action-level captions (“rotate clockwise, then slide left”) that disambiguate the same coarse label collapsing to two different motions.
One table to match data to method.
Every dataset above implies a learning technique. Real teleop wants behavior cloning or flow matching. Ego video wants inverse dynamics or a latent action model. Simulation rollouts want PPO. Below: the techniques that matter in 2026, what they consume, and what each is uniquely good at.
| Technique | Data it consumes | Loss | Why it matters |
|---|---|---|---|
| Behavior Cloning (BC): supervised mimicry. Best for: plenty of clean teleop, single embodiment | Real teleop trajectories (RT-1, BridgeData V2, DROID) | MSE / cross-entropy on actions | The starting point for almost every manipulation policy. Cheap, but compounds errors out of distribution. Used as the warm-up for everything below. |
| Action Chunking + Transformer (ACT): predict an action chunk, not one step. Best for: bimanual & contact-rich tasks (ALOHA, Mobile ALOHA) | ALOHA & RoboMIND-style synchronized bimanual | L1 over chunked actions + VAE prior | Predicts H actions at once and re-plans every k steps. The single most important architectural trick for high-frequency bimanual policies. |
| Diffusion Policy: denoise the next action chunk. Best for: multimodal action distributions, dexterous tasks | Mixed teleop with multiple valid solutions | DDPM / DDIM noise prediction | Treats the chunk of future actions as an image to denoise. Captures the multi-modal nature of human teleoperation that L2 losses average away. |
| Flow Matching for Actions: continuous-time denoising decoder. Best for: VLA action experts (π₀, Qwen-VLA, GR00T) | Cross-embodiment continuous control | Velocity-field regression with Beta / Sigmoid-Normal timestep priors | The new default. Cheaper than diffusion at inference, and lets a vision-language backbone attach a small DiT action expert that consumes language tokens directly. |
| Vision-Language-Action Pretraining: VLM + action head, one model. Best for: cross-task, cross-embodiment generalization | Open X-Embodiment + sim + ego video | Next-token LM loss + flow-matching action loss | RT-2, OpenVLA, π₀, Qwen-VLA. The big idea: keep a pretrained VLM literate, tack on an action expert, supervise with both losses simultaneously. |
| Text-to-Action (T2A) Pretraining: learn actions from language alone, no images. Best for: establishing an action prior before visual grounding | Synthetic + real trajectories with images dropped | Flow matching with Sigmoid-Normal τ-schedule | Qwen-VLA shows that pretraining the action decoder on language-conditioned trajectories, with vision suppressed, beats no-T2A by +10.2 pp downstream. Forces the decoder to ground in language, not visual shortcuts. |
| Embodiment-Aware Prompt Conditioning: tell the model which robot it is. Best for: multi-robot, multi-action-space training | Any cross-embodiment mixture | n/a | Prepend a textual description of the platform, arm count, control frequency, and action space. Removes the need for embodiment-specific heads and enables zero-shot transfer to new robots. |
| Per-Dataset Quantile Normalization: scale-free action targets. Best for: pooling heterogeneous teleop sources | Mixed real-robot trajectories | n/a | Each dataset's action dimensions are mapped to [-1, 1] using the 1st/99th quantiles per source. Removes scale differences across embodiments without losing relative motion structure. |
| Inverse Dynamics + Pseudo-Actions: recover actions from video. Best for: egocentric human video, action-less footage | Ego4D, EPIC-KITCHENS, YouTube | Finite differences on proprioception or learned IDM | Most ego data ships without explicit actions. Frame-wise hand pose plus an inverse-dynamics network produces pseudo-action labels that train policies as if a human were the teleoperator. |
| Latent Action Models (LAMs): discover an action space from video. Best for: web video at scale | Unlabeled internet video (Genie-style) | Reconstruction of next frame from latent | Compress “what changed between two frames” into a small vector, then learn a policy that emits those latents. Used by Genie 2/3 and several robot world-model pipelines to unify human video with robot actions. |
| World-Model Imitation: train policies inside a learned simulator. Best for: sample-efficient RL, policy evaluation | Video + action pairs (DreamGen, UniSim, DreamerV3) | RL or imitation inside imagined rollouts | Pretrain on human video, fine-tune on a small robot dataset, then practice in imagined rollouts. The current frontier for closing the data gap between teleop and reality. |
| RL with PPO / GAE on Sparse Rewards: optimize for closed-loop success. Best for: pushing SFT checkpoints past imitation ceilings | On-policy rollouts in simulation | PPO clipped surrogate + value head | Likelihood-based SFT teaches the policy to imitate; RL teaches it to succeed. Qwen-VLA and π₀-RL use PPO/GAE on simulator success signals; HIL-SERL pursues the same goal with off-policy RL on real robots. |
| Vision-and-Language Navigation Imitation: predict waypoints from instruction + history. Best for: R2R, RxR, VLN-CE | Pano video + instruction transcripts | Cross-entropy on discrete actions or flow matching on waypoints | Modern unified models treat VLN as just another action-and-trajectory prediction problem: 8 future waypoints per chunk, supervised the same way as manipulation. |
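
The embodiment-aware prompt conditioning row is easy to picture as string construction. A minimal sketch follows; the exact wording of the prefix and the function names `embodiment_prefix` and `build_prompt` are assumptions for illustration, not a format taken from any of the models named in the table.

```python
def embodiment_prefix(platform, num_arms, control_hz, action_space):
    """Describe the robot in plain text (hypothetical template)."""
    return (
        f"Robot: {platform}. Arms: {num_arms}. "
        f"Control frequency: {control_hz} Hz. Action space: {action_space}."
    )

def build_prompt(instruction, platform, num_arms, control_hz, action_space):
    """Prepend the embodiment description to the task instruction."""
    prefix = embodiment_prefix(platform, num_arms, control_hz, action_space)
    return f"{prefix} Task: {instruction}"

prompt = build_prompt(
    "put the cup on the tray",
    platform="Franka Panda",
    num_arms=1,
    control_hz=15,
    action_space="delta end-effector pose",
)
print(prompt)
```

Because the description lives in the text stream, one action head can in principle serve several robots; the control frequency and action-space strings here are placeholder values.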
From sensor stream to gradient step.
A dataset is the output of a pipeline. For each data family below: how raw signal is actually captured, the hardware behind it, the exact tensor schema it is stored in, and the loss the policy computes against it. Read this section if you intend to build a collection rig, an annotation pipeline, or a new VLA loss.
Human-in-the-loop joint and end-effector capture
A trained operator drives the robot through the task while every sensor stream is logged at a fixed rate. The labels are not annotated after the fact — they are the operator's commands.
- 01 · Scene reset: A scripted reset places objects within calibrated bounds (tracked via fiducials or an overhead camera). Every episode must be replayable; randomized initial poses are recorded.
- 02 · Operator interface: Operator wears VR (Quest, Vision Pro) or holds a leader arm (ALOHA, GELLO). Leader joints stream at 50–1000 Hz into an inverse-kinematics or direct-joint mapper.
- 03 · Synchronized recording: A central clock (PTP or ROS 2 message_filters) timestamps RGB(-D), wrist cams, joint encoders, gripper width, F/T sensors, and the operator's command. Drift < 5 ms is the standard bar.
- 04 · Action logging: Two streams are logged: target (commanded) and achieved (measured). Policies train on commanded actions; achieved is for diagnostics and inverse-dynamics fallbacks.
- 05 · Language pairing: An instruction is spoken or typed once per episode (or per sub-segment). Whisper / hand-transcription produces the final string; templated re-phrasings (10–30×) are generated by an LLM for robustness.
- 06 · QA & curation: Episodes are auto-filtered on success (force spikes, gripper closure timing, end-effector pose vs. target). A second pass scores instruction faithfulness; ≈15–35% of raw episodes are discarded.
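The synchronized-recording step above can be sketched as nearest-timestamp matching with a skew limit. This is a minimal illustration assuming sorted, non-empty timestamp lists in seconds; `nearest_index` and `align` are hypothetical helper names, and the 5 ms default simply mirrors the drift bar quoted in the list.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Index of the entry in sorted `timestamps` closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align(reference_ts, other_ts, max_skew_s=0.005):
    """For each reference stamp, return the matched index in other_ts,
    or None when the closest sample is further away than max_skew_s."""
    matches = []
    for t in reference_ts:
        j = nearest_index(other_ts, t)
        matches.append(j if abs(other_ts[j] - t) <= max_skew_s else None)
    return matches

camera_ts = [k / 30.0 for k in range(4)]    # 30 Hz camera frames
joint_ts = [k / 500.0 for k in range(40)]   # 500 Hz encoders, shorter log
print(align(camera_ts, joint_ts))           # frames past the encoder log map to None
```

Frames that come back as `None` are candidates for dropping during curation rather than being paired with stale sensor readings.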
Arm: Franka Panda, UR5e, WidowX 250, ARX-5, AgileX Cobot Magic. Cameras: 2–4 × RealSense D435/D455, ZED 2i, or Logitech BRIO at 30 Hz, plus 1–2 wrist cams. Teleop: leader arm (ALOHA), VR controllers (Quest 3, Vision Pro), or 3D SpaceMouse. Compute: Jetson Orin or a tethered workstation (RTX 4090) running the ROS / LCM bus. Cost envelope: $8k–$50k per station; $50–$300 of usable data per operator-hour after QA.
Each step t stores { o_t, a_t, ℓ } where o_t = (I_t^cam, q_t, g_t, F_t) bundles RGB(-D) tensors, joint positions, gripper width, and (optionally) force. Actions are stored in the robot's native space — Δ-EEF in SE(3), absolute joint, or joint velocity — never silently converted. Per-dataset 1st/99th-percentile quantile normalization maps each action dimension to [-1, 1].
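A minimal sketch of the per-dataset quantile normalization described above, in plain Python. Clipping values outside the 1st/99th percentiles to [-1, 1] and the linear quantile interpolation are assumptions of this sketch; the function names are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def fit_quantile_stats(actions, low_q=0.01, high_q=0.99):
    """Per-dimension (q01, q99) pairs computed over one dataset's actions."""
    stats = []
    for d in range(len(actions[0])):
        column = sorted(a[d] for a in actions)
        stats.append((quantile(column, low_q), quantile(column, high_q)))
    return stats

def normalize(action, stats, eps=1e-8):
    """Map each dimension to [-1, 1]; outliers are clipped (assumption)."""
    out = []
    for x, (lo, hi) in zip(action, stats):
        y = 2.0 * (x - lo) / max(hi - lo, eps) - 1.0
        out.append(max(-1.0, min(1.0, y)))
    return out

def denormalize(action, stats):
    """Inverse of normalize for values that were not clipped."""
    return [(y + 1.0) / 2.0 * (hi - lo) + lo for y, (lo, hi) in zip(action, stats)]

# Toy dataset: two action dimensions with very different scales.
dataset = [[float(i), 10.0 * i] for i in range(101)]
stats = fit_quantile_stats(dataset)
print(normalize([50.0, 500.0], stats))
```

The statistics are fitted once per source dataset and stored alongside it, so the same `stats` can be reused at rollout time to map predictions back to native units.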
L = E_{τ, ε} ‖ v_θ( a_τ , o_t , ℓ , τ ) − ( a_1 − a_0 ) ‖²
The policy predicts a chunk A_t = (a_t, …, a_{t+H-1}) of H=16–32 future actions. During training, a_0 = ε is a noise sample (typically Gaussian), a_1 is the demonstrated chunk, and the interpolant a_τ = (1−τ)a_0 + τa_1 is formed with τ ∼ Sigmoid-Normal(μ=−0.4, σ=1); the network regresses the velocity field a_1 − a_0. At rollout, 5–10 Euler steps from τ=0→1 reconstruct the chunk; the first k=4–8 actions are executed before re-planning. Plain BC simplifies this to L = ‖ π_θ(o,ℓ) − a* ‖²; ACT replaces it with chunked L1 plus a VAE prior; Diffusion Policy uses the equivalent DDPM noise-prediction loss.
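The pieces of this loss fit in a few lines. Below is a minimal sketch on a flattened chunk, assuming Gaussian noise for a_0 and using an oracle velocity function as a stand-in for a trained v_θ; all function names are hypothetical.

```python
import math
import random

def sample_tau(rng, mu=-0.4, sigma=1.0):
    """Sigmoid-Normal timestep: squash a Gaussian draw into (0, 1)."""
    z = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))

def interpolate(a0, a1, tau):
    """a_tau = (1 - tau) * a0 + tau * a1, elementwise."""
    return [(1.0 - tau) * x0 + tau * x1 for x0, x1 in zip(a0, a1)]

def target_velocity(a0, a1):
    """Regression target a1 - a0."""
    return [x1 - x0 for x0, x1 in zip(a0, a1)]

def fm_loss(v_pred, a0, a1):
    """Mean squared error between predicted and target velocity."""
    v_star = target_velocity(a0, a1)
    return sum((p - t) ** 2 for p, t in zip(v_pred, v_star)) / len(v_star)

def euler_sample(velocity_fn, a0, steps=10):
    """Integrate da/dtau = v(a, tau) from tau=0 to tau=1 with fixed steps."""
    a = list(a0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(a, i * dt)
        a = [x + dt * vx for x, vx in zip(a, v)]
    return a

rng = random.Random(0)
a0 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # noise sample
a1 = [0.1, -0.2, 0.3, 0.0]                    # stand-in for a demonstrated chunk
tau = sample_tau(rng)
a_tau = interpolate(a0, a1, tau)

def oracle(a, t):
    """Perfect velocity field for this one (a0, a1) pair; replaces v_theta."""
    return target_velocity(a0, a1)

print(fm_loss(oracle(a_tau, tau), a0, a1))
print(euler_sample(oracle, a0, steps=10))
```

With the oracle field the Euler rollout lands on a_1 up to rounding error, which is the behaviour a trained network approximates in expectation.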
Whole-body teleop with base + arms in one frame
Same idea as tabletop teleop, but the chassis is now part of the action. The harder problem is keeping the base, arms, and head referenced in a single coordinate frame as the robot moves through a building.
- 01 · Body-frame calibration: Extrinsics between base, torso, arms, head, and external cameras are calibrated once per session against an AprilTag rig. Base odometry drift is bounded by SLAM or motion capture.
- 02 · Whole-body teleop: Operator sits on a follower trolley (Mobile ALOHA) or in a haptic exo-suit; base velocity, torso pitch, and two-arm joints are streamed together.
- 03 · Multi-clock fusion: Base odometry (≈50 Hz), arm encoders (≈500 Hz), and head camera (30 Hz) are PTP-synced and resampled to a 50 Hz canonical rate before storage.
- 04 · Long-horizon segmentation: Episodes are minutes long. They are split at language sub-instruction boundaries (“go to the fridge”, “open the door”, “grab the milk”) so the policy sees both atomic and composite chunks.
- 05 · Co-training rebalance: Static-arm episodes from the same robot are mixed in 1:1–4:1 with mobile episodes; without this, the policy collapses to the more frequent static behavior (Mobile ALOHA finding).
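The multi-clock fusion step boils down to resampling each stream onto one uniform grid. A minimal sketch for a single scalar channel, assuming linear interpolation is acceptable (camera frames would normally be matched by nearest timestamp instead) and at least two sorted samples; `resample_linear` is a hypothetical name.

```python
def resample_linear(ts, values, rate_hz):
    """Linearly interpolate (ts, values) onto a uniform grid at rate_hz.

    ts must be sorted, in seconds, with at least two entries.
    Returns (grid_timestamps, resampled_values).
    """
    period = 1.0 / rate_hz
    n = int((ts[-1] - ts[0]) / period + 1e-9) + 1
    grid = [ts[0] + k * period for k in range(n)]
    out, j = [], 0
    for t in grid:
        # Advance to the segment [ts[j], ts[j + 1]] that contains t.
        while j + 1 < len(ts) - 1 and ts[j + 1] < t:
            j += 1
        t0, t1 = ts[j], ts[j + 1]
        w = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
        w = min(1.0, max(0.0, w))
        out.append(values[j] * (1.0 - w) + values[j + 1] * w)
    return grid, out

# A 2 Hz toy signal resampled to 4 Hz.
grid, resampled = resample_linear([0.0, 0.5, 1.0, 1.5, 2.0],
                                  [0.0, 1.0, 2.0, 3.0, 4.0], rate_hz=4)
print(list(zip(grid, resampled)))
```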
Platforms: Mobile ALOHA, AgiBot A2-D, Galaxea R1, AIRBOT MMK2, Tien Kung Pro, Unitree H1/G1. Sensors: head RGB-D (ZED 2i or RealSense D455), 2 × wrist cams, base IMU, wheel encoders or leg joint encoders, optional 2D/3D LiDAR. Compute: on-board Jetson Orin + workstation for recording. Cost envelope: $30k–$200k per platform.
Action is the concatenation a_t = [v_base, ω_base, q_torso, q_arm^L, q_arm^R, q_hand^L, q_hand^R] — typically 22–56 DoF. Observations include a head RGB(-D), two wrist RGBs, joint state, base velocity, and a 2-second history. The instruction is hierarchical: a top-level command plus the current sub-instruction.
L_wb = E_{τ} ‖ v_θ( A_τ , o_{t-K:t} , ℓ_high , ℓ_sub , τ ) − ( A_1 − A_0 ) ‖²
Same flow-matching skeleton, but A_t ∈ ℝ^{H×D_wb} with D_wb up to 56. The base velocity dimensions are weighted ≈0.3× in the loss because their dynamic range is large and otherwise dominates the gradient. ACT-style training uses chunked L1 with an embodiment-aware prefix token.
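The base-dimension down-weighting can be sketched as a per-dimension weight vector applied inside the squared error. This assumes the base velocity occupies the first two action dimensions, matching the [v_base, ω_base, …] layout given earlier; the helper names are hypothetical.

```python
def make_weights(dim, base_dims=(0, 1), base_weight=0.3):
    """Weight vector: base_weight on base dims, 1.0 everywhere else."""
    return [base_weight if d in base_dims else 1.0 for d in range(dim)]

def weighted_velocity_loss(v_pred, v_target, dim_weights):
    """Weighted mean squared error over a chunk.

    v_pred and v_target are H rows of D floats; dim_weights has D floats.
    """
    total, count = 0.0, 0
    for row_p, row_t in zip(v_pred, v_target):
        for p, t, w in zip(row_p, row_t, dim_weights):
            total += w * (p - t) ** 2
            count += 1
    return total / count

weights = make_weights(4)
pred = [[1.0, 0.0, 1.0, 0.0]]      # one step, four dims
target = [[0.0, 0.0, 0.0, 0.0]]
print(weighted_velocity_loss(pred, target, weights))
```

The same unit error costs 0.3 on a base dimension and 1.0 on an arm dimension, which is the rebalancing the paragraph describes.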
Hand pose recovery and inverse-dynamics labelling
Egocentric video carries no native action labels. The collection pipeline is mostly a label-recovery pipeline: turn observed wrist and finger motion into pseudo-actions a policy can train on.
- 01 · Capture: Vision Pro, Aria, GoPro Hero on a chest harness, or Quest 3. Vision Pro and Aria stream native 6-DoF wrist pose and per-finger joints; commodity cams need offline pose reconstruction.
- 02 · Hand & body pose: HaMeR / WiLoR for monocular hand mesh, MANO for parametric fingers, SLAHMR / TRAM for body. Vision Pro's on-device tracker is the current gold standard for fingers.
- 03 · Scene anchoring: Camera ego-pose from VIO (ARKit, Project Aria SLAM, or COLMAP) anchors hand pose in a world frame. Without this step, hand trajectories are uselessly camera-relative.
- 04 · Retargeting: The recovered 21-keypoint hand is retargeted onto a target gripper or 6-DoF hand via an optimization solver (dex-retargeting) that minimizes fingertip and palm error subject to joint limits.
- 05 · Pseudo-action extraction: Pseudo-actions are computed as finite differences on the retargeted wrist pose and joint angles, or via a learned inverse-dynamics model trained on a small paired (video, action) corpus.
- 06 · Narration & verbs: Free-form narrations are aligned to clips (Ego4D / EPIC); LLM passes convert them into instruction-style imperatives matching the robot data style.
Capture rigs: Apple Vision Pro (best fingers), Meta Project Aria (best multimodal), Quest 3, GoPro Hero 12 + chest harness, Insta360. Pipelines: ARKit hand tracking, HaMeR, WiLoR, MANO, SLAHMR, dex-retargeting. Cost envelope: a $3.5k Vision Pro can collect 20+ hours of dexterous data per day — three orders of magnitude cheaper per hour than teleop.
Per frame: { I_t, T_wrist^{L,R}∈SE(3), q_finger^{L,R}∈ℝ^{15}, T_head∈SE(3), narration }. Pseudo-actions ã_t = IDM_φ(o_{t-K:t+H}) are produced either by finite differencing wrist pose or by a learned inverse-dynamics network φ.
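The finite-differencing route is the simpler of the two. A minimal sketch on wrist translation only, assuming positions are already anchored in a world frame and sampled at a fixed frame rate; rotation and finger joints are left out, and `finite_difference_actions` is a hypothetical name.

```python
def finite_difference_actions(wrist_positions, fps):
    """Per-step translational velocity from consecutive wrist positions.

    wrist_positions: list of (x, y, z) in metres, world frame.
    Returns one (vx, vy, vz) tuple in m/s per consecutive frame pair.
    """
    dt = 1.0 / fps
    actions = []
    for p0, p1 in zip(wrist_positions, wrist_positions[1:]):
        actions.append(tuple((b - a) / dt for a, b in zip(p0, p1)))
    return actions

track = [(0.00, 0.0, 0.0), (0.01, 0.0, 0.0), (0.03, 0.0, 0.0)]
print(finite_difference_actions(track, fps=30))
```

Raw hand tracks are noisy, so some smoothing before differencing is common in practice; that step is omitted here.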
L_ego = L_IDM( φ(o_{t-K:t+H}) , Δṗ_wrist ) + L_BC( π_θ(o,ℓ) , ã_t )
Two coupled losses. L_IDM trains the inverse-dynamics network on the small paired corpus where true actions exist. L_BC trains the policy on the much larger ego corpus using IDM-produced pseudo-actions. Latent-action variants (Genie-style) replace ã_t with a discrete code z_t = VQ(f(o_t, o_{t+1})) and predict that code instead, decoupling the policy from any single robot embodiment.
Panoramic capture, expert paths, and instruction crowdsourcing
Navigation ground truth is two pieces: a 3D substrate (a scanned building) and a corpus of (instruction, path) pairs that humans have authored against that substrate.
- 01 · 3D scan the world: Matterport / Faro / iPhone-LiDAR captures of homes and offices produce textured meshes (Matterport3D, HM3D, Gibson, ScanNet++). Pano viewpoints are sampled on a navigability graph.
- 02 · Expert path generation: An A* or shortest-path planner produces a reference trajectory between two viewpoints, plus continuous waypoints for VLN-CE-style settings.
- 03 · Instruction authoring: Crowd workers walk the path in a viewer and write a natural-language description. RxR additionally records timing so the instruction is aligned to motion segments.
- 04 · Multilingual & re-phrasing: RxR collects English / Hindi / Telugu in parallel; modern pipelines also synthesize 5–20 paraphrases per instruction with an LLM, filtered for path entailment.
- 05 · Real-robot capture (driving): For GNM / ViNT / NoMaD, the substrate is replaced with hours of front-camera + odometry recordings across many wheeled platforms.
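Expert path generation on a navigability graph can be sketched with Dijkstra's algorithm, one of the shortest-path planners the list alludes to. The graph layout (a dict of neighbour-to-distance dicts) and the viewpoint names are assumptions made for this example.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over {node: {neighbor: edge_length}}.

    Returns (total_length, [nodes from start to goal]),
    or (inf, []) when the goal is unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            path = [u]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return float("inf"), []

# Toy navigability graph: edge weights are distances in metres.
nav_graph = {
    "hall": {"kitchen": 1.0, "bedroom": 4.0},
    "kitchen": {"hall": 1.0, "bedroom": 1.0},
    "bedroom": {"hall": 4.0, "kitchen": 1.0},
}
print(shortest_path(nav_graph, "hall", "bedroom"))
```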
Scanning: Matterport Pro2/3, Faro Focus, iPhone Pro LiDAR + Polycam. Simulators: Habitat 3.0, AI2-THOR, iGibson, ManiSkill-Habitat. Real driving rigs: Jackal, LoCoBot, TurtleBot, Spot, custom golf carts with a single front RGB and wheel odometry.
Episodes are { M, ℓ, (v_0…v_T) } where M is the scene mesh, ℓ is the instruction, and viewpoints carry pano RGB(-D) plus a heading. Continuous variants store SE(2) waypoints at ≈4 Hz. Unified VLAs reduce this to an 8-waypoint chunk W_t = (Δx, Δy, Δθ)_{1:8} per decision step.
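One way to build such a chunk from stored SE(2) poses is to express the next eight poses in the robot's current frame. Whether the offsets are measured from the current pose (as here) or between successive waypoints is an assumption of this sketch, and `waypoint_chunk` is a hypothetical name.

```python
import math

def wrap_angle(theta):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))

def waypoint_chunk(poses, t, horizon=8):
    """Up to `horizon` future poses as (dx, dy, dtheta) in the frame of poses[t].

    poses: list of (x, y, theta) in a world frame, theta in radians.
    """
    x0, y0, th0 = poses[t]
    c, s = math.cos(th0), math.sin(th0)
    chunk = []
    for x, y, th in poses[t + 1 : t + 1 + horizon]:
        dx_w, dy_w = x - x0, y - y0
        # Rotate the world-frame offset into the robot's heading frame.
        dx = c * dx_w + s * dy_w
        dy = -s * dx_w + c * dy_w
        chunk.append((dx, dy, wrap_angle(th - th0)))
    return chunk

# Robot at (1, 1) facing +y, then moving one metre straight ahead.
poses = [(1.0, 1.0, math.pi / 2), (1.0, 2.0, math.pi / 2)]
print(waypoint_chunk(poses, 0))
```

In the robot frame the step shows up as forward motion along x with no lateral offset, which is what makes the representation independent of where the building's origin sits.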
L_nav = − Σ log p_θ( a_t | o_t, ℓ )
L_wp = E_{τ} ‖ v_θ(W_τ, o_t, ℓ, τ) − (W_1 − W_0) ‖²
Classic VLN models output a discrete choice over the navigability graph (cross-entropy). Continuous and unified VLA settings predict the 8-waypoint chunk under the same flow-matching loss used for manipulation. Auxiliary terms include a progress monitor L_prog = ‖ p̂_t − p_t^* ‖ and a stop-classifier head.
Procedural scenes, motion planners, and rendering at scale
Synthetic ground truth is generated, not collected. The pipeline replaces the human operator with a planner or an RL agent and replaces the camera with a renderer.
- 01 · Asset & scene gen: Procedural scene authors (RoboCasa, BEHAVIOR-1K, ProcTHOR) place objects from PartNet / Objaverse-XL into physics-valid layouts with randomized lighting, textures, and clutter.
- 02 · Task specification: Each task is defined by an initial state distribution, a goal predicate (e.g. on(cup, tray)), and a success function. PDDL-style task graphs cover long-horizon settings.
- 03 · Trajectory generation: An OMPL / cuRobo motion planner or an RL agent (PPO with dense reward) solves the task; only successful rollouts are kept. Domain randomization perturbs textures, lighting, friction, and object scale.
- 04 · Rendering: Isaac Sim, MuJoCo MJX, SAPIEN, or PyBullet renders RGB(-D) at the robot's eye. High-end variants ray-trace via Omniverse RTX or Gaussian-splat real scenes for photoreal evaluation.
- 05 · Vision-suppressed T2A: For text-to-action pretraining, the image is intentionally dropped, leaving (instruction, action chunk) pairs. The Qwen-VLA T2A ablation shows this beats vision-conditioned synthetic by +10.2 pp.
Pure software stack — but the bottleneck is GPUs: Isaac Sim on RTX 4090 / A100; MuJoCo MJX on TPU or H100 for massively parallel rollouts (10k+ envs). A single H100 can generate ≈100k trajectories per day for tabletop tasks.
Identical schema to teleop: { o_t, a_t, ℓ, success, randomization_params }. The extra randomization metadata is what makes sim-to-real domain adaptation tractable. For RL, the buffer also stores (r_t, v_t, log π_old).
L_PPO = E[ min( r_t Â_t , clip(r_t, 1±ε) Â_t ) ] − c_v ‖ V_θ − R̂ ‖² + c_H H[π_θ]
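Here r_t = π_θ(a_t|o_t) / π_old(a_t|o_t) is the probability ratio between the new and old policies (not the per-step reward), and Â_t = Σ_k (γλ)^k δ_{t+k} is the GAE advantage, where δ_t = reward_t + γ V_θ(o_{t+1}) − V_θ(o_t) is the TD residual. The expression is an objective to be maximized. The reward is typically binary on success plus shaping. Imitation-pretrained checkpoints are warm-started, then PPO closes the gap between “mimics the demo” and “actually succeeds”; Qwen-VLA and π₀-RL both follow this recipe.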
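A compact NumPy sketch of the two pieces, GAE and the clipped objective, may make the formula concrete. The hyperparameter defaults (γ, λ, ε, c_v, c_H) are common illustrative choices, not values reported by any of the cited systems, and the function names are hypothetical.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    rewards: length T. values: length T + 1 (last entry is the bootstrap value).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # sum of (gamma*lam)^k deltas
        adv[t] = running
    return adv

def ppo_objective(logp_new, logp_old, adv, v_pred, returns, entropy,
                  eps=0.2, c_v=0.5, c_h=0.01):
    """Clipped surrogate minus value error plus entropy bonus (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)            # pi_theta / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.mean(np.minimum(ratio * adv, clipped * adv))
    value_err = np.mean((v_pred - returns) ** 2)
    return float(surrogate - c_v * value_err + c_h * np.mean(entropy))
```

With a binary success reward at the final step and γ = λ = 1, every step of a successful episode receives the same advantage, which is why shaping terms are usually added.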
Box-and-caption annotation that keeps the backbone literate
Auxiliary VL data is collected with classical crowdsourcing — bounding boxes, referring expressions, dense captions, driving QA — and exists to stop the action loss from quietly destroying the VLM's language ability.
- 01 · Source images: Mined from COCO, OpenImages, Visual Genome, LAION, driving logs (nuScenes, Waymo), or robot-camera frames sampled from teleop runs.
- 02 · Annotation: Mechanical-Turk-style pipelines collect boxes, referring expressions (RefCOCO), region captions, or VQA pairs. Driving sets add HD-map and trajectory ground truth from offline auto-labeling.
- 03 · Fine-grained action captions: The newest and most surgical slice: a 1–3 s robot clip is densely captioned (“rotate clockwise 30°, then slide left 4 cm”) by trained annotators, producing the supervision that disambiguates ambiguous coarse labels.
- 04 · Quality filtering: Spans are CLIP-scored, deduped, and LLM-rewritten for stylistic consistency with downstream instruction formats.
No special hardware — the cost is purely human-hours. Annotation tools: CVAT, Label Studio, Scale, Surge.
Standard VL pairs: { I, text }, optionally with bounding boxes b ∈ ℝ^4 serialized into the text as <box>x0 y0 x1 y1</box>. For driving: (I, ℓ, future-trajectory).
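The box serialization can be sketched as a pair of helpers. This assumes integer pixel coordinates; the source does not say whether coordinates are raw pixels or normalized to a fixed grid, so treat that choice, and the function names, as placeholders.

```python
import re

# Matches the <box>x0 y0 x1 y1</box> format with integer coordinates.
_BOX_RE = re.compile(r"<box>(-?\d+) (-?\d+) (-?\d+) (-?\d+)</box>")

def serialize_box(box):
    """Turn (x0, y0, x1, y1) into the inline text form, rounding to integers."""
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    return f"<box>{x0} {y0} {x1} {y1}</box>"

def parse_boxes(text):
    """Recover every serialized box from a text target."""
    return [tuple(int(g) for g in m.groups()) for m in _BOX_RE.finditer(text)]

caption = "the red cup " + serialize_box((10, 20.4, 110, 220))
boxes = parse_boxes(caption)
```

Because the box lives inside the text target, the same causal-LM loss covers grounding with no extra head.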
L_VL = − Σ_t log p_θ( y_t | y_{<t}, I )
Plain causal-LM cross-entropy on the textual targets. The auxiliary VL loss is added to the action loss with a small weight (≈0.1–0.3) so the backbone is continually reminded how to read the world while the action head is being trained. Without it, after ≈20k steps the VLM's grounding capabilities visibly degrade.
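As a rough sketch of how the two terms combine, the snippet below computes the token-level cross-entropy from raw logits and adds it to an action loss with a small weight. The default weight of 0.2 is simply a point inside the ≈0.1–0.3 range quoted above, and both function names are hypothetical.

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Sum over positions of -log p(y_t | y_<t, I).

    logits: array (T, vocab), already conditioned on the prefix and image.
    targets: integer array (T,) of next-token ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-logp[np.arange(len(targets)), targets].sum())

def total_loss(action_loss, vl_loss, vl_weight=0.2):
    """Action loss plus the down-weighted auxiliary VL loss."""
    return action_loss + vl_weight * vl_loss
```

In practice the two losses would typically come from different examples in the same batch, so the weight also interacts with the mixture shares.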
How one frontier VLA spends its tokens.
The Qwen-VLA pretraining mixture is a reasonable proxy for what a modern unified VLA looks like in 2026. Three quarters of the budget goes to manipulation trajectories. The remaining quarter is the strategically interesting part — every slice is there for a measurable reason.
- Robot manipulation trajectories: 74.2%
- Navigation trajectories: 7.5%
- Egocentric human trajectories: 6.0%
- Synthetic simulation (ours): 3.7%
- General vision-language data: 3.4%
- Spatial grounding (2D): 2.5%
- Autonomous-driving VQA: 2.4%
- Fine-grained action captions: 0.2%
Egocentric human data is only 6.0% of the mixture, not because it isn't valuable, but because dexterous ego-video supervision still needs an inverse-dynamics or hand-pose pipeline to convert it into action labels. As EgoDex / EgoVerse standardize, expect this slice to triple within a year.
Fine-grained action captions (0.2%) are surgical, not bulk. Their job is to disambiguate the cases where the same coarse label (“pick up the bowl”) maps to two valid motions. A small slice forces the model to ground action sequences in dense, ordered language.
Autonomous-driving VQA (2.4%) is included because driving datasets are the only mature source of long-horizon trajectory-centric supervision: ego pose, lane-relative position, future waypoints. They make the same flow-matching head work for autonomous driving with no architectural change.
Robot manipulation dominates at 74.2% because the bar for usable physical realism is still set by teleop. Every other family is being added to extend the policy beyond what teleop can cover, never to replace it. That ratio is unlikely to flip before 2028.
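The mixture shares listed above can be turned into a weighted sampler in a few lines. This sketch assumes the percentages are used directly as per-example sampling weights; the source reports them as shares of the pretraining budget, so a token-weighted scheme would need per-source example lengths as well. The dictionary keys and function names are illustrative.

```python
import random
from collections import Counter

# Shares as listed; they sum to 99.9 because of rounding.
MIXTURE = {
    "robot_manipulation": 74.2,
    "navigation": 7.5,
    "egocentric_human": 6.0,
    "synthetic_simulation": 3.7,
    "general_vl": 3.4,
    "spatial_grounding_2d": 2.5,
    "driving_vqa": 2.4,
    "fine_grained_captions": 0.2,
}

def normalized_weights(mixture):
    """Rescale the shares so they sum to exactly 1."""
    total = sum(mixture.values())
    return {name: share / total for name, share in mixture.items()}

def sample_sources(mixture, n, seed=0):
    """Draw n data-source names in proportion to their mixture share."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

counts = Counter(sample_sources(MIXTURE, 10_000))
```

At this batch scale the 0.2% caption slice contributes only a handful of examples per ten thousand, which is one practical reason small slices are sometimes upsampled or scheduled explicitly.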
How the field reports a number.
These are the suites a 2026 unified VLA is expected to publish on. Numbers below are reported results for Qwen-VLA-Instruct (arXiv 2605.30280). They are useful less as a leaderboard and more as a map of what the field currently measures — and where it still does not.
| Suite | Embodiment | What it measures | Reported result (Qwen-VLA-Instruct) | Note |
|---|---|---|---|---|
| LIBERO | Single Franka | 4 task suites (Spatial / Object / Goal / Long); only Long is long-horizon | 97.9% | Effectively saturated by leading VLAs. |
| Simpler-WidowX | WidowX | Real-to-sim, aligned with real WidowX | 73.7% | The honest sim benchmark — correlates with hardware. |
| RoboCasa-GR1 | Bimanual humanoid (GR-1) | 24 atomic kitchen tasks | — | Best probe of household generalization. |
| RoboTwin 2.0 (Easy/Hard) | Dual-arm | 50 bimanual tasks | 86.1 / 87.2% | Hard tier still exposes long-horizon coordination failures. |
| R2R (OSR) | Mobile (Matterport3D) | Vision-and-Language Navigation | 69.0% | Discrete-graph instruction following. |
| RxR (SR) | Mobile | Multilingual VLN, longer paths | 59.6% | Dense, time-aligned instructions in 3 languages. |
| ALOHA real-world OOD | Bimanual ALOHA | Out-of-distribution real-world | 76.9% | Best honest measure of real-world generalization. |
| DOMINO (zero-shot) | Single-arm | Dynamic manipulation, zero-shot | 26.6% | Frontier is still very far from solved. |
Primary sources.
If you build one thing after reading this, build a pretraining mixture. These are the papers and dataset pages to read first.
- 2026 · Qwen-VLA: Unifying Vision-Language-Action Modeling · Qwen Team. The unified action-and-trajectory framework this guide is built around.
- 2024 · Open X-Embodiment · DeepMind + 21 institutions. The schema that made cross-embodiment pretraining tractable.
- 2024 · DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset · Khazatsky et al. 76k Franka demos collected across 18 institutions.