The Physical AI Data Stack, Explained
What it takes to produce training data for robots and embodied models — sensor synchronization, teleoperation protocols, episode validation, and delivery formats.
By Data Team
Physical AI data consists of synchronized, multi-sensor recordings of embodied behavior (robot demonstrations, teleoperation, video, depth, LiDAR, force, and kinematics), packaged as validated episodes for training vision-language-action (VLA) models and world models.
Why robotics data is different
Text data can be cleaned after the fact. Sensor data mostly cannot: if two cameras drift 300 ms apart during collection, the episode is effectively unusable, because post-processing cannot reliably reconstruct an alignment that was never captured. Quality must therefore be engineered into the collection protocol.
The stack, layer by layer
1. Task and environment design
A scenario matrix across objects, layouts, lighting, and distractors. Diversity here determines generalization later — and it must be tracked, not assumed.
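One way to make "tracked, not assumed" concrete is to enumerate the matrix and count collected episodes per cell. The sketch below assumes each episode carries one metadata value per axis; the axis values and the names `AXES`, `scenario_matrix`, and `coverage_report` are illustrative placeholders, not part of any particular tool.

```python
from collections import Counter
from itertools import product

# Hypothetical scenario axes; the values are placeholders, not a recommended set.
AXES = {
    "object": ["mug", "bottle", "box"],
    "layout": ["sparse", "cluttered"],
    "lighting": ["bright", "dim"],
    "distractors": ["none", "some"],
}


def scenario_matrix(axes):
    """Enumerate every combination of axis values as a tuple (one cell each)."""
    names = list(axes)
    return [tuple(combo) for combo in product(*(axes[n] for n in names))]


def coverage_report(axes, episodes):
    """Count episodes per scenario cell and list the cells with no episodes.

    `episodes` is a list of dicts holding one value per axis name.
    """
    names = list(axes)
    counts = Counter(tuple(ep[n] for n in names) for ep in episodes)
    missing = [cell for cell in scenario_matrix(axes) if counts[cell] == 0]
    return counts, missing
```

Empty cells in `missing` are the scenarios the dataset says nothing about, which is usually a more useful signal than the total episode count.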
2. Sensor synchronization
Common-clock or hardware-triggered capture across RGB, depth, LiDAR, force/torque, and joint states. Specify the acceptable drift (typically single-digit milliseconds for manipulation) and verify it per episode.
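A per-episode drift check can be sketched as a nearest-neighbor comparison against a reference stream. This is a minimal illustration, assuming timestamps are already expressed in milliseconds on a shared clock; the 5 ms budget and the function names are hypothetical, chosen only to match the "single-digit milliseconds" guideline.

```python
from bisect import bisect_left

# Hypothetical drift budget in milliseconds; set it from your own spec.
DRIFT_BUDGET_MS = 5


def nearest_offset_ms(sorted_ts, t):
    """Absolute gap between t and the closest timestamp in sorted_ts."""
    i = bisect_left(sorted_ts, t)
    # The closest value is either just before or just at/after the insertion point.
    candidates = sorted_ts[max(i - 1, 0): i + 1]
    return min(abs(c - t) for c in candidates)


def max_cross_stream_drift_ms(streams, reference):
    """Worst nearest-neighbor offset of every other stream against the reference.

    `streams` maps stream name -> list of timestamps in ms (must be non-empty).
    """
    ref = streams[reference]
    drift = {}
    for name, ts in streams.items():
        if name == reference:
            continue
        if not ts:
            raise ValueError(f"stream {name!r} has no timestamps")
        ordered = sorted(ts)
        drift[name] = max(nearest_offset_ms(ordered, t) for t in ref)
    return drift


def within_budget(drift, budget_ms=DRIFT_BUDGET_MS):
    """True when every stream stays inside the drift budget."""
    return all(d <= budget_ms for d in drift.values())
```

Choosing the lowest-rate stream as the reference keeps sampling-rate differences from being mistaken for drift: a fast stream always has a sample close to each reference frame, while the reverse is not true.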
3. Operator protocol
Teleoperators are part of the dataset. Scripted task phases, consistent grasp strategies where required, and deliberate variation where it helps — all documented so embodiment consistency can be audited.
4. Validation gates
Automated episode checks before annotation: timestamp drift, dropped frames, completeness, calibration freshness. Reject early; annotating a broken episode wastes money twice.
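The four checks can be sketched as a single gate function that returns the reasons an episode fails. Everything here is an assumption made for illustration: the `Episode` and `Gates` shapes, the required stream names, the thresholds, and the heuristic that a gap longer than 1.5 times the median frame interval counts as a dropped frame.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    streams: dict             # stream name -> sorted timestamps in ms
    max_drift_ms: float       # cross-stream drift measured for this episode
    calibration_age_h: float  # hours since the rig was last calibrated


@dataclass(frozen=True)
class Gates:
    # Hypothetical thresholds; tune them to your own spec.
    required_streams: tuple = ("rgb", "depth", "joint_states")
    drift_budget_ms: float = 5.0
    max_gap_factor: float = 1.5
    max_calibration_age_h: float = 24.0


def dropped_frames(ts, max_gap_factor):
    """Count gaps longer than max_gap_factor times the median frame interval."""
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps:
        return 0
    median = sorted(gaps)[len(gaps) // 2]
    return sum(1 for g in gaps if g > max_gap_factor * median)


def validate(ep, gates=Gates()):
    """Return a list of failure codes; an empty list means the episode passes."""
    failures = []
    for name in gates.required_streams:  # completeness
        if not ep.streams.get(name):
            failures.append(f"missing_stream:{name}")
    if ep.max_drift_ms > gates.drift_budget_ms:  # timestamp drift
        failures.append("drift_exceeded")
    for name, ts in ep.streams.items():  # dropped frames
        if dropped_frames(ts, gates.max_gap_factor):
            failures.append(f"dropped_frames:{name}")
    if ep.calibration_age_h > gates.max_calibration_age_h:  # calibration freshness
        failures.append("stale_calibration")
    return failures
```

Returning codes rather than a single pass/fail flag makes it possible to report where episodes fail, not just how many.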
5. Annotation
Task-phase boundaries, object states, success/failure outcomes, and safety events — the labels that turn raw streams into supervision signal.
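As an illustration of what such labels might look like in practice, the sketch below defines a hypothetical per-episode annotation record and a sanity check on phase boundaries. The field names and the rule that phases must be ordered and non-overlapping are assumptions, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class Phase:
    name: str
    start_ms: int
    end_ms: int


@dataclass
class EpisodeAnnotation:
    episode_id: str
    phases: list          # list of Phase, in temporal order
    object_states: dict   # object name -> state label at episode end
    success: bool
    safety_events: list = field(default_factory=list)


def phase_errors(ann, episode_end_ms):
    """Return human-readable problems with the phase boundaries, if any."""
    errors = []
    prev_end = 0
    for p in ann.phases:
        if p.end_ms <= p.start_ms:
            errors.append(f"{p.name}: empty or inverted span")
        if p.start_ms < prev_end:
            errors.append(f"{p.name}: overlaps previous phase")
        prev_end = max(prev_end, p.end_ms)
    if prev_end > episode_end_ms:
        errors.append("phases extend past episode end")
    return errors
```

A check like this catches boundary mistakes at annotation time, before they reach training as mislabeled supervision.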
6. Delivery
MCAP and ROS bag for robotics-native pipelines; JSONL episode indexes and Parquet for ML pipelines; MP4 proxies for human review.
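A JSONL episode index is simply one JSON object per line, which keeps it easy to stream, diff, and append to. The sketch below uses only the Python standard library; the record fields (`episode_id`, `schema_version`, `mcap_path`, and so on) are hypothetical examples of what an index entry might carry, not a fixed schema.

```python
import json


def write_episode_index(path, records):
    """Write one JSON object per line; sorted keys keep diffs stable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, sort_keys=True) + "\n")


def read_episode_index(path):
    """Read the index back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical index entry; every field name and path here is illustrative.
EXAMPLE_RECORD = {
    "episode_id": "ep-0001",
    "schema_version": "1.0.0",
    "task": "pick_and_place",
    "mcap_path": "episodes/ep-0001.mcap",
    "proxy_mp4_path": "proxies/ep-0001.mp4",
    "success": True,
}
```

Carrying an explicit `schema_version` in every record is one way to let consumers detect schema changes without inspecting the payload files.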
What to ask any data partner
- What is your measured cross-stream timestamp drift?
- What percentage of episodes fail validation, and where?
- How do you track environment and scenario diversity?
- Can you deliver in our schema, and version it?
If the answers are vague, the dataset will be too.