The Physical AI Data Stack, Explained

Physical AI data consists of synchronized, multi-sensor recordings of embodied behavior (robot demonstrations, teleoperation, video, depth, LiDAR, force, and kinematics), packaged as validated episodes for training vision-language-action (VLA) models and world models.

Why robotics data is different

Text data can be cleaned after the fact. Sensor data mostly cannot: if two cameras drift 300 ms apart during collection, the episode is unusable and no amount of post-processing recovers it. Quality must be engineered into the collection protocol.

The stack, layer by layer

1. Task and environment design

A scenario matrix across objects, layouts, lighting, and distractors. Diversity here determines generalization later — and it must be tracked, not assumed.
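One way to make "tracked, not assumed" concrete is to enumerate the matrix and count recorded episodes per cell. The sketch below is illustrative only: the axis values, the `AXES` name, and the tuple layout are assumptions, not a prescribed schema.

```python
from collections import Counter
from itertools import product

# Hypothetical axes; a real matrix would come from the task design.
AXES = {
    "object": ["mug", "box"],
    "layout": ["shelf", "table"],
    "lighting": ["bright", "dim"],
    "distractors": [0, 3],
}

def scenario_matrix(axes):
    """Return every combination of axis values as a tuple, in axis order."""
    return list(product(*axes.values()))

def coverage(axes, recorded):
    """Count recorded episodes per scenario cell and list cells with none."""
    counts = Counter(recorded)
    missing = [cell for cell in scenario_matrix(axes) if counts[cell] == 0]
    return counts, missing

# Each recorded episode is tagged with the cell it was collected in.
recorded = [
    ("mug", "shelf", "bright", 0),
    ("mug", "shelf", "bright", 0),
    ("box", "table", "dim", 3),
]
counts, missing = coverage(AXES, recorded)
print(f"{len(missing)} of {len(scenario_matrix(AXES))} cells have no episodes")
```

A report like this shows where collection is clustering, so gaps can be filled before training rather than discovered afterward.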

2. Sensor synchronization

Common-clock or hardware-triggered capture across RGB, depth, LiDAR, force/torque, and joint states. Specify the acceptable drift (single-digit milliseconds for manipulation) and verify it per episode.
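A per-episode drift check can be sketched as follows, assuming each stream exposes sorted, non-empty timestamps in nanoseconds on a shared clock. The function names, the stream names, and the 5 ms limit are hypothetical. The measure is the nearest-sample offset against a reference stream, which also reflects sample-rate mismatch, so the slowest stream is the more sensible reference.

```python
from bisect import bisect_left

MAX_DRIFT_MS = 5.0  # hypothetical limit; set it from your own spec

def nearest_offset_ns(sorted_ts, t):
    """Absolute gap between t and the closest timestamp in sorted_ts."""
    i = bisect_left(sorted_ts, t)
    candidates = sorted_ts[max(i - 1, 0): i + 1]
    return min(abs(c - t) for c in candidates)

def max_drift_ms(streams, reference):
    """Worst nearest-sample offset of each stream against the reference stream."""
    ref = streams[reference]
    return {
        name: max(nearest_offset_ns(ts, t) for t in ref) / 1e6
        for name, ts in streams.items()
        if name != reference
    }

# Toy episode: timestamps in nanoseconds.
streams = {
    "rgb": [0, 33_000_000, 66_000_000],
    "depth": [1_000_000, 34_000_000, 69_000_000],
    "joints": [0, 10_000_000, 20_000_000, 30_000_000,
               40_000_000, 50_000_000, 60_000_000, 70_000_000],
}
drift = max_drift_ms(streams, reference="rgb")
within_spec = all(ms <= MAX_DRIFT_MS for ms in drift.values())
print(drift, within_spec)
```

Storing the resulting numbers with each episode, rather than only a pass/fail flag, keeps the measured drift available for later audits.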

3. Operator protocol

Teleoperators are part of the dataset. Scripted task phases, consistent grasp strategies where required, and deliberate variation where it helps — all documented so embodiment consistency can be audited.

4. Validation gates

Automated episode checks before annotation: timestamp drift, dropped frames, completeness, calibration freshness. Reject early; annotating a broken episode wastes money twice.
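The four checks can be expressed as a single gate that returns the reasons an episode failed. This is a minimal sketch: `EpisodeStats`, its fields, the required stream names, and every threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class EpisodeStats:
    """Hypothetical per-episode summary produced at the end of collection."""
    episode_id: str
    max_drift_ms: float
    expected_frames: int
    received_frames: int
    streams_present: frozenset
    recorded_at: datetime
    calibrated_at: datetime

# Illustrative thresholds, not recommendations.
REQUIRED_STREAMS = frozenset({"rgb", "depth", "joint_states"})
MAX_DRIFT_MS = 5.0
MAX_DROP_RATIO = 0.01
MAX_CALIBRATION_AGE = timedelta(days=7)

def validate(ep):
    """Return the names of failed gates; an empty list means the episode passes."""
    failures = []
    if ep.max_drift_ms > MAX_DRIFT_MS:
        failures.append("timestamp_drift")
    dropped = ep.expected_frames - ep.received_frames
    if ep.expected_frames == 0 or dropped / ep.expected_frames > MAX_DROP_RATIO:
        failures.append("dropped_frames")
    if not REQUIRED_STREAMS <= ep.streams_present:
        failures.append("incomplete")
    if ep.recorded_at - ep.calibrated_at > MAX_CALIBRATION_AGE:
        failures.append("stale_calibration")
    return failures

good = EpisodeStats("ep-001", 2.1, 900, 898,
                    frozenset({"rgb", "depth", "joint_states", "force"}),
                    datetime(2024, 5, 10), datetime(2024, 5, 8))
bad = EpisodeStats("ep-002", 12.0, 900, 800, frozenset({"rgb"}),
                   datetime(2024, 5, 10), datetime(2024, 4, 1))
for ep in (good, bad):
    print(ep.episode_id, validate(ep) or "ok")
```

Returning named failures instead of a boolean makes it possible to report where episodes fail, not only how many.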

5. Annotation

Task-phase boundaries, object states, success/failure outcomes, and safety events — the labels that turn raw streams into supervision signal.
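As an illustration, the labels can be held in a small record per episode, with a sanity check on phase boundaries. The class names, field names, and nanosecond timestamps below are assumptions, not a standard annotation schema.

```python
from dataclasses import dataclass, field

@dataclass
class PhaseSegment:
    name: str
    start_ns: int
    end_ns: int

@dataclass
class EpisodeAnnotation:
    """Hypothetical annotation record for one episode."""
    episode_id: str
    phases: list
    outcome: str  # "success" or "failure"
    object_states: dict = field(default_factory=dict)
    safety_events: list = field(default_factory=list)

def phase_errors(ann):
    """Return problems with phase boundaries: none, inverted, or overlapping."""
    errors = []
    if not ann.phases:
        errors.append("no phases")
    for seg in ann.phases:
        if seg.end_ns <= seg.start_ns:
            errors.append(f"{seg.name}: end not after start")
    ordered = sorted(ann.phases, key=lambda s: s.start_ns)
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start_ns < prev.end_ns:
            errors.append(f"{prev.name} overlaps {nxt.name}")
    return errors

ann = EpisodeAnnotation(
    episode_id="ep-001",
    phases=[
        PhaseSegment("reach", 0, 1_200_000_000),
        PhaseSegment("grasp", 1_200_000_000, 2_000_000_000),
        PhaseSegment("place", 2_000_000_000, 3_500_000_000),
    ],
    outcome="success",
    object_states={"mug": "on_shelf"},
)
print(phase_errors(ann))
```

Running a check like this on every annotation catches boundary mistakes before they reach training.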

6. Delivery

MCAP and ROS bag for robotics-native pipelines; JSONL episode indexes and Parquet for ML pipelines; MP4 proxies for human review.
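A JSONL episode index is one JSON object per line, which lets an ML pipeline filter episodes without opening the recordings. The sketch below uses only the Python standard library; the field names, paths, and `schema_version` value are hypothetical.

```python
import io
import json

def write_index(records, fp):
    """Write one JSON object per line (JSONL)."""
    for rec in records:
        fp.write(json.dumps(rec, sort_keys=True) + "\n")

def read_index(fp):
    """Read a JSONL index back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]

# Hypothetical index entries pointing at the delivered files.
records = [
    {"episode_id": "ep-001", "schema_version": "1.0.0",
     "mcap_path": "episodes/ep-001.mcap", "proxy_path": "proxies/ep-001.mp4",
     "outcome": "success", "duration_s": 12.4},
    {"episode_id": "ep-002", "schema_version": "1.0.0",
     "mcap_path": "episodes/ep-002.mcap", "proxy_path": "proxies/ep-002.mp4",
     "outcome": "failure", "duration_s": 9.8},
]

buf = io.StringIO()  # stands in for a file on disk
write_index(records, buf)
buf.seek(0)
successes = [r["episode_id"] for r in read_index(buf) if r["outcome"] == "success"]
print(successes)
```

Carrying a schema version in every record lets consumers detect a format change per episode instead of per delivery.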

What to ask any data partner

What is your measured cross-stream timestamp drift?
What percentage of episodes fail validation, and where?
How do you track environment and scenario diversity?
Can you deliver in our schema, and version it?

If the answers are vague, the dataset will be too.