The Garbage-In-Garbage-Out Problem for Robot Policies

Imitation learning is a distribution matching problem: the trained policy will approximate the distribution of behaviors in its training data. If that distribution includes failed grasps, jerky motions, inconsistent strategies, and ambiguous success criteria, the policy faithfully learns to reproduce all of them. Unlike language model training, where individual noisy examples are smoothed out by billions of other examples, robot demonstration datasets are small enough that every single episode matters.

A dataset of 500 carefully controlled demonstrations will produce a better policy than a dataset of 5,000 uncontrolled demonstrations. This is not an aspirational claim -- it is a consistent empirical finding across research groups, task types, and policy architectures. The explanation is straightforward: a clean, diverse, consistent dataset gives the policy a clear signal about what to learn. A noisy, biased, inconsistent dataset gives the policy a confused signal that it resolves by averaging over conflicting behaviors, producing mediocre performance on everything rather than strong performance on anything.

The framework below identifies six dimensions that distinguish high-quality robot demonstration data from mediocre data. Each dimension has a measurable metric and a practical threshold.

Dimension 1: Diversity

Diversity is the most important quality dimension because it directly determines how well the policy generalizes. A dataset must include variation across every axis that will differ between training and deployment: object instances, object positions, lighting conditions, background clutter, and operator behavior.

Object diversity: Include at least 10, and ideally 20 or more, distinct instances of each target object category, varying in size, color, material, and brand. If your task involves picking up cups, collect demonstrations with ceramic mugs, paper cups, plastic travel cups, glass cups, and metal cups. Each instance teaches the policy something different about the visual and physical properties of the category.

Position diversity: Vary object starting positions across the full reachable workspace, covering a grid of at least 30 x 40 cm. Include different orientations -- upright, tilted, rotated 90 degrees. If the policy only sees objects in the center of the workspace during training, it will fail at the edges during deployment.
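One way to guarantee spatial coverage is to cycle through a coarse grid of cells and jitter within each cell, rather than sampling uniformly and hoping the edges get visited. A minimal sketch, in which the cell counts and orientation set are illustrative assumptions:

```python
import random

# Assumed workspace grid (30 x 40 cm, per the text) split into coarse cells
# so that every cell is visited; cell counts and tilt set are assumptions.
GRID_X_CM, GRID_Y_CM = 30, 40
CELLS_X, CELLS_Y = 5, 6
ORIENTATIONS_DEG = [0, 45, 90, 180]

def sample_start_pose(episode_index):
    """Visit grid cells round-robin, then jitter uniformly inside the cell."""
    cell = episode_index % (CELLS_X * CELLS_Y)
    cx, cy = cell % CELLS_X, cell // CELLS_X
    x = (cx + random.random()) * (GRID_X_CM / CELLS_X)
    y = (cy + random.random()) * (GRID_Y_CM / CELLS_Y)
    yaw = random.choice(ORIENTATIONS_DEG)
    return x, y, yaw
```

Cycling through cells deterministically (rather than sampling cells at random) guarantees every region is represented even in small datasets.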

Lighting diversity: Collect under at least 3 distinct lighting conditions: warm overhead fluorescent, cool daylight from windows, and mixed or directional lighting. Lighting is one of the most common sources of deployment failure because it changes the appearance of every object and surface in the scene.

Operator diversity: Use at least 3 distinct operators, each contributing a roughly equal share of demonstrations. Each operator approaches the task differently -- different approach angles, grasp points, speeds, and recovery strategies. This diversity is valuable because it forces the policy to learn the task structure rather than a single person's idiosyncrasies.

Dimension 2: Consistency

Consistency means that the task definition, success criteria, and reset procedure are identical across all episodes. Inconsistency introduces ambiguity that the policy cannot resolve.

Success criteria must be binary and unambiguous. For pick-and-place tasks, success means the object is at the target location within a defined tolerance at the end of the episode. "Close enough" is not a criterion. Write the success criteria in your collection protocol and verify that all operators apply them the same way.

Reset procedure must be standardized. Between episodes, objects must be placed in new starting positions according to a defined randomization protocol, the workspace must be cleared of debris or displacement from the previous trial, and the robot must return to a consistent starting configuration. Sloppy resets introduce systematic biases -- objects that accumulate near certain locations because operators default to placing them there, background clutter that drifts across episodes.

Operator calibration is essential. Before an operator's demonstrations count toward the dataset, they should complete a calibration session of 2-4 hours to learn the teleoperation interface, internalize the success criteria, and develop consistent approach strategies. Track per-operator quality metrics (success rate, trajectory smoothness, episode duration) and provide feedback. Uncalibrated operators produce demonstrations that actively harm policy performance.
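Per-operator metrics can be aggregated with a few lines of bookkeeping. A sketch, where the episode record fields ("operator", "success", "duration_s") are illustrative assumptions about what your logging produces:

```python
from collections import defaultdict

def operator_report(episodes):
    """Aggregate per-operator success rate and mean episode duration.

    episodes: iterable of dicts with assumed keys
    "operator", "success", "duration_s".
    """
    stats = defaultdict(lambda: {"n": 0, "successes": 0, "duration": 0.0})
    for ep in episodes:
        s = stats[ep["operator"]]
        s["n"] += 1
        s["successes"] += int(ep["success"])
        s["duration"] += ep["duration_s"]
    return {
        op: {"success_rate": s["successes"] / s["n"],
             "mean_duration_s": s["duration"] / s["n"]}
        for op, s in stats.items()
    }
```

Reviewing this report after each calibration session gives operators concrete feedback before their episodes enter the dataset.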

Dimension 3: Completeness

Every episode must capture the full task from initial approach through final release, with all sensor streams synchronized and no missing data. Incomplete episodes corrupt the training signal in subtle ways that are difficult to diagnose after the fact.

No missing modalities. If your collection setup includes two cameras, joint encoders, and a force-torque sensor, every episode must have all four streams. A single episode with a dropped camera feed teaches the policy that the missing camera is sometimes zero, which confuses the visual processing pipeline.

Synchronized timestamps. All sensor streams must be time-aligned to within the control period (typically 5-20 ms). Misaligned streams create an inconsistent mapping between what the robot sees and what it does -- the action at time T is paired with the observation from time T minus 50 ms, producing a systematically shifted training signal. Verify synchronization automatically by checking timestamp alignment in every episode.
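Both the presence check and the synchronization check can run automatically on every episode at collection time. A minimal sketch, where the stream names and the 20 ms tolerance are assumptions standing in for your actual sensor setup:

```python
# Assumed stream names and sync tolerance -- substitute your own setup.
EXPECTED_STREAMS = {"cam_front", "cam_wrist", "joint_states", "wrench"}
SYNC_TOLERANCE_S = 0.020  # should match your control period

def check_episode(streams):
    """streams: dict mapping stream name -> list of timestamps in seconds.

    Returns (ok, reason). Fails on missing streams or any frame whose
    timestamps disagree across streams by more than the tolerance.
    """
    missing = EXPECTED_STREAMS - streams.keys()
    if missing:
        return False, f"missing streams: {sorted(missing)}"
    # Compare frame-by-frame timestamps across all streams.
    for ts in zip(*(streams[name] for name in sorted(EXPECTED_STREAMS))):
        if max(ts) - min(ts) > SYNC_TOLERANCE_S:
            return False, f"desync of {max(ts) - min(ts):.3f} s"
    return True, "ok"
```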

Full episode recording. Episodes that start mid-grasp (because recording was triggered late) or end before task completion (because recording was stopped early) are unusable for most policy training pipelines. Configure your collection system to start recording before the task begins and stop after the robot returns to its starting configuration.

Dimension 4: Accuracy

Accuracy covers the correctness of both the demonstrations themselves and the metadata attached to them.

Demonstration accuracy: Only fully successful episodes should be included in the imitation learning training set. The performance impact of including failed demonstrations is dramatic -- adding even 10% failed demonstrations to a training set typically causes a 20-30% drop in policy success rate. The mechanism is clear: the policy learns that "almost grasping" or "dropping halfway" is an acceptable terminal state. Filter rigorously: binary success classification on every episode, with human review on borderline cases.
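The filtering policy above can be expressed as a simple triage over classifier scores. A sketch, in which the two thresholds are assumed values you would tune against human-labeled episodes:

```python
# Assumed thresholds: scores above KEEP_ABOVE are clear successes,
# scores between the two go to human review, the rest are discarded.
KEEP_ABOVE, REVIEW_ABOVE = 0.9, 0.5

def triage(scored_episodes):
    """scored_episodes: iterable of (episode_id, success_score) pairs."""
    keep, review, discard = [], [], []
    for ep_id, score in scored_episodes:
        if score >= KEEP_ABOVE:
            keep.append(ep_id)
        elif score >= REVIEW_ABOVE:
            review.append(ep_id)   # borderline: route to human review
        else:
            discard.append(ep_id)
    return keep, review, discard
```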

Trajectory quality: Demonstrations should be smooth, deliberate, and efficient. Jerky trajectories -- caused by operator error, controller latency, or poor workspace ergonomics -- teach the policy to be jerky. Measure smoothness using the jerk metric (third derivative of joint positions), establish per-task baselines from your best operators, and discard demonstrations whose smoothness score falls below 70% of that baseline.
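The jerk metric can be approximated with third finite differences of the sampled joint positions. A sketch, where the score normalization (inverting jerk so that higher means smoother) is an illustrative choice, not a fixed standard:

```python
def mean_abs_jerk(positions, dt):
    """Mean absolute third finite difference of positions sampled at period dt."""
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2]
         + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sum(abs(j) for j in jerks) / len(jerks)

def smoothness_score(positions, dt):
    # Assumed normalization: invert jerk so that higher = smoother.
    return 1.0 / (1.0 + mean_abs_jerk(positions, dt))

def passes_filter(positions, dt, baseline_score):
    # Keep demonstrations scoring at least 70% of the best-operator baseline.
    return smoothness_score(positions, dt) >= 0.7 * baseline_score
```

In practice you would apply this per joint and aggregate, but the per-trajectory structure is the same.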

Annotation accuracy: Success/failure labels, language instruction labels, task phase segmentation, and object identity tags must be correct. Incorrect annotations corrupt the training signal. Language instructions should be checked against actual task behavior ("pick up the red cup" should not be tagged on an episode where the operator picked up a blue bowl). Automated validation tools should flag mismatches between annotations and observations.

Dimension 5: Balance

A balanced dataset has roughly equal representation across conditions. Imbalance causes the policy to overfit to over-represented conditions and underperform on under-represented ones.

Object balance: If you have 15 object instances but 60% of demonstrations use 3 of them, the policy learns those 3 objects well and the other 12 poorly. Aim for equal demonstrations per object instance, with no more than a 2:1 ratio between the most-represented and least-represented instances.
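The 2:1 rule is a one-line check over per-instance episode counts. A minimal sketch:

```python
from collections import Counter

def balance_ratio(instance_ids):
    """Ratio between the most- and least-represented instance counts."""
    counts = Counter(instance_ids)
    return max(counts.values()) / min(counts.values())

def is_balanced(instance_ids, max_ratio=2.0):
    return balance_ratio(instance_ids) <= max_ratio
```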

Position balance: If most demonstrations start with objects in the center of the workspace, the policy will be weak at workspace edges. Use a defined grid or randomization scheme that ensures spatial coverage.

Difficulty balance: Approximately 10-15% of demonstrations should cover deliberately challenging scenarios: objects at the edge of the reachable range, cluttered workspaces, unusual orientations, near-failure recoveries. These edge cases dramatically improve robustness without requiring proportionally more data. Under-representation of edge cases is one of the most common causes of unexpected deployment failures.

Dimension 6: Format

Data format determines how easily the dataset integrates with training pipelines, how efficiently it can be stored and transferred, and whether it is compatible with community standards.

Use established formats. HDF5 or Zarr for raw episode storage. The LeRobot HuggingFace format for sharing and community compatibility. Open X-Embodiment schema for cross-embodiment research. Custom formats create friction for every downstream consumer and are the leading cause of "we collected data but can't use it" failures.

Include complete metadata. Every episode should include: task description, success/failure label, language instruction, robot platform identifier, camera intrinsics and extrinsics, collection date, operator identifier, and any environment conditions that varied (lighting setup, table surface, object instance IDs). This metadata enables filtering, stratified analysis, and targeted retraining on specific conditions.
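The metadata list above maps naturally onto a typed per-episode record. A sketch, where the field names and types are illustrative assumptions rather than a fixed community schema:

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeMetadata:
    # Field names are assumptions mirroring the metadata list in the text.
    task_description: str
    success: bool
    language_instruction: str
    robot_platform: str
    camera_intrinsics: dict
    camera_extrinsics: dict
    collection_date: str          # e.g. ISO 8601 "2024-05-01"
    operator_id: str
    environment: dict = field(default_factory=dict)  # lighting, surface, object IDs
```

A typed record like this makes stratified filtering ("all episodes by operator X under lighting condition Y") a trivial query instead of a log-parsing exercise.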

Validate format automatically. Run schema validation on every episode at write time. Catching format errors during collection is vastly cheaper than discovering them during training, when an engineer spends hours debugging why the dataloader crashes on episode 3,847.
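A write-time gate can be as simple as rejecting any episode whose metadata is missing or empty before it is persisted. A sketch, where the required-key list is an assumption mirroring the metadata described above:

```python
# Assumed required keys -- mirror whatever your episode schema defines.
REQUIRED_KEYS = [
    "task_description", "success", "language_instruction", "robot_platform",
    "camera_intrinsics", "camera_extrinsics", "collection_date", "operator_id",
]

def validate_episode(metadata):
    """Raise at write time if any required field is missing or empty."""
    bad = [k for k in REQUIRED_KEYS if metadata.get(k) in (None, "", {})]
    if bad:
        raise ValueError(f"episode failed schema validation: {bad}")
```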

The 10-Point Data Quality Checklist

  1. At least 10 distinct object instances per target category
  2. At least 3 lighting conditions represented
  3. At least 3 operators contributing roughly equal episodes
  4. Written success criteria applied consistently across all episodes
  5. Standardized reset protocol documented and followed
  6. All sensor streams present and synchronized in every episode
  7. Full episode capture from approach through completion
  8. Binary success classification with human review on borderline cases
  9. Trajectory smoothness filtered against per-task baselines
  10. Data stored in an established format (HDF5/Zarr/LeRobot) with complete metadata

If your dataset passes all 10 points, it is production-quality data suitable for training reliable policies. If it fails on any point, fix that dimension before collecting more data -- more data with the same quality problem just produces more of the same problem.
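The 10-point checklist can itself be automated as a gate over aggregate dataset statistics. A sketch, in which the statistic field names are assumptions about what your quality pipeline reports:

```python
def checklist_failures(stats):
    """Return the names of failed checklist points, given dataset stats.

    The keys of `stats` are assumed field names, not a fixed schema.
    """
    checks = {
        "1. >=10 object instances": stats["object_instances"] >= 10,
        "2. >=3 lighting conditions": stats["lighting_conditions"] >= 3,
        "3. >=3 operators": stats["operators"] >= 3,
        "4. written success criteria": stats["has_success_criteria_doc"],
        "5. standardized reset protocol": stats["has_reset_protocol_doc"],
        "6. streams present and synced": stats["all_streams_synced"],
        "7. full episode capture": stats["full_episodes"],
        "8. reviewed success labels": stats["success_labels_reviewed"],
        "9. smoothness filtered": stats["smoothness_filtered"],
        "10. established format": stats["format"] in ("hdf5", "zarr", "lerobot"),
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty return value means the dataset passes all 10 points; otherwise the failing dimensions are named so they can be fixed before further collection.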

How SVRC Ensures Quality in Managed Collection

All demonstration data collected through SVRC data services passes through an automated quality pipeline that enforces every dimension described above. The pipeline includes: automated success classification using learned classifiers, smoothness scoring with operator-specific baselines, coverage analysis in visual embedding space to verify object and environmental diversity, timestamp synchronization verification, schema validation on every episode, and human review on all borderline cases.

You receive quality-certified data with per-episode quality scores, aggregate coverage statistics showing diversity across all axes, and a quality report that documents how the dataset performs on each of the six dimensions. This certification means you can train with confidence that your data meets the standard -- rather than discovering quality problems three weeks into a training run.

For teams building their own collection infrastructure, SVRC also offers quality audits on externally collected datasets. We apply the same quality pipeline to your data and provide a detailed report identifying which dimensions need improvement. Contact the SVRC team to discuss your data quality needs.