Been building with LeRobot v3 and kept hitting the same wall: retrain a policy, it gets worse, and there's no clear reason why. DVC tells you which files changed. MLflow tells you which run. Neither tells you which tasks dropped or which episodes degraded between dataset versions.
Built a small open-source library to fill that gap.
Four commands
episodevault track ./my_dataset
episodevault commit -m "added kitchen episodes"
episodevault diff v1.0 v2.0
episodevault blame model_v3
What the diff looks like
Ran this against two real LeRobot datasets:
Dataset diff: v1.0 → v2.0
────────────────────────────────────────────────────
Episodes added: +0
Episodes removed: -7
Distribution shift:
factory_pick 2 → 6 ↑ 200% ⚠️
kitchen_grasp 4 → 1 ↓ 75% ⚠️
Quality metrics:
avg episode length: 3.7s → 3.0s ↓
success_rate: 0.88 → 0.38 ↓
camera_sync_score: 1.00 → 1.00 →
Likely regression cause:
'kitchen_grasp' episodes dropped 75% (4 → 1). Restore from prior
version if this task is in your eval benchmark.
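For anyone curious how a distribution-shift check like the one above can be computed, here is a minimal sketch. It is not EpisodeVault's actual implementation: the `task_shift` function, the 50% flag threshold, and the input format (one task label per episode) are all assumptions made for illustration. The counts are taken from the diff output above.

```python
from collections import Counter


def task_shift(old_tasks, new_tasks, threshold=50.0):
    """Compare per-task episode counts between two dataset versions.

    Hypothetical helper: takes one task label per episode for each version
    and returns, per task, the counts, the percent change, and a flag.
    """
    old_counts, new_counts = Counter(old_tasks), Counter(new_tasks)
    report = {}
    for task in sorted(set(old_counts) | set(new_counts)):
        before, after = old_counts[task], new_counts[task]
        # Percent change is undefined for a task that did not exist before.
        pct = None if before == 0 else (after - before) / before * 100.0
        # Flag brand-new tasks and any change at or above the threshold.
        flagged = pct is None or abs(pct) >= threshold
        report[task] = {
            "before": before,
            "after": after,
            "pct": pct,
            "flagged": flagged,
        }
    return report


# Per-task episode counts from the diff output above.
old = ["factory_pick"] * 2 + ["kitchen_grasp"] * 4
new = ["factory_pick"] * 6 + ["kitchen_grasp"] * 1

shift = task_shift(old, new)
for task, row in shift.items():
    print(task, row["before"], "->", row["after"], row["pct"], row["flagged"])
```

A fixed percentage threshold is the simplest choice. With episode counts this small, a single added or removed episode can trip it, so a real check would probably also want a minimum absolute change.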
The blame command
One line in your training script:
import episodevault as ev
ev.log_training_run(model_version="model_v3", dataset_version="v2.0")
Then later:
episodevault blame model_v3
This traces the model back to the exact dataset version it was trained on and shows the diff automatically.
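The idea behind the lookup is a mapping from model version to dataset version that gets written at training time and read back later. A rough sketch of that idea, assuming a plain JSON file as the store: `record_run`, `lookup_dataset`, and the `runs.json` filename are hypothetical names for illustration, not EpisodeVault's API or on-disk format.

```python
import json
import tempfile
from pathlib import Path


def record_run(registry: Path, model_version: str, dataset_version: str) -> None:
    """Store which dataset version a model was trained on (hypothetical helper)."""
    runs = json.loads(registry.read_text()) if registry.exists() else {}
    runs[model_version] = dataset_version
    registry.write_text(json.dumps(runs, indent=2))


def lookup_dataset(registry: Path, model_version: str) -> str:
    """Return the dataset version recorded for a model (hypothetical helper)."""
    runs = json.loads(registry.read_text())
    if model_version not in runs:
        raise KeyError(f"no training run recorded for {model_version!r}")
    return runs[model_version]


# Usage: record the run at training time, resolve it later.
with tempfile.TemporaryDirectory() as tmp:
    registry = Path(tmp) / "runs.json"  # assumed location, for illustration only
    record_run(registry, "model_v3", "v2.0")
    found = lookup_dataset(registry, "model_v3")
    print(found)
```

Once the dataset version is known, the remaining step is to diff it against an earlier version, which is what the diff command already does.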
Compatibility
Tested against four real datasets from the Hub:
| Robot | Dataset | Episodes | Frames | Parse time |
|---|---|---|---|---|
| aloha | aloha_static_pro_pencil | 25 | 8,750 | 0.35s |
| aloha | aloha_mobile_shrimp | 18 | 67,500 | 0.38s |
| so100 | svla_so100_stacking | 56 | 22,956 | 0.63s |
| aloha | aloha_mobile_cabinet | 85 | 127,500 | 2.73s |
Install
pip install episodevault
Requires Python 3.10+. Works on any local LeRobot v3 dataset.
GitHub: Rohan-Prabhakar/EpisodeVault
If you have a dataset where this breaks or gives a wrong regression hint, open an issue. That's the most useful feedback right now.