EpisodeVault: open source tool to debug why your LeRobot model regressed

Been building with LeRobot v3 and kept hitting the same wall: retrain a policy, it gets worse, no clear idea why. DVC tells you which files changed. MLflow tells you which run produced the model. Neither tells you which tasks dropped or which episodes degraded between dataset versions.

Built a small open source library to fill that gap.


Four commands

episodevault track ./my_dataset
episodevault commit -m "added kitchen episodes"
episodevault diff v1.0 v2.0
episodevault blame model_v3


What the diff looks like

Ran this against two real LeRobot datasets:

Dataset diff: v1.0 → v2.0
────────────────────────────────────────────────────
Episodes added:    +0
Episodes removed:  -7

Distribution shift:
  factory_pick        2 → 6   ↑ 200%  ⚠️
  kitchen_grasp       4 → 1   ↓ 75%   ⚠️

Quality metrics:
  avg episode length:    3.7s → 3.0s  ↓
  success_rate:          0.88 → 0.38  ↓
  camera_sync_score:     1.00 → 1.00  →

Likely regression cause:
  'kitchen_grasp' episodes dropped 75% (4 → 1). Restore from prior
  version if this task is in your eval benchmark.
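
For anyone curious how a per-task shift like the one above can be computed, here is a minimal sketch in plain Python. It is not EpisodeVault's actual implementation: the function name `task_shift` and the input format (one task label per episode) are assumptions made up for illustration.

```python
from collections import Counter


def task_shift(old_tasks, new_tasks):
    """Return {task: (old_count, new_count, pct_change)} for every task seen.

    pct_change is None when the task is absent from the old version,
    because a percentage change from zero is undefined.
    """
    old_counts = Counter(old_tasks)
    new_counts = Counter(new_tasks)
    shift = {}
    for task in sorted(set(old_counts) | set(new_counts)):
        before, after = old_counts[task], new_counts[task]
        pct = None if before == 0 else (after - before) / before * 100.0
        shift[task] = (before, after, pct)
    return shift


# Hypothetical per-episode task labels matching the counts in the diff above.
v1 = ["factory_pick"] * 2 + ["kitchen_grasp"] * 4
v2 = ["factory_pick"] * 6 + ["kitchen_grasp"] * 1

for task, (before, after, pct) in task_shift(v1, v2).items():
    change = "new" if pct is None else f"{pct:+.0f}%"
    print(f"{task:<16}{before} → {after}   {change}")
```

With these inputs the percentages come out as +200% for factory_pick and -75% for kitchen_grasp, the same figures the diff reports.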


The blame command

One line in your training script:

import episodevault as ev
ev.log_training_run(model_version="model_v3", dataset_version="v2.0")

Then later:

episodevault blame model_v3

Traces the model back to the exact dataset version that trained it and shows the diff automatically.
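
The idea behind blame is a lineage lookup: record which dataset version each model was trained on, then resolve it later. Below is a rough sketch of that idea using a JSON file as the registry. This is an assumption about how such a lookup could work, not how EpisodeVault stores its data; `log_run`, `dataset_for`, and the `runs.json` file are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def _load(registry: Path) -> dict:
    """Read the registry file, or return an empty mapping if it is missing."""
    return json.loads(registry.read_text()) if registry.exists() else {}


def log_run(registry: Path, model_version: str, dataset_version: str) -> None:
    """Record which dataset version a model was trained on."""
    runs = _load(registry)
    runs[model_version] = dataset_version
    registry.write_text(json.dumps(runs, indent=2))


def dataset_for(registry: Path, model_version: str) -> str:
    """Look up the dataset version recorded for a model."""
    runs = _load(registry)
    if model_version not in runs:
        raise KeyError(f"no training run logged for {model_version!r}")
    return runs[model_version]


# Usage: log a run, then resolve the model back to its dataset version.
with tempfile.TemporaryDirectory() as tmp:
    registry = Path(tmp) / "runs.json"  # hypothetical registry location
    log_run(registry, "model_v3", "v2.0")
    print(dataset_for(registry, "model_v3"))
```

Once the dataset version is known, a diff against an earlier version can be run on it, which is the part the blame command automates.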


Compatibility

Tested against four real datasets from the Hub:

Robot  Dataset                  Episodes  Frames   Parse time
aloha  aloha_static_pro_pencil  25        8,750    0.35s
aloha  aloha_mobile_shrimp      18        67,500   0.38s
so100  svla_so100_stacking      56        22,956   0.63s
aloha  aloha_mobile_cabinet     85        127,500  2.73s

Install

pip install episodevault

Requires Python 3.10+. Works on any local LeRobot v3 dataset.


GitHub: Rohan-Prabhakar/EpisodeVault

If you have a dataset where this breaks or gives a wrong regression hint, please open an issue. That’s the most useful feedback right now.