Rocky's Open-Source Build Thread (AI for Industry Challenge)

Hi @Hui_Liu

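You can follow the scene description guide, aic/docs/scene_description.md in the intrinsic-dev/aic repository on GitHub. The scene is configured through the parameters you pass to the entrypoint script: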

/entrypoint.sh [parameters]

If you open this script, you will see that it calls:

exec ros2 launch aic_bringup aic_gz_bringup.launch.py "$@"

The “$@” at the end means that every parameter you pass to entrypoint.sh is forwarded, unchanged, to the ros2 launch command.

That’s why, when you run /entrypoint.sh ground_truth:=true, that ground_truth:=true ends up in the ROS 2 launch file.

Here is an example with a nic_card_0 and an sc_port_0 (with an image):

/entrypoint.sh spawn_task_board:=true \
    task_board_x:=0.3 task_board_y:=-0.1 task_board_z:=1.2 \
    task_board_roll:=0.0 task_board_pitch:=0.0 task_board_yaw:=0.785 \
    sfp_mount_rail_0_present:=true sfp_mount_rail_0_translation:=-0.08 \
    sc_mount_rail_0_present:=true sc_mount_rail_0_translation:=-0.09 \
    nic_card_mount_0_present:=true nic_card_mount_0_translation:=0.005 \
    sc_port_0_present:=true sc_port_0_translation:=-0.04 \
    spawn_cable:=true cable_type:=sfp_sc_cable attach_cable_to_gripper:=true \
    ground_truth:=true start_aic_engine:=false
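
If you generate these commands from a script, the name:=value pairs can be built from a dictionary. This is only a minimal sketch: `launch_args` is a hypothetical helper (it is not part of the aic repository), and it does nothing more than format the arguments that /entrypoint.sh forwards through "$@".

```python
def launch_args(params):
    """Format a dict as launch arguments of the form name:=value."""
    args = []
    for name, value in params.items():
        if isinstance(value, bool):
            # The commands in this thread use lowercase true/false.
            value = "true" if value else "false"
        args.append(f"{name}:={value}")
    return args


# A subset of the parameters from the example above.
scene = {
    "spawn_task_board": True,
    "task_board_x": 0.3,
    "task_board_yaw": 0.785,
    "ground_truth": True,
    "start_aic_engine": False,
}
command = ["/entrypoint.sh", *launch_args(scene)]
print(" ".join(command))
```

Keeping the scene in a dictionary makes it easier to vary one parameter at a time (for example task_board_yaw) when generating several scenes.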


You might also want to look into the difference in mechanical properties between the sfp and sc connectors, which are defined in their respective model files.

oh, thanks a lot! totally missed it! I see it also mentions taring the force/torque sensors

Thanks a lot for this amazing guide! I’m a bit confused though, is this the sensor taring that the documentation was referring to or something else?

Thanks for the tip! I’ll look into that (probably after Wednesday? I’m swamped in midterms week lollll)

Yes, it is. The taring is done with pixi run ros2 service call /aic_controller/tare_force_torque_sensor std_srvs/srv/Trigger {}
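
For anyone new to the term, here is a conceptual sketch of what taring means in general. It is a toy model, not the aic_controller implementation: the class name, the averaging over several samples, and the six-component wrench layout are all assumptions made for illustration.

```python
class TaredSensor:
    """Toy model of taring: store the reading at rest as a zero offset."""

    def __init__(self):
        self.offset = [0.0] * 6  # fx, fy, fz, tx, ty, tz

    def tare(self, raw_samples):
        # Average a few raw wrench samples taken while the arm is at rest.
        n = len(raw_samples)
        self.offset = [sum(s[i] for s in raw_samples) / n for i in range(6)]

    def read(self, raw):
        # Report the raw wrench minus the stored offset.
        return [r - o for r, o in zip(raw, self.offset)]


sensor = TaredSensor()
sensor.tare([[1.0, 2.0, 3.0, 0.0, 0.0, 0.0], [3.0, 2.0, 1.0, 0.0, 0.0, 0.0]])
print(sensor.read([2.0, 2.0, 12.0, 0.0, 0.0, 0.0]))
```

In this model the offset is a snapshot, so anything that changes the static load after taring still shows up in the reading.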


Thanks a lot. Very useful guide.

wow! amazing! my force sensor data looks much better now. although I still see two cases, and I wonder if you saw them on your side too:

  1. force_z still shows a very high peak before the insertion is done. Visually, it seems to happen when the TCP is on top of the port, so maybe the gripper hits something? I thought the CheatCode policy was supposed to be perfect, so I am a bit surprised that it actually hits something.
  2. I thought the offset should be constant after taring, but the offset values actually change over time… why is that?

Hi @Hui_Liu

Honestly, looking at raw data without visual confirmation is extremely difficult to diagnose, as many different physical or software issues can manifest as similar-looking force spikes.

I’ve been investigating some performance and data quality issues in the simulation pipeline over the past few days, and I wanted to share some of my findings. I will finish with visual examples and sensor data to make it more concrete and try to get closer to your comment with my data.

Before I start recording again, I’ve been doing a deep dive into my own dataset recordings, motivated by the poor performance of a π0.5 model I trained. Initially, I used mostly default values, assuming the standard 30Hz recording frequency was fine.

However, after updating to LeRobot 0.5.0, I started seeing massive “Record loop is running slower” warnings:

WARNING cord_050.py:446 Record loop is running slower (1.2 Hz) than the target FPS (30 Hz).
WARNING cord_050.py:446 Record loop is running slower (4.5 Hz) than the target FPS (30 Hz).

This led me to investigate the source code and documentation, where I realized the simulated Basler cameras are actually capped at 20fps :smile:

# From aic_robot.py
"center_camera": ROS2CameraConfig(name="center_camera", fps=20, width=1152, height=1024, ...)

Recording at 30Hz when the source is 20Hz is a recipe for disaster. I even tried a “middle ground” at 15Hz, but LeRobot threw assertion errors for about 70% of my frames because the timestamps didn’t align with the video.

Conclusion: recording frequency must be a divisor of the camera’s native 20Hz (e.g., 20, 10, 5, or 4 Hz).
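
The divisor rule is easy to check numerically. A minimal sketch (both function names are mine, for illustration only): a recording rate is compatible when every recorded frame corresponds to a whole number of camera frames.

```python
def compatible_rates(native_hz):
    """Return the integer recording rates that divide the camera rate evenly."""
    return [hz for hz in range(native_hz, 0, -1) if native_hz % hz == 0]


def frames_per_sample(native_hz, record_hz):
    """Camera frames per recorded frame; a whole number only when compatible."""
    return native_hz / record_hz


print(compatible_rates(20))       # [20, 10, 5, 4, 2, 1]
print(frames_per_sample(20, 10))  # 2.0, every second camera frame
print(frames_per_sample(20, 15))  # not a whole number, so timestamps drift
```

At 30Hz the ratio is below one, meaning some recorded frames have no new camera frame behind them at all.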

Even so, reaching a stable 20Hz is tough. Checking the ROS topic frequency reveals a significant drop over time:

$ pixi run ros2 topic hz /left_camera/image
average rate: 19.380
        min: 0.041s max: 0.056s std dev: 0.00315s window: 19
...
average rate: 14.013
	min: 0.033s max: 0.216s std dev: 0.03566s window: 1573

I’ve settled on 10Hz as a compromise for precision tasks like insertion. 4Hz or 5Hz feels too low for the “last centimeter” reactivity required here.

The “Cold Start” Problem
I also noticed that the worst lag spikes happen at the very beginning of the recording. As you can see in my logs below, the Obs (observation) time can spike to 798ms (about 0.8 seconds!) for a single frame:

[INFO] [1775219197.456747804] [aic_cheatcode_bridge_teleop]: ▶️  Starting trial: trial_004
WARNING 2026-04-03 14:26:37 cord_050.py:446 Record loop is running slower (5.7 Hz) than the target FPS (10 Hz). 
Dataset frames might be dropped and robot control might be unstable. Common causes are: 1) Camera FPS not keeping up 2) Policy inference taking too long 3) CPU starvation
INFO 2026-04-03 14:26:37 cord_050.py:455 TIMING - Obs: 55.3ms, Proc: 0.0ms, Frame: 0.0ms, Teleop: 63.2ms, Dataset: 55.7ms
INFO 2026-04-03 14:26:37 cord_050.py:455 TIMING - Obs: 35.5ms, Proc: 0.0ms, Frame: 0.0ms, Teleop: 0.3ms, Dataset: 5.6ms
WARNING 2026-04-03 14:26:38 cord_050.py:446 Record loop is running slower (1.2 Hz) than the target FPS (10 Hz). 
Dataset frames might be dropped and robot control might be unstable. Common causes are: 1) Camera FPS not keeping up 2) Policy inference taking too long 3) CPU starvation
INFO 2026-04-03 14:26:38 cord_050.py:455 TIMING - Obs: 798.3ms, Proc: 0.0ms, Frame: 0.0ms, Teleop: 0.3ms, Dataset: 53.6ms
WARNING 2026-04-03 14:26:38 cord_050.py:446 Record loop is running slower (4.5 Hz) than the target FPS (10 Hz). 
Dataset frames might be dropped and robot control might be unstable. Common causes are: 1) Camera FPS not keeping up 2) Policy inference taking too long 3) CPU starvation
INFO 2026-04-03 14:26:38 cord_050.py:455 TIMING - Obs: 164.2ms, Proc: 0.0ms, Frame: 0.0ms, Teleop: 0.5ms, Dataset: 55.7ms
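
The TIMING lines can be turned into a loop rate directly. The sketch below assumes the stages run one after another, so that their sum approximates the loop period; that assumption is mine, but it reproduces the 5.7, 1.2 and 4.5 Hz figures from the warnings above. `loop_rate_hz` is a hypothetical helper, not a LeRobot function.

```python
import re

# Matches pairs such as "Obs: 798.3ms" on a TIMING line.
TIMING_RE = re.compile(r"(\w+): ([\d.]+)ms")


def loop_rate_hz(timing_line):
    """Sum the per-stage times on a TIMING line and convert to a loop rate."""
    stages = {name: float(ms) for name, ms in TIMING_RE.findall(timing_line)}
    total_ms = sum(stages.values())
    return stages, 1000.0 / total_ms


line = "TIMING - Obs: 798.3ms, Proc: 0.0ms, Frame: 0.0ms, Teleop: 0.3ms, Dataset: 53.6ms"
stages, hz = loop_rate_hz(line)
print(max(stages, key=stages.get), round(hz, 1))  # Obs 1.2
```

Reporting the slowest stage per frame makes it clear that the observation step dominates these spikes, rather than teleop or dataset writing.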

I suspect this is due to the massive bandwidth of 3x 1152x1024 images (1.2MB each) saturating the Zenoh bridge, or perhaps the overhead of initializing the NVENC GPU encoder session. After a few seconds, it stabilizes, and the warnings disappear. Has anyone found a way to mitigate this initial “warm-up” lag?
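
A quick back-of-the-envelope estimate of the raw image bandwidth. The pixel encoding is an assumption here: the 1.2MB figure corresponds to one byte per pixel, while a three-byte format such as rgb8 would triple it.

```python
WIDTH, HEIGHT = 1152, 1024
CAMERAS = 3
FPS = 20  # camera rate from aic_robot.py


def stream_mb(bytes_per_pixel):
    """Return (MB per frame, MB per second across all cameras), uncompressed."""
    frame_bytes = WIDTH * HEIGHT * bytes_per_pixel
    return frame_bytes / 1e6, frame_bytes * CAMERAS * FPS / 1e6


for label, bpp in [("1 byte/pixel", 1), ("3 bytes/pixel (e.g. rgb8)", 3)]:
    frame_mb, total_mb_s = stream_mb(bpp)
    print(f"{label}: {frame_mb:.2f} MB per frame, {total_mb_s:.1f} MB/s total")
```

Under these assumptions the three cameras produce roughly 71 MB/s at one byte per pixel and roughly 212 MB/s at three bytes per pixel, before any compression.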

RTF and Force Spikes
Another critical factor is the Real Time Factor (RTF). In my tests, RTF starts at ~95% but drops to 30-40% during the actual insertion phase. This effectively puts the physics in “slow motion” while the recorder keeps ticking, which I fear might corrupt the learning of robot dynamics.

You can see how the RTF starts at 98.39% and, by the one-minute mark (1:00), has dropped to 24.93%.
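
To make the “slow motion” effect concrete, here is a small calculation. It assumes the recorder is paced by wall-clock time (which is what “keeps ticking” implies) and uses a 10Hz recording rate; `sim_dt_per_frame` is a name I made up.

```python
def sim_dt_per_frame(record_hz, rtf):
    """Simulated seconds that elapse between two recorded frames."""
    wall_dt = 1.0 / record_hz
    return wall_dt * rtf


for rtf in (0.9839, 0.2493):
    dt = sim_dt_per_frame(10, rtf)
    print(f"RTF {rtf:.2%}: {dt * 1000:.1f} ms of sim time per frame")
```

At an RTF of 24.93%, consecutive frames are only about 25 ms apart in simulated time instead of 100 ms, so the same physical motion appears roughly four times slower in the dataset.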

Regarding the force peaks, I am seeing them too. In the video recording, I hit a 44N peak in Force Z. By using a custom visualizer to sync the dataset with the frames, I think the culprit is the cable.

[Images: left camera, right camera, center camera, and Gazebo views]

As you can see in the global view at Frame 464, the cable gets severely kinked into an “N” shape against the board. This creates massive mechanical tension that forces the connector out of alignment, causing it to hit the rim of the port. Even though the “CheatCode” policy is sending the correct coordinates, the physics of the cable is fighting back.

Some questions that come to my mind:

  • What is the community’s “sweet spot” FPS for this task?
  • Has anyone else seen these massive get_observation spikes specifically at the start of recordings?
  • How do you handle the trade-off between raw resolution (1152x1024) and maintaining a stable 20Hz / RTF 1.0?

Regards!


Hi!

I’ve been following issues:

And PR:

and I believe they are closely related to what I’m observing in my data collection.

I have performed a thorough audit of my initial 100-episode dataset (recorded with LeRobot 0.4.3 at the default 30Hz) and found several critical issues:

  • Temporal Instability: Many episodes contain between 14% and 33% duplicate frames, which is expected given that the camera output is capped at 20fps, making 30Hz an incompatible recording frequency. This level of inconsistency makes the dataset quite unstable for learning precise temporal dynamics.
  • Massive Force Spikes: I’m seeing high-magnitude force spikes in the Z-axis, reaching values as extreme as -124N and -269N in a single frame during the insertion phase (as shown in the next images). This appears to be a failure in the simulation physics/solver.
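
The upper end of that duplicate range matches what an idealized model predicts. The sketch below assumes perfectly periodic camera and recorder clocks with no jitter, and that each recorded sample simply takes the newest available camera frame; `duplicate_fraction` is a hypothetical helper.

```python
def duplicate_fraction(camera_hz, record_hz, n_samples=300):
    """Fraction of recorded frames that repeat the previous camera frame."""
    duplicates = 0
    previous = None
    for k in range(n_samples):
        # Index of the newest camera frame available at time k / record_hz.
        newest = (k * camera_hz) // record_hz
        if newest == previous:
            duplicates += 1
        previous = newest
    return duplicates / n_samples


print(f"{duplicate_fraction(20, 30):.1%}")  # 33.3%
print(f"{duplicate_fraction(20, 10):.1%}")  # 0.0%
```

Sampling a 20fps source at 30Hz repeats one frame in every three, which is where the 33% ceiling comes from.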

My main concerns are:

  • How will these physics “explosions” affect the official evaluation?
  • Is it worth continuing with a large-scale data collection right now, or would it be better to wait until the environment and physics are stabilized?

Training a VLA model on -270N impacts and inconsistent frame rates seems counter-productive for achieving high-precision insertion. I’d love to hear your thoughts on whether these issues are being addressed before the final benchmark.
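
In the meantime, one way to audit a dataset for such spikes is to flag the frames where the Z force exceeds a limit. This is a minimal sketch only: the 100 N limit is an arbitrary assumption, the helper name is hypothetical, and the sample values are made up for illustration (apart from reusing the two spike magnitudes mentioned above).

```python
def flag_force_spikes(force_z, limit_n=100.0):
    """Return the frame indices where |Fz| exceeds limit_n newtons."""
    return [i for i, fz in enumerate(force_z) if abs(fz) > limit_n]


# Illustrative values only, not real recorded data.
episode_fz = [-2.1, -3.0, -124.0, -4.2, -269.0, -3.8]
spikes = flag_force_spikes(episode_fz)
print(spikes)  # [2, 4]
```

Episodes with flagged frames could then be reviewed visually before deciding whether to keep them for training.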

Best regards,


Hi, @Rocky_Shao. When I run the LeRobot teleop command, a ComponentLike import error occurs. I also added rerun-sdk to the pixi.toml file, and it installed successfully.

Did you get the same error while running the LeRobot teleop?

Hello there!

I don’t remember having import issues when running the provided LeRobot teleop commands.
I think I just followed the installation guide from the competition GitHub page, and my environment is an Ubuntu 24.04 LTS dual boot.

This might be relevant (or might not, not tooo sure): when I change some code in the repo, I increase this number:

and do a pixi-install of the whole workspace.

To be honest, it’s also been a while (almost a month) since I ran the provided LeRobot script, and I have very little memory of what exact issues I had (if any at all) :sweat_smile:

Dumb Idea

(I also suggest letting Claude run on autopilot and seeing if it can try-error-getLucky-fix the error haha)


This is a great thread. As someone who has experience with LeRobot/So100 arms but not much with ros2/gazebo, I am finding it very helpful. So far I have the AIC environment set up, with the default ACT policy working with the Gazebo integration. As I get to finetuning and trying out pi0.5 and other models, I have a couple of n00b questions for folks active in this thread. @jlamperez @Rocky_Shao

  1. I saw @Rocky_Shao's dataset, rockyshao22/Intrinsic_AI on Hugging Face. It has 3 episodes. Was it generated using the CheatCode policy, or by following aic/aic_utils/lerobot_robot_aic/README.md (intrinsic-dev/aic on GitHub)? If CheatCode, what are the steps, and is the output of running the CheatCode policy automatically stored in LeRobot training format?
  2. Do people manually teleoperate (apart from using the CheatCode policy) to generate training data? Any recommendations or best practices?

Thanks!

So far I have figured out the following to set up the scene (I need to play around with the parameters, as the scene is not correct yet):

/entrypoint.sh \
spawn_task_board:=true \
task_board_x:=0.3 task_board_y:=-0.1 task_board_z:=1.2 \
spawn_cable:=true cable_type:=sfp_sc_cable attach_cable_to_gripper:=true \
ground_truth:=true start_aic_engine:=false

for recording:

pixi run lerobot-record \
--robot.type=aic_controller --robot.id=aic \
--teleop.type=aic_keyboard_ee --teleop.id=aic \
--robot.teleop_target_mode=cartesian --robot.teleop_frame_id=base_link \
--dataset.repo_id=pranavsaroha/aic_cable_insertion \
--dataset.single_task="Insert SFP cable into SC port" \
--dataset.num_episodes=2 \
--dataset.push_to_hub=true \
--dataset.private=true \
--play_sounds=false \
--display_data=true

Looks like I am missing two things:

  1. set up the scene board properly for the task “Insert SFP cable into SC port”
  2. learn teleop with keys (I am much more used to the so100 leader..)

Ideally, it would be great if the CheatCode policy’s inference output could become the rows of training data in LeRobot format, instead of manual teleoperation. That’s my next goal.. I’d appreciate any pointers there

Sharing my experience so far with the CheatCode policy. Here are the steps:

  1. Run the CheatCode policy like the other policies (e.g. ACT)

  2. Run the ros2 bag command to record. The important piece is RMW_IMPLEMENTATION=rmw_zenoh_cpp; otherwise ros2 bag does not record, because the AIC simulation uses Zenoh as its ROS 2 middleware (RMW) rather than a DDS implementation.

    RMW_IMPLEMENTATION=rmw_zenoh_cpp ros2 bag record \
      /observations \
      -o ~/bags/cheatcode_episode_001
    
  3. Convert the ROS 2 bag to LeRobot format

  4. In the container, run the script to create the LeRobot dataset

One thing I noticed: the episode boundaries are a bit off, i.e., the first 15 seconds of the second episode are the same as the end of trial 1.
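
One possible way to tighten the boundaries is to split messages by trial start time rather than by bag file. This is a sketch under assumptions: it presumes you can obtain the trial start timestamps (for example from the “Starting trial” log lines) on the same clock as the bag messages, and `assign_episodes` is a hypothetical helper.

```python
from bisect import bisect_right


def assign_episodes(message_times, trial_starts):
    """Map each message timestamp to the index of the trial it belongs to.

    Messages recorded before the first trial start get index -1.
    """
    starts = sorted(trial_starts)
    return [bisect_right(starts, t) - 1 for t in message_times]


trial_starts = [100.0, 160.0]  # hypothetical trial start times (s)
message_times = [99.5, 100.0, 130.2, 159.9, 160.0, 175.0]
print(assign_episodes(message_times, trial_starts))  # [-1, 0, 0, 0, 1, 1]
```

Messages tagged -1 can be dropped, and each remaining group becomes one episode.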

I still need to figure out whether the action values are right. Claude ended up suggesting using Action = tcp_velocity from the observation. Its reasoning: CheatCode uses MODE_POSITION, and motion_update.velocity is always zero.

Baby steps..

The next step is to figure out a strategy for generating more comprehensive scenarios, so that CheatCode can produce the ground truth.

Hello there! Thanks for finding this build thread helpful.

(The content below was formatted with AI, but the first version was typed by a human (me ofc) and the result was proofread by a human. Also, please note this is my first year learning ROS as a college freshman, so take my info with large chunks of salt loll)

I am running a modified version of the CheatCode policy that integrates with the provided LeRobot manual teleop script to collect data. Let’s call this “custom-cheatcode”.

This “custom-cheatcode” can control the robot to insert the plug into the port, just like the provided CheatCode policy, using ground-truth TF frames, and it also automatically records the data in LeRobot format. (“custom-cheatcode” can also upload data automatically to Hugging Face, with details described below.)

:laptop: Source Code

Here is the source code for the custom-cheatcode:

https://github.com/Rocky0Shao/IntrinsicAIChallenge/blob/76a4afcaa91fddb7f89c2280fb2701ff23dc6093/aic_utils/lerobot_robot_aic/lerobot_robot_aic/aic_teleop.py#L356


:rocket: How to Run “custom-cheatcode”

Terminal 1: Run these two bash files. The first, “start_docker.sh”, enters Docker, and the second, “start_scoring.sh”, sets up a scene (based on the test scoring scene).

https://github.com/Rocky0Shao/IntrinsicAIChallenge/blob/76a4afcaa91fddb7f89c2280fb2701ff23dc6093/mystuff/custom_commands/start_docker.sh#L1

https://github.com/Rocky0Shao/IntrinsicAIChallenge/blob/76a4afcaa91fddb7f89c2280fb2701ff23dc6093/mystuff/custom_commands/start_scoring.sh#L1

Note on my setup:
“start_scoring.sh” sets up a scene based on the layout of boards/plugs from the official scoring script.

I was originally hoping to use “start_scoring.sh” to mass-generate and save multiple different scenes for data collection. Instead, I got fixated on getting a custom-cheatcode robust enough to insert both plug types into both port types with a 100% first-try success rate. I haven’t actually accomplished this yet, and I haven’t started “training” at all.

Terminal 2: In a separate terminal, run this bash command to launch the “custom-cheatcode”:

https://github.com/Rocky0Shao/IntrinsicAIChallenge/blob/main/mystuff/custom_commands/teleop.sh


:warning: A Quick Disclaimer & Status Update

Just a heads up: Don’t use my Hugging Face data. That dataset was strictly just to test if I could actually collect data using this custom script.

Here is the bash command I used to collect training data in LeRobot format and upload it to Hugging Face using “custom-cheatcode”:

https://github.com/Rocky0Shao/IntrinsicAIChallenge/blob/76a4afcaa91fddb7f89c2280fb2701ff23dc6093/mystuff/custom_commands/record_training_data.sh#L1

(Note that there are some nuances: you have to make a Hugging Face account and log in via the terminal, plus other small things I can’t remember. However, I got this working with lots of help from AI.)

Last month I got stuck at the step of automatically collecting high-quality training data using the custom-cheatcode. Then finals week hit (which I’m in the middle of right now), so I haven’t worked on this project for a couple of weeks.

Good luck with your setup!

Hot Take Advice

Give the AI the link to this Discourse thread and the links to the files I sent above.
Then let the AI provide a rundown.