My first Results: PI0.5 VLA Policy

Hi!

Here are my first results for the AI for Industry Challenge. I’ve been working on an end-to-end VLA (Vision-Language-Action) policy to tackle the autonomous cable insertion task.

In this video, the robotic arm uses a PI0.5 architecture to interpret a natural language command and attempts to insert an SC/SFP connector in a Gazebo simulation. Though, as you can clearly see, it doesn’t actually manage to plug any of them in yet! :joy: :rofl:

Technical Overview:

  • Architecture: PI0.5 (PaliGemma 2B + Gemma 300M Action Expert).
  • Software Stack: ROS 2 and the LeRobot framework.
  • Inference Hardware: NVIDIA RTX 5090.
  • Simulation: Gazebo Sim with RGB camera feeds and joint/wrench state feedback.
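As a rough, toy illustration of how the PI0-family action expert works (this is my own minimal sketch, not the actual Gemma 300M implementation): the action head is a flow-matching model that Euler-integrates a learned velocity field from Gaussian noise to an action chunk, conditioned on the VLM features. Here a hand-coded velocity field that pulls toward a known `target` stands in for the learned network:

```python
import random

# Toy stand-in for the learned velocity field of the action expert.
# In PI0-style models this is a transformer conditioned on VLM features
# and trained to predict the velocity WITHOUT seeing the target; here we
# hand-code the optimal velocity for the linear interpolation path
# x_t = (1 - t) * noise + t * target, which is (target - x_t) / (1 - t).
def velocity_field(action, t, target):
    return [(tgt - a) / (1.0 - t) for a, tgt in zip(action, target)]

def sample_action_chunk(target, steps=10, seed=0):
    """Euler-integrate the velocity field from pure noise (t=0) to t=1."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_field(action, t, target)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action
```

With the optimal velocity field, the integration lands exactly on the target action; the real model only approximates this, which is where training data quality comes in.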

The policy already shows a strong semantic understanding of the task and good spatial reaching capabilities. I’m currently focused on fine-tuning the real-time control loop and the transition between the approach and the final insertion phase to compensate for inference latency.
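One common way to compensate for inference latency with chunked policies (a sketch of the general technique, not necessarily my final control loop): when a new H-step action chunk arrives d control ticks after the observation it was computed from, drop the first d actions so execution stays aligned with real time. The rate and latency numbers below are illustrative:

```python
def align_chunk(chunk, latency_steps):
    """Drop the actions that real time has already passed while the
    policy was computing, so execution starts at the right offset."""
    return chunk[latency_steps:]

class ChunkedController:
    """Consume action chunks at a fixed control rate, compensating for
    inference latency (illustrative sketch; numbers are assumptions)."""

    def __init__(self, control_hz=50.0, inference_latency_s=0.12):
        dt = 1.0 / control_hz
        # Number of control ticks lost while the policy was running.
        self.latency_steps = round(inference_latency_s / dt)
        self.queue = []

    def on_new_chunk(self, chunk):
        # Replace the stale queue with the part of the new chunk that is
        # still in the future relative to "now".
        self.queue = align_chunk(chunk, self.latency_steps)

    def next_action(self, hold_last=None):
        if self.queue:
            return self.queue.pop(0)
        return hold_last  # no fresh actions yet: hold the last command
```

At 50 Hz with 120 ms latency, 6 leading actions of each chunk are skipped; without this offset the arm would replay commands that are already out of date, which matters most in the contact-rich insertion phase.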

Best regards!

2 Likes

Same here. I tried both the standard LeRobot ACT and Diffusion policies: the arm gets roughly near the port, but with very poor accuracy.

1 Like

@bha51 I found the same.

I started gathering data for ACT/VLA policy training, but I’m hitting a pretty hard conceptual wall: if there are multiple SCs/NICs (which is possible at test time), how can our policy know which one to choose? This feels like a strong blocker for any off-the-shelf VLA fine-tuning approach (i.e. you’d need a substantially larger number of episodes).

Curious if anyone has considered this. Happy to share some of my findings/thoughts.

Hi @Andrew_Garrett

Great point! My approach is to label the dataset with a specific text prompt for each connector (sfp, sc) and slot (nic_card_mount_0, mount_1, …).

This forces the VLA to learn spatial grounding: mapping the text ID to a unique pixel region, even if the cards look identical.

However, to make this work, I think I need more data than what I have right now. I’m planning to scale up to 100 episodes per prompt (around 1,000 episodes total). Currently, I only have about 5 episodes per task (around 100 episodes), which is definitely not enough. Also, my initial tests were only 5,000 training steps, and I’m planning to increase that significantly.
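A labeling scheme like this can be generated mechanically. In this sketch the connector and slot names follow the examples above, but the instruction template wording and the number of mounts are my own assumptions:

```python
import itertools

# Illustrative names based on the post; the real dataset may differ.
CONNECTORS = ["sfp", "sc"]
SLOTS = [f"nic_card_mount_{i}" for i in range(5)]  # mount count is assumed

def make_prompt(connector, slot):
    # Hypothetical instruction template used to label each episode.
    return f"insert the {connector} connector into {slot}"

def plan_dataset(episodes_per_prompt):
    """Enumerate every (connector, slot) prompt and the episode budget."""
    prompts = [make_prompt(c, s) for c, s in itertools.product(CONNECTORS, SLOTS)]
    return prompts, len(prompts) * episodes_per_prompt

prompts, total = plan_dataset(episodes_per_prompt=100)
# 2 connectors x 5 slots = 10 prompts; 100 episodes each -> 1,000 total,
# which lines up with the scale-up target described above.
```

Enumerating prompts this way also makes it easy to check coverage: every slot appears in the same number of episodes, so the policy can't get away with always reaching for one mount.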

Regards!

1 Like