My first Results: PI0.5 VLA Policy

jlamperez · March 30, 2026, 8:20pm

Hi!

Here are my first results for the AI for Industry Challenge . I’ve been working on an end-to-end VLA (Vision-Language-Action) policy to tackle the autonomous cable insertion task.

In this video, the robotic arm uses a PI0.5 architecture to interpret a natural language command and attempt the insertion of an SC/SFP connector in a Gazebo simulation, though, as you can clearly see, it doesn’t actually manage to plug any of them in yet!

Technical Overview:

Architecture: PI0.5 (PaliGemma 2B + Gemma 300M Action Expert).
Software Stack: ROS 2 and the LeRobot framework.
Inference Hardware: NVIDIA RTX 5090.
Simulation: Gazebo Sim with RGB camera feeds and joint/wrench state feedback.

The policy already shows a strong semantic understanding of the task and good spatial reaching capabilities. I’m currently focused on fine-tuning the real-time control loop and the transition between the approach and the final insertion phase to compensate for inference latency.

Best regards!

bha51 · April 1, 2026, 7:48am

same here tried with both standard lerobot ACT and Diffusion policy, gets roughly to near the port but has very poor accuracy.

Andrew_Garrett · April 1, 2026, 3:13pm

@bha51 I found the same.

I started gathering data for ACT/VLA policy training, but I’m hitting a pretty hard mental wall which is: if there are multiple SCs/NICs (which is possible at test-time), how can our policy know which one to choose? This feels like a strong blocker of any off the shelf VLA fine-tuning approach (i.e. you’d need a substantially larger number of episodes).

Curious if anyone has considered this. Happy to share some of my findings/thoughts.

jlamperez · April 1, 2026, 10:09pm

Hi @Andrew_Garrett

Great point! My approach is to label the dataset with specific text for each connector (sfp, sc) and slot (nic_card_mount_0, mount_1, …).

This forces the VLA to learn spatial grounding : mapping the text ID to a unique pixel region, even if the cards look identical.

However, to make this work, I think I need more data than what I have right now. I’m planning to scale up to 100 episodes per prompt (around 1,000 episodes total). Currently, I only have about 5 episodes per task (around 100 episodes), which is definitely not enough. Also, my initial tests were only 5,000 training steps , and I’m planning to increase that significantly.

Regards!

Topic		Replies	Views
Rocky's Open-Source Build Thread (AI for Industry Challenge) AI for Industry Challenge	29	1164	April 4, 2026
[Case Study] Cross-Morphology Policy Learning with UniVLA and PiPER Robotic Arm ROS General ros2 , ros_control , ros , deep-learning , robotics	0	196	July 31, 2025
AI for Industry Challenge \| Challenge Details AI for Industry Challenge ros2 , gazebo , competition , isaac-sim , mujoco	21	1551	March 20, 2026
Creating an Agentic Skill Library for Robotics, Computer Vision, and Physical AI Manipulation release	3	277	February 6, 2026
The First Large-Scale Agentic Skill Library for Robotics, Computer Vision, and Physical AI Industrial robotics release	0	144	February 10, 2026

My first Results: PI0.5 VLA Policy

Related topics