Here are my first results for the AI for Industry Challenge. I’ve been working on an end-to-end VLA (Vision-Language-Action) policy to tackle the autonomous cable insertion task.
In this video, the robotic arm uses a PI0.5 architecture to interpret a natural language command and attempt the insertion of an SC/SFP connector in a Gazebo simulation, though, as you can clearly see, it doesn’t actually manage to complete the insertion yet!
Simulation: Gazebo Sim with RGB camera feeds and joint/wrench state feedback.
The policy already shows a strong semantic understanding of the task and good spatial reaching capabilities. I’m currently focused on fine-tuning the real-time control loop and the transition between the approach and final insertion phases, to compensate for inference latency.
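For anyone curious what "compensating for inference latency" can look like with chunked policies like ACT: one standard trick is temporal ensembling, where overlapping action chunks predicted at different timesteps are averaged with exponentially decaying weights. A minimal sketch of that idea (not my actual policy code; all names here are illustrative):

```python
import numpy as np


class ChunkedActionBuffer:
    """Execute overlapping action chunks with exponential temporal
    ensembling (ACT-style), so stale predictions still contribute
    while fresh inference results catch up."""

    def __init__(self, chunk_size: int, action_dim: int, decay: float = 0.01):
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        self.decay = decay
        # history[t] collects every chunk prediction that targets timestep t
        self.history: dict[int, list[np.ndarray]] = {}
        self.t = 0  # current control timestep

    def add_chunk(self, chunk: np.ndarray, predicted_at: int) -> None:
        """Register a (chunk_size, action_dim) chunk predicted at `predicted_at`."""
        for k in range(self.chunk_size):
            self.history.setdefault(predicted_at + k, []).append(chunk[k])

    def step(self) -> np.ndarray:
        """Return the ensembled action for the current timestep, then advance."""
        preds = self.history.pop(self.t, [np.zeros(self.action_dim)])
        # older predictions receive exponentially smaller weight
        weights = np.exp(-self.decay * np.arange(len(preds)))
        weights /= weights.sum()
        action = np.average(np.stack(preds), axis=0, weights=weights)
        self.t += 1
        return action
```

The point is that the controller can keep stepping at a fixed rate even if a fresh chunk arrives a few timesteps late, which smooths exactly the approach-to-insertion handoff where latency hurts most.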
I started gathering data for ACT/VLA policy training, but I’m hitting a pretty hard conceptual wall: if there are multiple SCs/NICs at test time (which is possible), how can our policy know which one to choose? This feels like a strong blocker for any off-the-shelf VLA fine-tuning approach (i.e., you’d need a substantially larger number of episodes).
Curious if anyone has considered this. Happy to share some of my findings/thoughts.
Great point! My approach is to label the dataset with specific text for each connector (sfp, sc) and slot (nic_card_mount_0, mount_1, …).
This forces the VLA to learn spatial grounding: mapping the text ID to a unique pixel region, even if the cards look identical.
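Concretely, the labeling scheme is just prompt templating over the connector and slot IDs, so every (connector, mount) pair gets its own distinct instruction. A minimal sketch, with hypothetical mount count and prompt wording:

```python
import itertools

# Connector and mount names mirror the IDs mentioned above;
# the number of mounts (5 here) is just an illustrative choice.
CONNECTORS = ["sfp", "sc"]
MOUNTS = [f"nic_card_mount_{i}" for i in range(5)]


def make_prompt(connector: str, mount: str) -> str:
    """One distinct natural-language instruction per (connector, slot) pair."""
    return f"insert the {connector} connector into {mount}"


prompts = [make_prompt(c, m) for c, m in itertools.product(CONNECTORS, MOUNTS)]
# 2 connectors x 5 mounts -> 10 distinct prompts; each episode in the
# dataset is labeled with the prompt for the exact pairing it demonstrates.
```

Since identical-looking cards only differ by their text ID, the policy has no shortcut: it has to associate each ID string with the correct spatial location.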
However, to make this work, I think I need more data than I have right now. I’m planning to scale up to 100 episodes per prompt (around 1,000 episodes total). Currently, I only have about 5 episodes per task (around 100 episodes), which is definitely not enough. Also, my initial tests were only 5,000 training steps, and I’m planning to increase that significantly.