Hi there ROS community,
My colleague and I are mapping out best practices for managing deployed robot fleets, and we’d love to learn from your real-world experience.
As robots move from the lab into the wild, the process for debugging and resolving issues gets complicated. We’re trying to move past our own ad-hoc methods and are curious about how your teams handle the entire lifecycle of an incident.
Specifically, we’re focused on these four areas:
- Incident & Resolution Tracking
  When a novel issue is solved, how do you capture that hard-won knowledge for the rest of the team? We’re curious about your process for creating a durable record of the diagnostic path and the fix, so the next engineer doesn’t have to solve the same problem from scratch six months from now.
- Hardware & Software Context
  How do you correlate a failure with the specific context of the robot? We’ve found it’s often crucial to know the exact firmware of a sensor, the driver version, the OS patch level, or even the manufacturing batch of a component. How do you capture and surface this data during an investigation?
- Remote vs. On-Site Debugging
  What is your decision tree for debugging? How much can you solve remotely with the data you have? What are the specific triggers that force you to accept defeat and send a person on-site? What’s the one piece of data you wish you had to avoid that trip?
- Fleet-Wide Failure Analysis
  How do you identify systemic issues across your fleet, for example, discovering that a specific component fails more often under certain circumstances? What does your data analysis pipeline look like for finding these patterns: the “what, why, when, and where” of recurring failures?
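To make the last question concrete, here is a minimal sketch of the kind of analysis we have in mind: joining incident records with per-robot context and comparing failure rates across one context field. Everything in it is hypothetical and for illustration only (the field names such as `robot_id` and `lidar_firmware`, the helper `failure_rate_by`, and the sample data); it is not a description of any particular tool or of how anyone's pipeline actually works.

```python
from collections import Counter


def failure_rate_by(robots, incidents, field):
    """Return incidents per robot, grouped by one context field.

    `robots` and `incidents` are lists of dicts; the schema is an
    assumption made for this sketch, not a real fleet data model.
    """
    # Map each robot to its value for the chosen context field.
    context = {r["robot_id"]: r[field] for r in robots}
    # How many robots share each value (the exposure for that group).
    fleet_size = Counter(context.values())
    # How many incidents occurred on robots with each value.
    failures = Counter(
        context[i["robot_id"]] for i in incidents if i["robot_id"] in context
    )
    return {value: failures[value] / n for value, n in fleet_size.items()}


# Made-up sample data: four robots, two LiDAR firmware versions.
robots = [
    {"robot_id": "r1", "lidar_firmware": "1.2.0"},
    {"robot_id": "r2", "lidar_firmware": "1.2.0"},
    {"robot_id": "r3", "lidar_firmware": "1.3.1"},
    {"robot_id": "r4", "lidar_firmware": "1.3.1"},
]
incidents = [
    {"robot_id": "r3", "component": "lidar"},
    {"robot_id": "r4", "component": "lidar"},
    {"robot_id": "r3", "component": "lidar"},
]

rates = failure_rate_by(robots, incidents, "lidar_firmware")
print(rates)  # {'1.2.0': 0.0, '1.3.1': 1.5}
```

Even a toy like this only works if the context from the second question was recorded when the incident happened, which is a large part of why we are asking about both together.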
We’re hoping to get a good public discussion going in this thread about the tools and workflows you’re using today, whether that’s custom scripts, Telegraf, Prometheus, Grafana dashboards, or something else.
On a separate note, this problem space is our team’s entire focus at INSAION. If you’re wrestling with these challenges daily and find the current tooling inadequate, we’d be very interested to hear your perspective. Please feel free to send me a DM for an honest, engineer-to-engineer conversation.
Keep your robots healthy and running!
Sergi from INSAION