What's missing from ROS 2 diagnostics (and what we built)

ROS 2 Diagnostics Are Stuck in 2010

Part 1 of the “Beyond ros2 topic echo /diagnostics” series on production diagnostics for ROS 2.


We’ve all been there. Robot stops moving in the field. You SSH in, run ros2 topic echo /diagnostics, and get a wall of text scrolling faster than you can read…

Something flashes ERROR for half a second. Then it’s OK again. Then ERROR.

You’re not sure if you’re looking at a real problem or sensor noise. There’s no history, no context, nothing to go back to. The fault is gone before you can even copy-paste it.


diagnostic_updater is honest about what it is

REP 107 doesn’t pretend to be something it’s not:

“The diagnostics system is designed for collecting, publishing, and reporting hardware diagnostics data from a robot.”

and:

“The intended consumer of diagnostics data is a person.”

That’s it. Hardware data, for a human looking at a screen. No API, no persistence, no fault lifecycle. diagnostic_updater is a port from ROS 1, designed for a time when one person operated one robot in a lab. And honestly, for that use case it’s fine.

But we’re not in 2010 anymore. Robots ship to customers. They run in warehouses, on farms, on construction sites. A typical robot in 2010 had maybe a laser scanner and a few joint encoders - a handful of topics at low frequency. Now you’ve got 3D LiDAR, stereo cameras, IMUs, GPS, force/torque sensors, all publishing at hundreds of Hz. Good luck scrolling through that in a terminal. When something breaks at 3 AM, nobody’s there to stare at it anyway.

What’s actually missing

| What you need in production | What you get today |
| --- | --- |
| Structured fault codes with severity | ❌ 4 levels + a string |
| Fault history | ❌ Pub/sub. Blink and it's gone. |
| Fault lifecycle (report → confirm → heal → clear) | ❌ Stateless. Every glitch is an event: no debounce, no filtering. |
| REST API for dashboards, fleet tools, alerting | rosbridge WebSocket or a full ROS 2 client |
| Root cause analysis | ❌ Nothing |
| Automatic data capture on fault | `rosbag record` and pray |
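To make the "fault lifecycle" row concrete, here's a minimal, framework-free sketch of what report → confirm → heal → clear with debounce could look like. All names, states, and thresholds are illustrative, not from any existing ROS package:

```python
import time


class Fault:
    """Tracks one fault code through report -> confirmed -> healed -> cleared."""

    def __init__(self, code, confirm_after=3, heal_after=5.0):
        self.code = code
        self.confirm_after = confirm_after  # consecutive reports needed to confirm
        self.heal_after = heal_after        # seconds without reports before healing
        self.state = "inactive"
        self.reports = 0
        self.last_report = None

    def report(self, now=None):
        """Called on every raw error observation; debounces into a confirmed fault."""
        now = time.monotonic() if now is None else now
        self.last_report = now
        self.reports += 1
        if self.state in ("inactive", "healed"):
            self.state = "reported"
        if self.state == "reported" and self.reports >= self.confirm_after:
            self.state = "confirmed"

    def tick(self, now=None):
        """Periodic check: a confirmed fault heals after a quiet period."""
        now = time.monotonic() if now is None else now
        if (self.state == "confirmed" and self.last_report is not None
                and now - self.last_report > self.heal_after):
            self.state = "healed"
            self.reports = 0

    def clear(self):
        """Operator acknowledges a healed fault, e.g. after repair."""
        if self.state == "healed":
            self.state = "inactive"
```

With this model, a flaky sensor toggling ERROR/OK fifty times in a minute produces one confirmed fault that stays active until it goes quiet, instead of fifty independent events.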

This isn’t some wish list for the future. Every serious production system has these things. Your car has had them since 1996 (OBD-II).

Cars solved this. Why haven’t we?

Take your car to the mechanic. They plug in a reader and get:

  • a fault code (P0301 - cylinder 1 misfire)

  • a freeze frame of what the engine was doing when it happened

  • a history of when it first occurred

  • a way to clear it after repair

That’s not rocket science. It’s just standardized diagnostics.

The automotive world has been on this path for decades. Their latest standard (SOVD from ASAM) drops the old binary CAN protocols and uses plain HTTP/REST: a JSON API for vehicle diagnostics. Sound familiar? It should, because that's what every web developer already knows how to use.
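To make the JSON-over-HTTP idea concrete, here's a schematic fault record in the spirit of an OBD-II DTC with its freeze frame. The field names are ours for illustration; the actual resource model and paths are defined by the ASAM SOVD spec:

```python
import json

# Illustrative fault record: code + severity + history + freeze frame,
# the kind of payload a diagnostics REST endpoint could return.
fault = {
    "code": "P0301",
    "display": "Cylinder 1 misfire",
    "severity": "error",
    "status": "confirmed",
    "first_occurred": "2026-02-12T08:05:00Z",
    "occurrences": 4,
    "freeze_frame": {"rpm": 2200, "coolant_temp_c": 92, "speed_kph": 63},
}

print(json.dumps(fault, indent=2))
```

Nothing exotic: any fleet dashboard, alerting rule, or CI check can consume this with a plain HTTP client, no DDS stack required.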

Robots today remind me of cars in the early 2000s. Complex enough that things break in weird ways, but still diagnosing problems with the equivalent of sticking your head under the hood and listening.

We’re working on this

We've spent the last few months building ros2_medkit. It's our attempt at filling these gaps for ROS 2:

  • Fault lifecycle with debounce and filtering

  • Automatic rosbag capture on fault

  • REST/SSE gateway - no ROS 2 client needed

  • Root cause correlation

Open source, Apache 2.0, runs on Jazzy.

This is the first post in a series where we'll walk through all of it. Each post will come with a Docker demo you can spin up and try. No GPU needed, no complicated setup.

Navigation got Nav2. Manipulation got MoveIt2. We think diagnostics deserves its own stack too.

Next episode: Part 2 - Your First Fault Manager in 5 Minutes

7 Likes

It’s great to see that you’re pushing forward to add introspection capabilities. I am very happy to see that.

However, I would recommend you do apples-to-apples comparisons in your promotional material. A non-trivial amount of your post falsely trivializes past work (comparing a low-level reporting mechanism to a GUI front end) and ignores prior work, which decreases the credibility of the rest of it.

Even back in 2010, all the sensors you list existed and were actively used in ROS systems. And not only that, there were GUIs with basically all of the features you mention, including aggregators, collapsing views, and time histories you could pause and scrub back through, with error and warning messages highlighted for the user.

2 Likes

You’re right, and I should have been clearer about what I’m comparing to what :sweat_smile:
rqt_robot_monitor and the diagnostic_aggregator are solid tools, I use them myself and I didn’t mean to dismiss that work.

The point I was trying to make isn’t about the visualization layer. rqt_robot_monitor does a good job showing you what’s happening right now. The gap I see is underneath that:

  • Hardware vs software diagnostics - REP 107 and rqt_robot_monitor are built around hardware monitoring: CPU load, sensor status, temperatures. But what about software faults? A navigation algorithm failing, a planning node crashing, a behavior tree entering an unexpected state: there's no structured way to report and track those in the current diagnostic stack.

  • Fault lifecycle - REP 107 gives you OK/WARN/ERROR/STALE as a current snapshot. There’s no concept of a fault being reported, confirmed after debounce, healed, or cleared. Every status update is independent. If a sensor is flaky and toggles ERROR/OK 50 times in a minute, you get 50 callbacks with no filtering :see_no_evil_monkey:

  • Persistence - when rqt_robot_monitor shows an error and you close the window, it’s gone. There’s no fault history you can query later. “What faults happened last Tuesday between 2 and 3 PM?” - you can’t answer that without external logging.

  • Snapshots - “what was the robot doing when this fault happened?” In automotive, a DTC comes with a freeze frame (engine RPM, coolant temp, speed at the moment of failure). In ROS 2 there’s nothing like that. You’d need someone to have been running rosbag at the right time. We should be able to capture topic data automatically when a fault confirms.

  • Programmatic access - if a fleet management system or a CI pipeline needs to check robot health, it has to subscribe to DDS/ROS topics. There’s no REST endpoint, no way to query from outside the ROS ecosystem.

That’s the layer I’m comparing to automotive (OBD-II/UDS/SOVD), not the GUI (sorry if I was unclear and created confusion :folded_hands:). SOVD isn’t a dashboard; it’s a structured diagnostic data model with HTTP access, fault history, environment snapshots, and lifecycle states.
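As a rough illustration of the snapshot point, here's a minimal sketch that shells out to the real `ros2 bag record` CLI when a fault confirms. The function names are ours, and a production version would more likely record continuously into a ring buffer so the moments *before* the fault are captured too:

```python
import signal
import subprocess
import time


def build_record_cmd(topics, out_dir):
    """Build the `ros2 bag record` invocation for the given topics."""
    return ["ros2", "bag", "record", "-o", out_dir, *topics]


def capture_snapshot(topics, duration_s=10.0, out_dir="fault_snapshot"):
    """Record topics for duration_s seconds; call this when a fault confirms.

    Requires a sourced ROS 2 environment. `ros2 bag record` finalizes the
    bag cleanly when it receives SIGINT.
    """
    proc = subprocess.Popen(build_record_cmd(topics, out_dir))
    time.sleep(duration_s)
    proc.send_signal(signal.SIGINT)
    proc.wait()
    return out_dir
```

Starting a recorder only after confirmation misses the lead-up to the fault, which is exactly why automotive freeze frames are sampled from an always-on buffer; the sketch just shows the trigger wiring.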

Thanks for your opinion!

1 Like