What's missing from ROS 2 diagnostics (and what we built)

ROS 2 Diagnostics Are Stuck in 2010

Part 1 of the “Beyond ros2 topic echo /diagnostics” series on production diagnostics for ROS 2.


We’ve all been there. Robot stops moving in the field. You SSH in, run ros2 topic echo /diagnostics, and get a wall of text scrolling faster than you can read…

Something flashes ERROR for half a second. Then it’s OK again. Then ERROR.

You’re not sure if you’re looking at a real problem or sensor noise. There’s no history, no context, nothing to go back to. The fault is gone before you can even copy-paste it.


diagnostic_updater is honest about what it is

REP 107 doesn’t pretend to be something it’s not:

“The diagnostics system is designed for collecting, publishing, and reporting hardware diagnostics data from a robot.”

and:

“The intended consumer of diagnostics data is a person.”

That’s it. Hardware data, for a human looking at a screen. No API, no persistence, no fault lifecycle. diagnostic_updater is a port from ROS 1, designed for a time when one person operated one robot in a lab. And honestly, for that use case it’s fine.

But we’re not in 2010 anymore. Robots ship to customers. They run in warehouses, on farms, on construction sites. A typical robot in 2010 had maybe a laser scanner and a few joint encoders - a handful of topics at low frequency. Now you’ve got 3D LiDAR, stereo cameras, IMUs, GPS, force/torque sensors, all publishing at hundreds of Hz. Good luck scrolling through that in a terminal. When something breaks at 3 AM, nobody’s there to stare at it anyway.

What’s actually missing

| What you need in production | What you get today |
| --- | --- |
| Structured fault codes with severity | ❌ 4 levels + a string |
| Fault history | ❌ Pub/sub. Blink and it's gone. |
| Fault lifecycle (report → confirm → heal → clear) | ❌ Stateless. Every glitch is an event: no debounce, no filtering. |
| REST API for dashboards, fleet tools, alerting | rosbridge WebSocket or a full ROS 2 client |
| Root cause analysis | ❌ Nothing |
| Automatic data capture on fault | `rosbag record` and pray |
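To make the "fault lifecycle" row concrete, here's a minimal, framework-free sketch of what report → confirm → heal → clear with debounce could look like. All names, states, and thresholds are illustrative, not from any existing ROS package:

```python
import time


class Fault:
    """Tracks one fault code through report -> confirmed -> healed -> cleared."""

    def __init__(self, code, confirm_after=3, heal_after=5.0):
        self.code = code
        self.confirm_after = confirm_after  # consecutive reports needed to confirm
        self.heal_after = heal_after        # seconds without reports before healing
        self.state = "inactive"
        self.reports = 0
        self.last_report = None

    def report(self, now=None):
        """Called on every raw error observation; debounces into a confirmed fault."""
        now = time.monotonic() if now is None else now
        self.last_report = now
        self.reports += 1
        if self.state in ("inactive", "healed"):
            self.state = "reported"
        if self.state == "reported" and self.reports >= self.confirm_after:
            self.state = "confirmed"

    def tick(self, now=None):
        """Periodic check: a confirmed fault heals after a quiet period."""
        now = time.monotonic() if now is None else now
        if (self.state == "confirmed" and self.last_report is not None
                and now - self.last_report > self.heal_after):
            self.state = "healed"
            self.reports = 0

    def clear(self):
        """Operator acknowledges a healed fault, e.g. after repair."""
        if self.state == "healed":
            self.state = "inactive"
```

With this model, a flaky sensor toggling ERROR/OK fifty times in a minute produces one confirmed fault that stays active until it goes quiet, instead of fifty independent events.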

This isn’t some wish list for the future. Every serious production system has these things. Your car has had them since 1996 (OBD-II).

Cars solved this. Why haven’t we?

Take your car to the mechanic. They plug in a reader and get:

  • a fault code (P0301 - cylinder 1 misfire)

  • a freeze frame of what the engine was doing when it happened

  • a history of when it first occurred

  • a way to clear it after repair

That’s not rocket science. It’s just standardized diagnostics.

The automotive world has been on this path for decades. Their latest standard (SOVD from ASAM) drops the old binary CAN protocols and uses plain HTTP/REST: a JSON API for vehicle diagnostics. Sound familiar? It should, because that's what every web developer already knows how to use.
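To make the JSON-over-HTTP idea concrete, here's a schematic fault record in the spirit of an OBD-II DTC with its freeze frame. The field names are ours for illustration; the actual resource model and paths are defined by the ASAM SOVD spec:

```python
import json

# Illustrative fault record: code + severity + history + freeze frame,
# the kind of payload a diagnostics REST endpoint could return.
fault = {
    "code": "P0301",
    "display": "Cylinder 1 misfire",
    "severity": "error",
    "status": "confirmed",
    "first_occurred": "2026-02-12T08:05:00Z",
    "occurrences": 4,
    "freeze_frame": {"rpm": 2200, "coolant_temp_c": 92, "speed_kph": 63},
}

print(json.dumps(fault, indent=2))
```

Nothing exotic: any fleet dashboard, alerting rule, or CI check can consume this with a plain HTTP client, no DDS stack required.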

Robots today remind me of cars in the early 2000s. Complex enough that things break in weird ways, but still diagnosing problems with the equivalent of sticking your head under the hood and listening.

We’re working on this

We've spent the last few months building ros2_medkit. It's our attempt at filling these gaps for ROS 2:

  • Fault lifecycle with debounce and filtering

  • Automatic rosbag capture on fault

  • REST/SSE gateway - no ROS 2 client needed

  • Root cause correlation

Open source, Apache 2.0, runs on Jazzy.

This is the first post in a series where we'll walk through all of it. Each post will come with a Docker demo you can spin up and try. No GPU needed, no complicated setup.

Navigation got Nav2. Manipulation got MoveIt2. We think diagnostics deserves its own stack too.

Next episode: Part 2 - Your First Fault Manager in 5 Minutes

7 Likes

It’s great to see that you’re pushing forward to add introspection capabilities. I am very happy to see that.

However, I would recommend you do apples-to-apples comparisons in your promotional material. A non-trivial amount of your post falsely trivializes past work (comparing a low-level reporting mechanism to a GUI front end) and ignores prior work, which decreases the credibility of the rest of it.

Even back in 2010, all the sensors you list existed and were actively used in ROS systems. And not only that, there were GUIs with basically all of the features you mention, including aggregators, collapsing views, and time histories you could pause and scrub back through, with error and warning messages highlighted for the user.

2 Likes

You’re right, and I should have been clearer about what I’m comparing to what :sweat_smile:
rqt_robot_monitor and the diagnostic_aggregator are solid tools, I use them myself and I didn’t mean to dismiss that work.

The point I was trying to make isn’t about the visualization layer. rqt_robot_monitor does a good job showing you what’s happening right now. The gap I see is underneath that:

  • Hardware vs software diagnostics - REP 107 and rqt_robot_monitor are built around hardware monitoring: CPU load, sensor status, temperatures. But what about software faults? A navigation algorithm failing, a planning node crashing, a behavior tree entering an unexpected state: there's no structured way to report and track those in the current diagnostic stack.

  • Fault lifecycle - REP 107 gives you OK/WARN/ERROR/STALE as a current snapshot. There’s no concept of a fault being reported, confirmed after debounce, healed, or cleared. Every status update is independent. If a sensor is flaky and toggles ERROR/OK 50 times in a minute, you get 50 callbacks with no filtering :see_no_evil_monkey:

  • Persistence - when rqt_robot_monitor shows an error and you close the window, it’s gone. There’s no fault history you can query later. “What faults happened last Tuesday between 2 and 3 PM?” - you can’t answer that without external logging.

  • Snapshots - “what was the robot doing when this fault happened?” In automotive, a DTC comes with a freeze frame (engine RPM, coolant temp, speed at the moment of failure). In ROS 2 there’s nothing like that. You’d need someone to have been running rosbag at the right time. We should be able to capture topic data automatically when a fault confirms.

  • Programmatic access - if a fleet management system or a CI pipeline needs to check robot health, it has to subscribe to DDS/ROS topics. There’s no REST endpoint, no way to query from outside the ROS ecosystem.

That’s the layer I’m comparing to automotive (OBD-II/UDS/SOVD), not the GUI (sorry if I was unclear and created confusion :folded_hands:). SOVD isn’t a dashboard; it’s a structured diagnostic data model with HTTP access, fault history, environment snapshots, and lifecycle states.
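As a rough illustration of the snapshot point, here's a minimal sketch that shells out to the real `ros2 bag record` CLI when a fault confirms. The function names are ours, and a production version would more likely record continuously into a ring buffer so the moments *before* the fault are captured too:

```python
import signal
import subprocess
import time


def build_record_cmd(topics, out_dir):
    """Build the `ros2 bag record` invocation for the given topics."""
    return ["ros2", "bag", "record", "-o", out_dir, *topics]


def capture_snapshot(topics, duration_s=10.0, out_dir="fault_snapshot"):
    """Record topics for duration_s seconds; call this when a fault confirms.

    Requires a sourced ROS 2 environment. `ros2 bag record` finalizes the
    bag cleanly when it receives SIGINT.
    """
    proc = subprocess.Popen(build_record_cmd(topics, out_dir))
    time.sleep(duration_s)
    proc.send_signal(signal.SIGINT)
    proc.wait()
    return out_dir
```

Starting a recorder only after confirmation misses the lead-up to the fault, which is exactly why automotive freeze frames are sampled from an always-on buffer; the sketch just shows the trigger wiring.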

Thanks for your opinion!

1 Like