Your First Fault Manager in 5 Minutes
Part 2 of “Beyond ros2 topic echo /diagnostics” - Part 1 here
Your LiDAR dropped out 47 times yesterday. Loose USB connector, electrical noise from the motors, battery droop under load - the sensor comes back every time, so diagnostic_updater prints ERROR, then OK, then ERROR, then OK. 47 times.
You saw: none of them. Each one vanished before the next refresh. No history, no count, no “when did this start?”
By the end of this post you’ll have a fault manager that persists every fault with full metadata, a 3-line client that reports faults from any node, and a REST API to query fault history without a ROS 2 client. Setup takes about 5 minutes.

Quick start
Option 1: Docker (recommended, <1 min)
git clone https://github.com/selfpatch/selfpatch_demos.git
cd selfpatch_demos/demos/sensor_diagnostics
./run-demo.sh
# REST API: http://localhost:8080/api/v1/
# Web UI: http://localhost:3000 (optional - the demo includes a browser dashboard too)
If you use Docker, everything below is already running - skip to Try it yourself if you want to see results first.
Option 2: Build from source
# In your ROS 2 Jazzy workspace
cd ~/ros2_ws/src
git clone https://github.com/selfpatch/ros2_medkit.git
cd ~/ros2_ws
colcon build --packages-up-to ros2_medkit_fault_manager ros2_medkit_fault_reporter
source install/setup.bash
Start the Fault Manager
ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=memory
Done. The fault manager is running. It exposes ROS 2 services for reporting, querying, and clearing faults.
storage_type:=memory keeps faults in RAM - good for development. For production, switch to storage_type:=sqlite with a database_path parameter and your faults survive restarts.
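For example, a persistent setup could look like this - the database path is just an illustration (point it anywhere the node can write), and passing database_path with -p assumes it's a plain node parameter like storage_type:
# Faults survive restarts with SQLite storage
ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=sqlite \
  -p database_path:=/var/lib/ros2_medkit/faults.db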
Report your first fault
Add ros2_medkit_fault_reporter to your package dependencies, then:
#include "ros2_medkit_fault_reporter/fault_reporter.hpp"
// In your node:
auto reporter = FaultReporter(node, "lidar_driver");
// Something goes wrong:
reporter.report("LIDAR_TIMEOUT",
                Fault::SEVERITY_ERROR,
                "No scan received for 500ms");
// It recovers:
reporter.report_passed("LIDAR_TIMEOUT");
That’s the entire client API. report() tells the fault manager something went wrong. report_passed() tells it the problem resolved. The fault manager handles persistence, lifecycle, timestamps, and occurrence counting.
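Here's a minimal sketch of where those two calls might live in a real driver - a watchdog that reports LIDAR_TIMEOUT when no scan arrives for 500 ms and reports it passed once scans resume. The FaultReporter and Fault names come from the snippet above; the scan topic, the watchdog logic, the sensor_msgs dependency, and the exact constructor argument (shared pointer vs. reference to the node) are illustrative assumptions - check fault_reporter.hpp for the real signatures.
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "ros2_medkit_fault_reporter/fault_reporter.hpp"

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("lidar_driver");

  // Same two-argument constructor as the snippet above. Whether it takes a
  // shared_ptr, a reference, or a raw pointer to the node is an assumption.
  auto reporter = FaultReporter(node, "lidar_driver");

  rclcpp::Time last_scan = node->now();
  bool fault_active = false;

  // Reset the watchdog on every scan; report recovery if a fault was active.
  auto scan_sub = node->create_subscription<sensor_msgs::msg::LaserScan>(
    "scan", 10,
    [&](const sensor_msgs::msg::LaserScan &) {
      last_scan = node->now();
      if (fault_active) {
        reporter.report_passed("LIDAR_TIMEOUT");
        fault_active = false;
      }
    });

  // Watchdog: no scan for 500 ms -> report the fault once.
  auto watchdog = node->create_wall_timer(100ms, [&]() {
    if (!fault_active && (node->now() - last_scan).seconds() > 0.5) {
      reporter.report("LIDAR_TIMEOUT",
                      Fault::SEVERITY_ERROR,
                      "No scan received for 500ms");
      fault_active = true;
    }
  });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}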
What happens under the hood
When you call report("LIDAR_TIMEOUT", ...):
- FaultReporter calls the /fault_manager/report_fault service
- Fault manager creates or updates the fault record
- The fault is stored with: code, severity, description, source, first/last occurrence timestamps, occurrence count
- If multiple nodes report the same LIDAR_TIMEOUT, their reports are aggregated - one fault, multiple reporting sources
When you call report_passed("LIDAR_TIMEOUT"):
- A PASSED event reaches the fault manager
- The fault’s state moves toward healing
- With healing enabled (healing_enabled: true, off by default), the fault heals after enough PASSED events (default healing_threshold: 3). The Docker demo has this pre-configured - see the sketch below.
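Running from source and want the same behaviour as the demo? A sketch of the relevant parameters - healing_enabled and healing_threshold are the names described above, but passing them on the command line like this is an assumption; the Docker demo ships them in its config:
# Enable self-healing (off by default)
ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=memory \
  -p healing_enabled:=true \
  -p healing_threshold:=3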
The fault record stays in the database. You can query it later. It doesn’t vanish.
Back to those 47 LiDAR timeouts: instead of 47 identical terminal lines you get one fault record with occurrence_count: 47, first_occurred and last_occurred timestamps, and - when you enable snapshots - a freeze frame of what the robot was doing at that exact moment: battery voltage, motor state, velocity. Was it a power brownout? Electrical noise under load? Now you have the data to tell. We’ll cover snapshots in Part 4 and fault correlation in Part 7.
Fault lifecycle
This is what diagnostic_updater doesn’t have - a state machine for each fault:
- PREFAILED - “Might be a problem.” Fault reported but not yet confirmed.
- CONFIRMED - “This is real.” Enough evidence to treat it as a confirmed fault.
- PREPASSED - “Getting better.” More PASSED than FAILED events - trending toward resolution.
- HEALED - “Robot fixed itself.” Enough PASSED events received.
- CLEARED - “Operator acknowledged.” Manually cleared via service or API.
With default settings, faults confirm immediately on the first report (confirmation_threshold: -1). That’s fine for getting started. In Part 3 we’ll configure debounce - because right now, every sensor glitch creates a confirmed fault.
Query your faults
Via ROS 2 services
# List all active faults
ros2 service call /fault_manager/list_faults \
  ros2_medkit_msgs/srv/ListFaults
# Get details for one fault
ros2 service call /fault_manager/get_fault \
  ros2_medkit_msgs/srv/GetFault "{fault_code: 'LIDAR_TIMEOUT'}"
# Clear a fault (operator acknowledgment)
ros2 service call /fault_manager/clear_fault \
  ros2_medkit_msgs/srv/ClearFault "{fault_code: 'LIDAR_TIMEOUT'}"
Via REST API
If you’re running the gateway (included in the Docker demo):
# List all faults across the system
curl http://localhost:8080/api/v1/faults | jq
# Faults are scoped to the reporting app - use the entity path:
curl http://localhost:8080/api/v1/apps/diagnostic-bridge/faults/LIDAR_SIM | jq
# Clear
curl -X DELETE http://localhost:8080/api/v1/apps/diagnostic-bridge/faults/LIDAR_SIM
No ROS 2 dependency on the client side. Any HTTP client works - a web dashboard, a fleet manager, a PagerDuty webhook, your phone’s browser.
Try it yourself
Using the Docker demo:
# Check current faults (should be clean)
curl http://localhost:8080/api/v1/faults | jq
# Inject a LiDAR failure (100% of scans will fail)
curl -X PUT http://localhost:8080/api/v1/apps/lidar-sim/configurations/failure_probability \
  -H "Content-Type: application/json" -d '{"value": 1.0}'
# Wait a few seconds, then check - fault appeared
# ?status=all includes all lifecycle states (confirmed, healed, etc.)
curl "http://localhost:8080/api/v1/faults?status=all" | jq
# Stop the failure - but DON'T clear the fault
curl -X PUT http://localhost:8080/api/v1/apps/lidar-sim/configurations/failure_probability \
  -H "Content-Type: application/json" -d '{"value": 0.0}'
# Wait for healing - with 50 FAILED events, the debounce counter needs
# 53 PASSED events to reach the healing threshold. At 10 Hz that's ~5 seconds.
sleep 10
curl "http://localhost:8080/api/v1/faults?status=all" | jq
If you prefer a browser: the demo also includes a Web UI at http://localhost:3000 (enable the “All” status filter to see healed faults).
The key difference: ros2 topic echo /diagnostics goes quiet after recovery - zero trace anything happened. The fault record stays - with timestamps, occurrence count, and the full state history.
Next: Part 3 - Taming the Fault Storm: Debounce & Filtering
Part 1: ROS 2 Diagnostics Are Stuck in 2010
GitHub: selfpatch/ros2_medkit (Apache 2.0, ROS 2 Jazzy)
Demo repo: selfpatch/selfpatch_demos
