Your First Fault Manager in 5 Minutes
Part 2 of “Beyond ros2 topic echo /diagnostics” - Part 1 here
Your LiDAR dropped out 47 times yesterday. Loose USB connector, electrical noise from the motors, battery droop under load - the sensor comes back every time, so diagnostic_updater prints ERROR, then OK, then ERROR, then OK. 47 times.
You saw: none of them. Each one vanished before the next refresh. No history, no count, no “when did this start?”
By the end of this post you’ll have a fault manager that persists every fault with full metadata, a 3-line client that reports faults from any node, and a REST API to query fault history without a ROS 2 client. Setup takes about 5 minutes.

Quick start
Option 1: Docker (recommended, <1 min)
git clone https://github.com/selfpatch/selfpatch_demos.git
cd selfpatch_demos/demos/sensor_diagnostics
./run-demo.sh
# REST API: http://localhost:8080/api/v1/
# Web UI: http://localhost:3000 (optional - the demo includes a browser dashboard too)
If you use Docker, everything below is already running - skip to Try it yourself if you want to see results first.
Option 2: Build from source
# In your ROS 2 Jazzy workspace
cd ~/ros2_ws/src
git clone https://github.com/selfpatch/ros2_medkit.git
cd ~/ros2_ws
colcon build --packages-up-to ros2_medkit_fault_manager ros2_medkit_fault_reporter
source install/setup.bash
Start the Fault Manager
ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=memory
Done. The fault manager is running. It exposes ROS 2 services for reporting, querying, and clearing faults.
storage_type:=memory keeps faults in RAM - good for development. For production, switch to storage_type:=sqlite with a database_path parameter and your faults survive restarts.
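For example, a persistent setup could look like this - the database path is just an illustration (point it anywhere the node can write), and passing database_path with -p assumes it's a plain node parameter like storage_type:
# Faults survive restarts with SQLite storage
ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=sqlite \
  -p database_path:=/var/lib/ros2_medkit/faults.db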
Report your first fault
Add ros2_medkit_fault_reporter to your package dependencies, then:
#include "ros2_medkit_fault_reporter/fault_reporter.hpp"
// In your node:
auto reporter = FaultReporter(node, "lidar_driver");
// Something goes wrong:
reporter.report("LIDAR_TIMEOUT",
                Fault::SEVERITY_ERROR,
                "No scan received for 500ms");
// It recovers:
reporter.report_passed("LIDAR_TIMEOUT");
That’s the entire client API. report() tells the fault manager something went wrong. report_passed() tells it the problem resolved. The fault manager handles persistence, lifecycle, timestamps, and occurrence counting.
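Here's a minimal sketch of where those two calls might live in a real driver - a watchdog that reports LIDAR_TIMEOUT when no scan arrives for 500 ms and reports it passed once scans resume. The FaultReporter and Fault names come from the snippet above; the scan topic, the watchdog logic, the sensor_msgs dependency, and the exact constructor argument (shared pointer vs. reference to the node) are illustrative assumptions - check fault_reporter.hpp for the real signatures.
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "ros2_medkit_fault_reporter/fault_reporter.hpp"

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("lidar_driver");

  // Same two-argument constructor as the snippet above. Whether it takes a
  // shared_ptr, a reference, or a raw pointer to the node is an assumption.
  auto reporter = FaultReporter(node, "lidar_driver");

  rclcpp::Time last_scan = node->now();
  bool fault_active = false;

  // Reset the watchdog on every scan; report recovery if a fault was active.
  auto scan_sub = node->create_subscription<sensor_msgs::msg::LaserScan>(
    "scan", 10,
    [&](const sensor_msgs::msg::LaserScan &) {
      last_scan = node->now();
      if (fault_active) {
        reporter.report_passed("LIDAR_TIMEOUT");
        fault_active = false;
      }
    });

  // Watchdog: no scan for 500 ms -> report the fault once.
  auto watchdog = node->create_wall_timer(100ms, [&]() {
    if (!fault_active && (node->now() - last_scan).seconds() > 0.5) {
      reporter.report("LIDAR_TIMEOUT",
                      Fault::SEVERITY_ERROR,
                      "No scan received for 500ms");
      fault_active = true;
    }
  });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}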
What happens under the hood
When you call report("LIDAR_TIMEOUT", ...):
- FaultReporter calls the /fault_manager/report_fault service
- Fault manager creates or updates the fault record
- The fault is stored with: code, severity, description, source, first/last occurrence timestamps, occurrence count
- If multiple nodes report the same LIDAR_TIMEOUT, their reports are aggregated - one fault, multiple reporting sources
When you call report_passed("LIDAR_TIMEOUT"):
- A PASSED event reaches the fault manager
- The fault’s state moves toward healing
- With healing enabled (healing_enabled: true, off by default), the fault heals after enough PASSED events (default healing_threshold: 3). The Docker demo has this pre-configured - see the sketch below.
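Running from source and want the same behaviour as the demo? A sketch of the relevant parameters - healing_enabled and healing_threshold are the names described above, but passing them on the command line like this is an assumption; the Docker demo ships them in its config:
# Enable self-healing (off by default)
ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=memory \
  -p healing_enabled:=true \
  -p healing_threshold:=3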
The fault record stays in the database. You can query it later. It doesn’t vanish.
Back to those 47 LiDAR timeouts: instead of 47 identical terminal lines you get one fault record with occurrence_count: 47, first_occurred and last_occurred timestamps, and - when you enable snapshots - a freeze frame of what the robot was doing at that exact moment: battery voltage, motor state, velocity. Was it a power brownout? Electrical noise under load? Now you have the data to tell. We’ll cover snapshots in Part 4 and fault correlation in Part 7.
Fault lifecycle
This is what diagnostic_updater doesn’t have - a state machine for each fault:
- PREFAILED - “Might be a problem.” Fault reported but not yet confirmed.
- CONFIRMED - “This is real.” Enough evidence to treat it as a confirmed fault.
- PREPASSED - “Getting better.” More PASSED than FAILED events - trending toward resolution.
- HEALED - “Robot fixed itself.” Enough PASSED events received.
- CLEARED - “Operator acknowledged.” Manually cleared via service or API.
With default settings, faults confirm immediately on the first report (confirmation_threshold: -1). That’s fine for getting started. In Part 3 we’ll configure debounce - because right now, every sensor glitch creates a confirmed fault.
Query your faults
Via ROS 2 services
# List all active faults
ros2 service call /fault_manager/list_faults \
  ros2_medkit_msgs/srv/ListFaults
# Get details for one fault
ros2 service call /fault_manager/get_fault \
  ros2_medkit_msgs/srv/GetFault "{fault_code: 'LIDAR_TIMEOUT'}"
# Clear a fault (operator acknowledgment)
ros2 service call /fault_manager/clear_fault \
  ros2_medkit_msgs/srv/ClearFault "{fault_code: 'LIDAR_TIMEOUT'}"
Via REST API
If you’re running the gateway (included in the Docker demo):
# List all faults across the system
curl http://localhost:8080/api/v1/faults | jq
# Faults are scoped to the reporting app - use the entity path:
curl http://localhost:8080/api/v1/apps/diagnostic-bridge/faults/LIDAR_SIM | jq
# Clear
curl -X DELETE http://localhost:8080/api/v1/apps/diagnostic-bridge/faults/LIDAR_SIM
No ROS 2 dependency on the client side. Any HTTP client works - a web dashboard, a fleet manager, a PagerDuty webhook, your phone’s browser.
Try it yourself
Using the Docker demo:
# Check current faults (should be clean)
curl http://localhost:8080/api/v1/faults | jq
# Inject a LiDAR failure (100% of scans will fail)
curl -X PUT http://localhost:8080/api/v1/apps/lidar-sim/configurations/failure_probability \
  -H "Content-Type: application/json" -d '{"value": 1.0}'
# Wait a few seconds, then check - fault appeared
# ?status=all includes all lifecycle states (confirmed, healed, etc.)
curl "http://localhost:8080/api/v1/faults?status=all" | jq
# Stop the failure - but DON'T clear the fault
curl -X PUT http://localhost:8080/api/v1/apps/lidar-sim/configurations/failure_probability \
  -H "Content-Type: application/json" -d '{"value": 0.0}'
# Wait for healing - with 50 FAILED events, the debounce counter needs
# 53 PASSED events to reach the healing threshold. At 10 Hz that's ~5 seconds.
sleep 10
curl "http://localhost:8080/api/v1/faults?status=all" | jq
If you prefer a browser: the demo also includes a Web UI at http://localhost:3000 (enable the “All” status filter to see healed faults).
The key difference: ros2 topic echo /diagnostics goes quiet after recovery - zero trace anything happened. The fault record stays - with timestamps, occurrence count, and the full state history.
Next: Part 3 - Taming the Fault Storm: Debounce & Filtering
Part 1: ROS 2 Diagnostics Are Stuck in 2010
GitHub: selfpatch/ros2_medkit (Apache 2.0, ROS 2 Jazzy)
Demo repo: selfpatch/selfpatch_demos
