Your Robot's Fault Log Is a Haystack. The Real Error Is the Needle

Part 3 of “Beyond ros2 topic echo /diagnostics”

10 faults just fired on your robot.

SENSOR_NOISE_LIDAR. SENSOR_NOISE_IMU. MOTOR_OVERHEAT. SENSOR_NOISE_CAMERA. NAV_TIMEOUT_CONTROLLER. SENSOR_NOISE_ODOM. NAV_TIMEOUT_PLANNER. NAV_TIMEOUT_COSTMAP. NAV_TIMEOUT_TF. COMM_LATENCY_WIFI.

All warning level. All confirmed. All look identical on your dashboard.

One of them is a motor about to burn out. Can you spot it?


This is alert fatigue. Every mature monitoring system has solved it.

Prometheus has the `for` clause on alerting rules. PagerDuty has deduplication. Your car’s ECU has “pending” vs “confirmed” DTCs - it won’t turn on the check engine light until the same fault shows up across two drive cycles. That’s debounce. Without it, every pothole would trigger a dashboard warning.

diagnostic_updater has none of this. Every message is independent. There’s no memory between updates, no threshold, no “wait and see if this is real.” Plug it into any alerting system and you get a firehose.

ros2_medkit has four layers of filtering between a sensor glitch and a confirmed fault. Here’s what they do and how to tune them.

Layer 1: Local filter (client side)

Before a fault report even leaves your node, the FaultReporter can filter it locally.

# In your node's parameters
fault_reporter:
  ros__parameters:
    local_filtering:
      enabled: true
      default_threshold: 3       # need 3 occurrences...
      default_window_sec: 10.0   # ...within 10 seconds
      bypass_severity: 2         # ERROR and above skip the filter

How it works: the reporter keeps a sliding window of timestamps per fault code. SENSOR_NOISE_LIDAR has to fire 3 times within 10 seconds before the report is forwarded to the fault manager. A single glitch? Dropped silently.

bypass_severity: 2 means SEVERITY_ERROR and SEVERITY_CRITICAL skip the filter entirely - if something is genuinely broken you don’t want to wait.

Why filter on the client? Network. If you have 20 nodes each spamming fault reports at 10 Hz, that’s 200 service calls per second hitting the fault manager. Local filtering is your first line of defense.
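The sliding-window logic is simple enough to sketch in a few lines of Python. This is an illustrative model of the behavior described above, not the actual FaultReporter code - the class and method names here are made up:

```python
import time
from collections import defaultdict, deque

class LocalFilter:
    """Sketch of a client-side sliding-window filter (illustrative, not ros2_medkit's code)."""
    def __init__(self, threshold=3, window_sec=10.0, bypass_severity=2):
        self.threshold = threshold
        self.window_sec = window_sec
        self.bypass_severity = bypass_severity
        self.history = defaultdict(deque)  # fault_code -> deque of timestamps

    def should_forward(self, fault_code, severity, now=None):
        now = time.monotonic() if now is None else now
        if severity >= self.bypass_severity:
            return True  # ERROR and above skip the filter entirely
        window = self.history[fault_code]
        window.append(now)
        # Drop timestamps that fell out of the sliding window
        while window and now - window[0] > self.window_sec:
            window.popleft()
        return len(window) >= self.threshold

f = LocalFilter()
# A single glitch is dropped; the third report within 10 s goes through
print(f.should_forward("SENSOR_NOISE_LIDAR", severity=1, now=0.0))  # False
print(f.should_forward("SENSOR_NOISE_LIDAR", severity=1, now=1.0))  # False
print(f.should_forward("SENSOR_NOISE_LIDAR", severity=1, now=2.0))  # True
```

Note that the per-fault-code history is what keeps 3 different noisy sensors from confirming each other - each code gets its own window.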

Layer 2: Confirmation debounce (fault manager)

Even after passing the local filter, a fault isn’t immediately real. The fault manager uses a counter to decide.

# Fault manager parameters
fault_manager:
  ros__parameters:
    confirmation_threshold: -3    # need 3 FAILED events to confirm
    healing_enabled: true
    healing_threshold: 3          # need 3 PASSED events to heal

The counter starts at 0. Every FAILED event decrements it (-1, -2, -3…). Every PASSED event increments it (+1, +2, +3…). When the counter hits confirmation_threshold (-3), the fault is CONFIRMED. When it climbs back to healing_threshold (+3), the fault is HEALED.

This means a sensor that oscillates ERROR/OK/ERROR/OK never reaches the threshold - the PASSED events keep pulling the counter back up. Only a sustained failure gets through.

Counter:   0  -1   0  -1  -2  -1  -2  -3  ← CONFIRMED
Events:        F   P   F   F   P   F   F
               ↑       ↑
             glitch    real problem starts

SEVERITY_CRITICAL bypasses this - it confirms immediately. Because if the motor is on fire, you don’t wait for 3 data points.
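The counter behavior is easy to model. The sketch below replays the trace from the diagram - it mirrors the `confirmation_threshold` / `healing_threshold` semantics described above, but it is a simplified model, not the fault manager’s actual implementation (the severity bypass, for instance, is left out):

```python
class DebounceCounter:
    """Illustrative model of the confirm/heal counter, not the fault manager's code."""
    def __init__(self, confirmation_threshold=-3, healing_threshold=3):
        self.confirmation_threshold = confirmation_threshold
        self.healing_threshold = healing_threshold
        self.counter = 0
        self.state = "PREFAILED"

    def on_event(self, event):
        # Every FAILED event decrements the counter, every PASSED increments it
        self.counter += -1 if event == "F" else +1
        if self.counter <= self.confirmation_threshold:
            self.state = "CONFIRMED"
        elif self.counter >= self.healing_threshold:
            self.state = "HEALED"
        return self.counter

# Replay the trace from the diagram: glitch (F, P), then sustained failure
c = DebounceCounter()
print([c.on_event(e) for e in "FPFFPFF"])  # [-1, 0, -1, -2, -1, -2, -3]
print(c.state)  # CONFIRMED
```

An oscillating ERROR/OK/ERROR/OK sequence (`"FPFPFP"`) leaves the counter bouncing between -1 and 0 forever - it never reaches -3.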

Layer 3: Healing

When the problem resolves, PASSED events start flowing. The counter climbs back:

Counter:  -3  -2  -1   0  +1  +2  +3  ← HEALED
Events:        P   P   P   P   P   P

The fault heals automatically. No operator intervention needed. The robot fixed itself and the system recorded the whole episode - when it started, when it confirmed, when it healed, how many events total.

Without healing, you’d need an operator to manually clear every fault. At 3 AM. On a robot that already recovered on its own.
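The recorded episode boils down to a handful of fields. A hypothetical record could look like this - the field names are illustrative, not ros2_medkit’s actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FaultEpisode:
    """Hypothetical record of one fault lifecycle (illustrative, not ros2_medkit's schema)."""
    fault_code: str
    first_seen: float               # when the first FAILED event arrived
    confirmed_at: Optional[float]   # None if it never crossed the threshold
    healed_at: Optional[float]      # None while the fault is still active
    event_count: int                # total FAILED + PASSED events observed

episode = FaultEpisode("MOTOR_OVERHEAT", first_seen=100.0,
                       confirmed_at=104.2, healed_at=131.7, event_count=14)
print(f"{episode.fault_code}: active for {episode.healed_at - episode.first_seen:.1f} s")
```

That single record answers the 3 AM questions without anyone being awake to ask them.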

Layer 4: Auto-confirm timeout

Some faults are slow. A sensor drifts out of spec over minutes, reporting one FAILED event every 30 seconds. With confirmation_threshold: -3, it would take 90 seconds of sustained failure before confirmation.

Sometimes that’s too slow:

auto_confirm_after_sec: 30.0

If a fault stays in PREFAILED state for 30 seconds without enough PASSED events to resolve it, it auto-confirms. This catches the slow-burn problems that slip under the threshold-based approach.

Set to 0.0 to disable (default).
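The check itself is just an elapsed-time comparison against the time the fault entered PREFAILED. A sketch of that logic, under the assumption that the manager tracks when each fault entered PREFAILED (function and variable names here are illustrative):

```python
AUTO_CONFIRM_AFTER_SEC = 30.0  # 0.0 disables the check (the default)

def check_auto_confirm(state, prefailed_since, now):
    """Promote a fault that has lingered in PREFAILED past the timeout."""
    if state != "PREFAILED" or AUTO_CONFIRM_AFTER_SEC <= 0.0:
        return state
    if now - prefailed_since >= AUTO_CONFIRM_AFTER_SEC:
        return "CONFIRMED"
    return state

# A slow-drifting sensor: one FAILED every 30 s never reaches -3,
# but 31 s in PREFAILED confirms it anyway
print(check_auto_confirm("PREFAILED", prefailed_since=0.0, now=31.0))  # CONFIRMED
print(check_auto_confirm("PREFAILED", prefailed_since=0.0, now=10.0))  # PREFAILED
```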

The haystack vs the needle

Without filtering (threshold: 0):

  • 10 faults fire, all CONFIRMED instantly
  • Dashboard: wall of identical warnings
  • Operator: “which one matters?” → mutes alerts
  • MOTOR_OVERHEAT buried in SENSOR_NOISE and NAV_TIMEOUT

With debounce (threshold: -3):

  • Same 10 faults fire
  • 9 stay PREFAILED (transient - counter at -1, never reaches -3)
  • 1 CONFIRMED: MOTOR_OVERHEAT (5 sustained reports → counter hits -3)
  • Dashboard: one clear signal. The needle.

The data is still there. Click “show pending” and you see all 9 transient faults. Nothing is lost. It just doesn’t scream at you anymore.
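You can reproduce the arithmetic of that storm with a tiny simulation - a model of the Layer 2 counter, not the actual fault manager, with made-up fault codes:

```python
def confirmed_faults(storm, threshold):
    """Return the fault codes that cross the confirmation threshold.

    storm: dict of fault_code -> number of consecutive FAILED reports.
    threshold: confirmation_threshold (0 = confirm on the first report).
    """
    confirmed = []
    for code, failed_reports in storm.items():
        counter = 0
        for _ in range(failed_reports):
            counter -= 1
            if counter <= threshold:
                confirmed.append(code)
                break
    return confirmed

storm = {f"SENSOR_NOISE_{i}": 1 for i in range(9)}  # 9 transients, 1 report each
storm["MOTOR_OVERHEAT"] = 5                         # 1 real fault, 5 sustained reports

print(len(confirmed_faults(storm, threshold=0)))    # 10 -> the haystack
print(confirmed_faults(storm, threshold=-3))        # ['MOTOR_OVERHEAT'] -> the needle
```

Same input, one parameter change, and the dashboard goes from 10 confirmed faults to 1.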

Try it yourself

You just need ros2_medkit built in your workspace. No simulation required.

Terminal 1 - Fault Manager (storm mode, no debounce):

ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=memory -p confirmation_threshold:=0

Terminal 2 - Gateway (for REST API):

ros2 launch ros2_medkit_gateway gateway.launch.py

Terminal 3 - Fire the storm:

# Download the storm script
curl -O https://raw.githubusercontent.com/selfpatch/selfpatch_demos/main/demos/turtlebot3_integration/scripts/fault_storm.py

# Fire 9 noise faults (1 report each) + 1 real fault (5 sustained reports)
python3 fault_storm.py

# Check results
curl "http://localhost:8080/api/v1/faults?status=all" | jq
# → All 10 faults CONFIRMED. Wall of identical warnings.

Now restart the fault manager with debounce and fire the same storm:

# Terminal 1: restart with debounce
ros2 run ros2_medkit_fault_manager fault_manager_node \
  --ros-args -p storage_type:=memory -p confirmation_threshold:=-3

# Terminal 3: same storm
python3 fault_storm.py

curl "http://localhost:8080/api/v1/faults?status=all" | jq
# → 1 CONFIRMED (MOTOR_OVERHEAT). 9 PREFAILED. The needle.

Same faults. One YAML change.

Want the dashboard? Add the Web UI to see the difference visually - wall of orange vs one clear signal:

docker run -d -p 3000:80 ghcr.io/selfpatch/sovd_web_ui:latest
# Open http://localhost:3000, connect to http://localhost:8080

Want the full simulation? The TurtleBot3 demo includes Gazebo, Nav2, and the Web UI - run ./run-demo.sh or ./run-demo-debounce.sh.

Tuning for your robot

There’s no universal “right” threshold. It depends on your sensor, your environment, and how much noise you’re willing to tolerate.

| Scenario | Local threshold | Confirmation | Healing | Auto-confirm |
|---|---|---|---|---|
| Clean lab environment | 1 in 5 s | -1 (immediate) | 1 | off |
| Noisy outdoor sensors | 5 in 10 s | -3 | 5 | 30 s |
| Safety-critical system | 1 in 5 s | -1 | 10 | 5 s |
| High-frequency data (100+ Hz) | 10 in 5 s | -5 | 10 | off |

Start with defaults, tune after you see real data. The fault history gives you exactly the data you need to decide - how often does this fault fire? How long does it stay? Does it oscillate?
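Those questions are easy to answer with a few lines over a fault’s event history. The input format here is hypothetical - adapt it to whatever your fault history export actually looks like:

```python
def summarize(events):
    """events: list of (timestamp_sec, 'F' or 'P') tuples from one fault's history."""
    fails = [t for t, e in events if e == "F"]
    passes = sum(1 for _, e in events if e == "P")
    duration = fails[-1] - fails[0] if len(fails) > 1 else 0.0
    return {"failed": len(fails),
            "passed": passes,           # PASSED events interleaved -> oscillation
            "duration_sec": duration}

# A fault that fired 3 times in 8 s but recovered twice in between:
history = [(0.0, "F"), (0.5, "P"), (3.0, "F"), (3.4, "P"), (8.0, "F")]
print(summarize(history))
# An oscillating pattern like this stays PREFAILED under a "3 in 10 s" local
# filter with confirmation_threshold: -3 - which is probably what you want.
```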


Next: Part 4 - Flight Recorder: Snapshots & Rosbag on Fault

Part 1: ROS 2 Diagnostics Are Stuck in 2010 · Part 2: Your First Fault Manager in 5 Minutes


GitHub: selfpatch/ros2_medkit

Demo repo: selfpatch/selfpatch_demos
