Stop “Ghost Commands” on Bad Wi-Fi: ros2_kinematic_guard for `/cmd_vel` safety

Hi ROS Community,

Have you ever tuned DDS QoS for hours, only to see a mobile robot execute a “ghost command” after a bad Wi-Fi / 5G spike?

I’ve been working on a small ROS 2 project called ros2_kinematic_guard to address what I call Wireless Command Collapse.

The issue is not just packet loss. Sometimes the message arrives, but it is no longer safe or meaningful to execute by the time it reaches the robot.

The Problem: when timing betrays motion

A heartbeat or timeout can tell you whether messages are still arriving. It usually cannot tell you whether a /cmd_vel command still makes sense relative to the robot’s current odometry.

Common failure modes:

  • Ghost commands: an old command arrives late and gets executed after the robot state has already changed.
  • Burst / jitter windows: delayed commands are released together, causing abnormal acceleration or jerk demand.
  • QoS traps: certain reliability / history settings can turn network lag into a burst of outdated motion commands (see the sketch below).
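For instance, one way the QoS trap can arise, sketched with rclpy (the depth value is illustrative): a reliable subscription with a deep history queue holds messages back during an outage and then delivers them all at once when the link recovers.

from rclpy.qos import QoSProfile, ReliabilityPolicy, HistoryPolicy

# Reliable delivery plus a deep queue: commands held back during a
# Wi-Fi dropout arrive as one burst of outdated motion on reconnect.
burst_prone_qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    history=HistoryPolicy.KEEP_LAST,
    depth=50,
)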

The Solution: ros2_kinematic_guard

ros2_kinematic_guard does not try to fix DDS, QoS, or the network.

It sits between the incoming command stream and the robot controller:

/cmd_vel_in
    ↓
NARH Guard
    ↓
/cmd_vel_out

The guard evaluates a short local window:

  • previous command
  • current command
  • previous odometry
  • current odometry

It computes a lightweight NARH-lite residual, R_NAR, based on timing consistency, stale-command risk, acceleration / jerk limits, and command-vs-odom consistency.
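In spirit, the residual is a sum of normalized penalty terms, one per risk source. A minimal sketch, assuming equal weights (the repo’s actual formula and weighting may differ):

def r_nar_lite(dt_cmd, dt_nominal, cmd_age, max_age,
               accel_demand, max_accel, v_cmd, v_odom, v_scale):
    # Toy NARH-lite residual: each term is ~0 when healthy, grows under stress.
    timing = abs(dt_cmd - dt_nominal) / dt_nominal         # jitter / dt collapse
    staleness = max(0.0, cmd_age / max_age)                # stale-command risk
    accel = max(0.0, abs(accel_demand) / max_accel - 1.0)  # accel/jerk excess
    mismatch = abs(v_cmd - v_odom) / v_scale               # cmd-vs-odom conflict
    return timing + staleness + accel + mismatch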

When R_NAR crosses a critical threshold, the guard enters a deterministic state machine:

RED_BRAKE → BRAKE_AND_RESYNC → RESYNCING

That means it cuts motion, flushes poisoned command windows, waits for a fresh command/odom window, and only then releases control.
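A minimal sketch of that state machine; the transition conditions are illustrative rather than the repo’s exact logic:

from enum import Enum, auto

class GuardState(Enum):
    GREEN = auto()             # forwarding commands normally
    RED_BRAKE = auto()         # motion cut
    BRAKE_AND_RESYNC = auto()  # poisoned command window flushed
    RESYNCING = auto()         # waiting for a fresh command/odom window

def step(state, r_nar, r_crit, window_is_fresh):
    if state is GuardState.GREEN and r_nar >= r_crit:
        return GuardState.RED_BRAKE
    if state is GuardState.RED_BRAKE:
        return GuardState.BRAKE_AND_RESYNC
    if state is GuardState.BRAKE_AND_RESYNC:
        return GuardState.RESYNCING
    if state is GuardState.RESYNCING and window_is_fresh and r_nar < r_crit:
        return GuardState.GREEN  # only now is control released
    return state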

Why not just use heartbeat / timeout?

Failure Mode   | Traditional Heartbeat / Timeout | NARH Kinematic Guard
---------------|---------------------------------|--------------------------------------------
Packet Loss    | Can detect silence              | Can detect silence and brake
Stale Command  | Often missed                    | Detected via kinematic inconsistency
Burst / Jitter | Often missed                    | Detected via residual spike / dt collapse
Stale Replay   | Often missed                    | Detected via timing + odom conflict
Recovery       | Time-based                      | Resync gate with fresh command/odom window

Try it in 30 seconds, no robot required

The repo includes a complete Bad-Wi-Fi pressure test loop:

  • jitter_injector_node.py: creates delayed / duplicated / bursty / replayed commands
  • kinematic_guard_node.py: computes R_NAR and outputs protected safe_cmd_vel
  • synthetic_odom_provider.py: acts as a virtual robot body and publishes /odom

Run:

# from the root of a colcon workspace containing ros2_kinematic_guard
source /opt/ros/humble/setup.bash
colcon build --symlink-install
source install/setup.bash

ros2 launch ros2_kinematic_guard start_pressure_test.launch.py profile:=wifi_collapse

Then watch:

ros2 topic echo /kinematic_guard/status
ros2 topic echo /kinematic_guard/residual

Example guard response:

{
  "status": "RESYNCING",
  "action": "BRAKE_AND_RESYNC",
  "r_nar": 5.749,
  "safe_cmd": {
    "vx": 0.0,
    "wz": 0.0
  }
}

At the default 20 Hz guard loop, intervention happens on the next guard tick, i.e. within about 50 ms.

Repository:

Background

This project is an engineering projection of NARH — the Non-Associative Residual Hypothesis. In this ROS 2 version, NARH is used as a lightweight command-flow consistency metric, not as a full dynamics solver.

I’d love feedback from anyone running mobile robots over unreliable Wi-Fi / 5G links.

I’m especially interested in real /cmd_vel + /odom bag/MCAP logs showing strange jitter, stale-command behavior, burst delivery, or other command-flow anomalies.


Oops. Someone is really publishing cmd_vel over Wi-Fi? What’s the use-case? I just remember we did some teleoperation over Wi-Fi, but it was quite unreliable, so we switched to (relative) positional control instead of direct velocity control.

Also, instant stop isn’t a safe action for many robots. But yes, there are lots of those that can withstand instant stops.

Good points. On /cmd_vel over Wi-Fi: relative positional control is a robust alternative for high-latency links, but it is not always a practical drop-in fix in the current ROS 2 ecosystem.
A lot of existing mobile robot stacks are still built around velocity streams:

  • Nav2-style local planners commonly output geometry_msgs/Twist
  • many mobile-base drivers expose /cmd_vel as the main control interface
  • recovery teleop, web dashboards, and research platforms often still use velocity commands over wireless links

So my goal is not to claim that velocity-over-Wi-Fi is the ideal architecture. It is to recognize that this pattern exists in the field and add a low-friction guard in front of it. ros2_kinematic_guard is meant to act as a plug-in kinematic sanity gate for existing systems.
On instant stop: agreed. RED_BRAKE should not mean “slam every robot to zero instantly.” In this project, RED_BRAKE means the incoming command stream is no longer trusted. The guard stops forwarding poisoned commands and switches to a safe output path. For a 1 kg test robot, that may be zero velocity. For a 500 kg AMR, it should map to a controlled deceleration ramp or a lower-level safe-state handler.
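As a rough illustration of that mapping, a minimal sketch (the function name and max_decel parameter are hypothetical, not the project’s API):

def ramp_toward_zero(v_prev, dt, max_decel):
    # Step the last commanded velocity toward zero without exceeding
    # max_decel (m/s^2); a 500 kg AMR would use a much smaller max_decel
    # than a 1 kg test robot.
    step = max_decel * dt
    if abs(v_prev) <= step:
        return 0.0
    return v_prev - step if v_prev > 0.0 else v_prev + step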

The key point is:

/cmd_vel may still be syntactically valid, but no longer kinematically executable.


For remote teleoperation, multiple safety mechanisms should be put in place. The concern you describe is addressed by stamping control commands with the (robot’s) timestamp of the last video frame received by the (remote) operator before issuing the command, and ignoring any commands with timestamps that are too old (say 500 ms).

This is non-trivial, so let’s unpack this a bit:

  • You cannot safely compare timestamps from robot and operator device clocks, so all timestamps you want to use for comparison need to come from the same device, here the robot.
  • This is why you need to forward the timestamp of video frames to the operator.
  • The operator’s device just acknowledges that timestamp and sends it back with the actual control command (e.g., “go forward, based on video frame with timestamp 1778439071877”).
  • When receiving that control message back on the robot, we can compare the provided timestamp with the robot’s clock and decide whether it is still “fresh” enough.

This setup seems correct in the sense that the operator’s rationale for requesting the described action (here “go forward”) was based on the last image they saw. Strictly speaking, it might be safer to also account for human reaction time and use a slightly older video-frame timestamp, but let’s not go there right now.
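A minimal robot-side sketch of that freshness gate (the helper and its timestamp units are illustrative; the 500 ms budget is from above):

def command_is_fresh(frame_stamp_ns, robot_now_ns, max_age_ms=500.0):
    # Both timestamps come from the robot's clock: the frame stamp was
    # issued by the robot and merely echoed back by the operator.
    age_ms = (robot_now_ns - frame_stamp_ns) / 1e6
    return 0.0 <= age_ms <= max_age_ms  # reject stale (or impossible) stamps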

Missing this feature can be quite dangerous in practice, and there are more pitfalls like it, so I urge people not to try to build their own remote teleop for fielded robots. We’ve just seen what happened to Yarbo.

You can find all the safety features we believe are required here:

You’ve described the “ideal” closed-loop teleoperation protocol. Linking control commands back to the video frame’s timestamp is indeed the robust way to ensure semantic freshness.

However, my project focuses on the “Engineering Reality” of existing deployments:

  1. Legacy & Third-party Constraints: Many engineers are using off-the-shelf mobile base drivers (e.g., standard ROS 2 motor controllers) that only accept a simple geometry_msgs/Twist. These drivers often lack the internal logic to handle the video-timestamp-handshake you described.
  2. Beyond Teleop: Jitter-induced “Ghost Commands” don’t just happen in manual teleop. They also occur in autonomous navigation when the local planner (running on a separate compute node) experiences network spikes or CPU contention while sending velocity streams to the base.
  3. Defense in Depth: Even with a perfect timestamp protocol, a hardware-level or kinematic-level “Sanity Gate” is a valuable second layer of defense.

My goal with ros2_kinematic_guard is not to replace the professional protocols you mentioned, but to provide a “Kinematic Firewall” for the thousands of robots already out there that don’t have that level of protocol integration. It’s a “safety net” for when the primary system logic (or the human operator) is betrayed by the underlying transport layer.


Yes, definitely. And my comment was not questioning the value of your project at all. I see how it is complementary to the things I was proposing.


Reigniting a somewhat old discussion :wink: Nice line of thinking here, thank you for raising awareness!

I don’t want to turn this into a LinkedIn-style thing where everyone boasts about their own project instead of reacting to yours, but I’d like to add some more context, which I guess will go in that direction. I’m sorry :person_bowing:

The mobile-base controllers in ros2_controllers (diff_drive, omni_wheel, mecanum, tricycle, ackermann/bicycle/tricycle_steering) already drop cmd_vel messages whose header.stamp is older than cmd_vel_timeout / reference_timeout, and zero the output in the update loop when no fresh command has arrived. Acceleration and jerk limits are handled by SpeedLimiter in diff_drive (and equivalents elsewhere).

For ref: diff_drive_controller — ROS2_Control: Humble May 2026 documentation

To lean further into the importance of stamping those Twist messages, we’ve removed the option of subscribing to unstamped Twist messages altogether, with Humble being the last distro to support them (there, stamped input is an opt-in via the use_stamped_vel parameter).

I’m not done yet!!!
Realizing that this isn’t something unique to mobile bases, we’ve also looked into implementing a “decelerate to stop on cancel” functionality to the joint_trajectory_controller so those arms that don’t have a smooth stop function supported by the hardware can rely on it: Decelerate on cancel — ROS2_Control: Rolling May 2026 documentation


Thanks, this is very useful context.
I agree that ros2_control is already doing the right thing inside the controller layer. Stamped velocity references, timeouts, speed limiting, and smooth deceleration-on-cancel are exactly the kinds of protections that should exist close to the hardware.

I see ros2_kinematic_guard as complementary rather than competing with that.
The gap I’m trying to explore is one layer higher:

  • Controller-agnostic protection: Not every robot runs the full ros2_control stack. Many AMRs, research platforms, and proprietary bases expose only a simple /cmd_vel subscriber or a vendor-specific driver. A guard node can be inserted before those controllers without refactoring the base driver.
  • Quantitative telemetry, not only local rejection: Dropping stale commands is necessary, but it is usually a local controller action. The guard exposes a residual score, R_NAR, and state transitions such as YELLOW_SLOWDOWN, RED_BRAKE, and RESYNCING. That makes timing degradation observable to higher-level systems, logs, fleet managers, or audits.
  • Command-vs-odometry consistency: Timeouts answer “is this command fresh enough?” The guard asks an additional question: “is this command still kinematically consistent with the current odometry window?” That catches cases where the message is not technically expired, but the command/feedback phase relationship has become suspicious.

So I see the roles as:

ros2_control: enforce safe execution inside well-structured controllers 

ros2_kinematic_guard: provide a controller-agnostic command-integrity gate and diagnostic layer before commands reach the controller

I’m especially interested in whether this kind of quantitative integrity score could support fleet-level diagnostics, safety cases, VDA 5050-style state reporting, or post-incident bag/MCAP analysis.

Thanks again for pointing to the existing ros2_control mechanisms. I’ll make sure the README clarifies this as a complementary diagnostic layer, not a replacement for controller-level safety.

Small follow-up, especially after the useful ros2_control context shared above.
I think there is a useful distinction here:

ROS 2 / ros2_control:
  vehicle-internal command execution and controller-side protection

VDA 5050 / fleet systems:
  fleet-level interoperability, traffic coordination, and state reporting

These layers are complementary.
The question I am now thinking about is whether ROS 2 should expose a more generic command-integrity telemetry concept that can be consumed by higher-level systems.
For example, a vehicle-side node or controller could report:

command_age_ms
jitter_ms
stale_command_detected
burst_detected
command_odom_residual
execution_state: GREEN / YELLOW / RED / RESYNCING
recommended_response: NONE / SLOWDOWN / HOLD / RESYNC_REQUIRED

This would not replace ros2_control timeouts, stamped references, speed limiters, or smooth stop mechanisms. Instead, it would make command degradation observable outside the controller:

  • useful for logs and bag/MCAP post-analysis
  • useful for fleet managers
  • useful for VDA 5050-style state reporting
  • useful for heterogeneous robots that do not all run the same controller stack

So the role I imagine is:

ros2_control:
  executes safely inside well-structured controllers

command-integrity telemetry:
  reports whether the recent command/feedback window is still trustworthy

fleet / orchestration layer:
  uses that signal for degraded-mode coordination

Would it make sense to formalize this into a standardized diagnostic_msgs/DiagnosticStatus convention, or eventually a dedicated CommandIntegrity message?

That would allow ROS 2 vehicles to expose their command-execution health to orchestration layers in a predictable way, without forcing every robot to use the same controller stack.

This also avoids overloading VDA 5050 with low-level ROS details. ROS 2 can produce the vehicle-side execution-integrity signal; VDA 5050 or another fleet interface can summarize it at the fleet level.
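For concreteness, a minimal sketch of how the proposed fields could be packed into the existing diagnostic_msgs interface (the field names mirror the list above; the mapping is only a proposal, not an established convention):

from diagnostic_msgs.msg import DiagnosticStatus, KeyValue

def to_diagnostic_status(fields, execution_state):
    # GREEN/YELLOW map onto OK/WARN; RED and RESYNCING map onto ERROR.
    level = {"GREEN": DiagnosticStatus.OK,
             "YELLOW": DiagnosticStatus.WARN}.get(execution_state,
                                                  DiagnosticStatus.ERROR)
    status = DiagnosticStatus(level=level, name="command_integrity",
                              message=execution_state)
    status.values = [KeyValue(key=k, value=str(v)) for k, v in fields.items()]
    return status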

cc @bmagyar since this builds on your ros2_control context above.

Small follow-up from this thread.
The discussion here helped clarify the boundary between controller-side protections and higher-level observability. ros2_control already provides important execution-layer mechanisms such as stamped references.
I’ve now refocused ros2_kinematic_guard as a command-integrity telemetry bridge rather than only a local /cmd_vel guard. The new reporter node maps the NARH residual into:

/diagnostics
diagnostic_msgs/DiagnosticArray

/command_integrity/vda5050_state
VDA5050-style fleet telemetry

I opened a new thread with the updated demo here: [Demo] Bridging ROS 2 and VDA5050-style Fleet Telemetry: Command Integrity via NARH
The main question is now less “should this replace controller safety?” and more:

Should ROS 2 expose command-execution integrity as a diagnostic/fleet-observability signal?
