I was fighting some performance issues in our simulation for a subject I teach.
It boiled down to one slight performance improvement in rclpy which I started discussing, but the larger surprise was how super-bad FastDDS is with distributing the 250 Hz /clock topic.
I did an experiment with our full autonomy stack (but on a Turtlebot with 2D lidar, so nothing heavy). Most nodes run rclpy with MultiThreadedExecutor (more on that later). There are in total 16 nodes that subscribe /clock topic and Gazebo runs at ~70% RTF, so the real frequency of /clock is more like 200 Hz.
The vertical axis shows the dt - delta time, i.e. the time difference between two consecutive /clock messages received by the node (measured by instrumenting the default clock callback of a node).
With FastDDS, some nodes sometimes do not receive any /clock message for more than 0.5 s!!! With Zenoh, the worst case is about 0.18 s.
Even more interesting is the best case - FastDDS achieves 0.004 dt (what is expected) for only a small fraction of the dataset, like 5%. Zenoh achieves this dt most of the time (the almost solid blue line on the bottom). (no, this blue line does not hide the pink dots, there are almost none underneath it).
This is the histogram of dt values (notice logarithmic axis, Zenoh left, FastDDS right):
I understand that the combination of rclpy and MultiThreadedExecutor is one of the worst things you can do in ROS 2, but still, this huge difference in usability between FastDDS and Zenoh hits me.
I did a few more tests with SingleThreadedExecutor, and it helps a bit in one node (no large subscribed topics), but it yields way worse results for /clock topic in other nodes (those that process LaserScan) messages.
So my conclusion is that the MultiThreadedExecutor itself is not the culprit.
After closer inspection of our code, I came to the conclusion that a part of the problem is that we do some computations in some callbacks (e.g. the ICP alignment). The callbacks are not super-long, but for a lidar coming at 10 Hz, they can sometimes swallow up to 0.2 s, which would be too slow for a single thread. But our nodes know that and use multithreaded executor for this reason.
On single-threaded executors (and events executor), this means that during the computation, no clock messages are processed, and as they are BEST_EFFORT/KEEP_LAST(1), they just get thrown away.
What I still donāt understand that much is why are the spikes in dt so large even for multithreaded executor. The executor has enough threads to handle all messages concurrently. My working theory is that the RMW loses some messages in-flight (clock has KEEP_LAST(1) so the publisher may easily give up when the output socket is congested; but is it even possible for localhost to get congested unless youāre reaching RAM/kernel bandwidth?). But I havenāt proved this yet. And because of the loss of messages (both clock and TF), the callbacks could take longer and longer until reaching TF timeouts. From the experiments I did, it seems that FastDDS is more prone to this loss of messages than Zenoh.
I think that a big part of the problem is exactly the fact that clock messages are blocked by ānormalā user callbacks and time is basically paused in any user callbacks (using the default single-threaded executor). It seems to me that /clock, similar to TransformListener, should be (by default, or at least on request) run in its own node and its own thread. This would at least relieve the users from thinking about how time advances during simulation. With the current design, it seems to me that writing a node that works only in real time would use very different design patterns than sim-ready nodes (youād basically need to keep all your callbacks super-short and have a pool of ready worker threads to which you just relay the work after receiving a message). Iāve added this idea to Insulate /clock callback on specific thread Ā· Issue #792 Ā· ros2/rclpy Ā· GitHub . If thereās interest for it, Iāll try to implement it.
Hi @peci1 , kind of an extreme solution. But you can write a gazebo plugin which uses the native zenoh bindings for publishing the clock while following the rmw_zenoh spec( liveliness, attachment, qualified topic name, fully compatible key expression, proper serialisation). This topic will then by discoverable and subscribable via ROS2(assuming you have a zenoh router running). You can also set message priority when using the native zenoh api, so even if you have other zenoh nodes/topics this will still take priority.
Did you make sure to pass separate callback groups for each callback you want to run in parallel. In rclpy, each callback defaults to a shared callback group, so switching to multi-threaded executor is not enough for parallel execution. Additionally, if you want the same callback to be able to run before the previous call is finished, make sure to pass a ReentrantCallbackGroup. (more info here)
Yes, this is one interesting finding I realized while inspecting this issue. Initially, we did not care about callback groups. Then I at least implemented the workaround for /clock and TF to be on their own groups, so that clock and TF messages should no longer be distracted by long-running callbacks on the multi-threaded executors. This sole thing itself helped a lot! Iāve contributed a small change that should allow this for TFs natively. An approach for clock is still being discussed in rclpy.
Other from that, Iād like to rewrite our processing to be able to run in parallel, but that needs a little bit more thinking and restructuring.
This is a fantastic (and terrifying) visualization of what Iāve been calling āWireless Command Collapseā, though youāve proven it happens even in local rclpy simulations!
Those 0.5s gaps in /clock are exactly the kind of ātiming betrayalsā that lead to PID derivative kicks and kinematic jumps. Iāve been working on ros2_kinematic_guard to handle precisely these scenarios.
While you are comparing RMWs to find a more stable dt, Iām focusing on what the robot should do when those pink dots (spikes) occur. In my project, these gaps trigger a NARH-lite residual spike, moving the robot into a BRAKE_AND_RESYNC state instead of letting it execute a burst of āstaleā commands that accumulated during that 0.5s silence.
Your data perfectly illustrates why a āPhysical Firewallā is needed regardless of the RMW choiceāeven Zenoh has its jitter spikes!