We’ve just open-sourced our ROS benchmarking suite! Built on top of iRobot’s ros2-performance framework, this is a containerized environment for simulating arbitrary ROS 2 systems and graph configurations, both simple and complex, comparing the performance of various RMW implementations, and identifying performance issues and bottlenecks.
Support for Jazzy, Kilted, and Rolling
Fully containerized, with experimental support for ARM64 builds via Docker Bake
The container includes Fast DDS, CycloneDDS, and Zenoh out of the box.
Are you building a custom RMW or ROS executor not included in this tooling, and want to compare against the existing implementations? We provide instructions and examples for how to add them to this suite.
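For context, switching which middleware a run exercises uses the standard ROS 2 `RMW_IMPLEMENTATION` environment variable; this is a general ROS 2 mechanism, not something specific to this suite. A minimal sketch of driving benchmark processes with different RMWs from Python (the package names are the usual RMW package names; a custom RMW would slot in with its own package name once installed in the container):

```python
import os
import subprocess

# Standard ROS 2 env var for selecting the middleware at runtime.
# A custom RMW package would be added to this list once it is installed.
RMW_CHOICES = ["rmw_fastrtps_cpp", "rmw_cyclonedds_cpp", "rmw_zenoh_cpp"]

def env_for(rmw: str) -> dict:
    """Return a copy of the environment with the given RMW selected."""
    env = dict(os.environ)
    env["RMW_IMPLEMENTATION"] = rmw
    return env

# Usage (sketch): launch one benchmark run per middleware, e.g.
#   subprocess.run(["ros2", "run", "<pkg>", "<benchmark>"], env=env_for(rmw))
print(env_for("rmw_zenoh_cpp")["RMW_IMPLEMENTATION"])
```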
Huge shoutout to Leonardo Neumarkt Fernandez for owning and driving the development of this benchmarking suite!
Thanks for sharing this benchmarking suite: great work by the iRobot team and contributors. It’s very valuable to see open, reproducible benchmarks like this for the ROS 2 ecosystem.
It’s also encouraging to see that, in the evaluated scenarios, multiple middleware implementations (including Fast DDS, CycloneDDS and Zenoh) show very similar performance results. From our side, this confirms that Fast DDS can deliver competitive performance while supporting a broad feature set and being used in a wide range of production and long-lived systems.
Looking forward to seeing how the suite evolves and to future results.
Really great project. Way cleaner than my small scripts to debug different configurations.
I love the option to have a “heterogeneous” setup with local and remote clients and multiple subscribers with different rates and message sizes. Most benchmarks that compare different RMWs and profiles miss this topology, which is quite essential for our setups with mobile robots, which have lots of local traffic but also remote clients for debugging and monitoring.
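For reference, ros2-performance describes such graphs with JSON topology files. A hypothetical sketch of a heterogeneous setup like the one described above (field names follow my recollection of the topology examples in the ros2-performance repo and may not match the exact schema; check the repo’s topology files):

```json
{
  "nodes": [
    {
      "node_name": "camera",
      "publishers": [
        { "topic_name": "image", "msg_type": "stamped4mb", "period_ms": 100 }
      ]
    },
    {
      "node_name": "local_controller",
      "subscribers": [ { "topic_name": "image" } ]
    },
    {
      "node_name": "remote_monitor",
      "subscribers": [ { "topic_name": "image" } ]
    }
  ]
}
```

In a setup like this, the `remote_monitor` node would be launched on the remote machine, with bandwidth shaping applied to the link between the two hosts.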
I would love to see some compiled graphs of your results. This could be quite eye-opening for many and serve as a baseline for debugging our own misconfigurations.
For example: we saw a significant performance drop for large data with best-effort remote subscribers, which would break the system when using cyclonedds or fastrtps.
We go with Zenoh now, as it was the only middleware without a deal-breaking issue when handling large data or hitting bandwidth limitations.
Did you look into Python clients as well? With our benchmark we noticed quite a stark performance issue, which sometimes even caused backpressure on the C++ publisher and thus on C++ clients with best-effort QoS.
A eulogy for ros1: in our tests tcpros was faster than all ros2 RMWs and profiles (cyclonedds, fastrtps, zenoh) in all hybrid configurations apart from extremely high-frequency, small data (50 kHz, 1 kB). Zenoh is almost as fast as tcpros, but for Python subscribers and large data in hybrid topologies it reaches only 50% of tcpros throughput. Do you experience the same?
Latency is way better in ros2 than in ros1 in our experience: we see 200-400 µs node-to-node latency, compared to multiple milliseconds in ros1.
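Node-to-node latency figures like these are typically obtained by stamping each message at publish time and computing the delta on receipt. A transport-agnostic sketch of the idea, using a plain in-process queue in place of an RMW (the function and queue names are illustrative, not from the suite):

```python
import time
from queue import Queue
from statistics import mean

def publish(queue: Queue, payload: bytes) -> None:
    # Stamp the message with a monotonic send time before handing it off.
    queue.put((time.monotonic_ns(), payload))

def measure_latencies(queue: Queue, n: int) -> list:
    # On the subscriber side, compare receive time against the embedded stamp.
    latencies_us = []
    for _ in range(n):
        sent_ns, _payload = queue.get()
        latencies_us.append((time.monotonic_ns() - sent_ns) / 1000.0)
    return latencies_us

q = Queue()
for _ in range(100):
    publish(q, b"x" * 1024)
print(f"mean latency: {mean(measure_latencies(q, 100)):.1f} us")
```

With a real RMW in the middle, the same stamp-and-diff approach captures serialization, transport, and executor overhead; with a monotonic clock it is only valid within one host, so cross-host runs need synchronized clocks or round-trip measurements.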
CPU usage was around 10-15% higher compared to ros1 on our initial port to ros2 Iron, but the reason for that was not DDS, but the executor. Later ros2 versions got more optimized executors, but our stack is now so different that it is not comparable any more.
Throughput never came up on our radar (we run only localhost applications), so I guess it is ‘good enough’.
Note: we only use rclcpp. In the few Python nodes we have, we see absurdly high CPU usage unless we use the events executor.
I would love to see some compiled graphs of your results. This could be quite eye-opening for many and serve as a baseline for debugging our own misconfigurations.
Ahead of a talk I’m giving today, here are a few runs I did with Jazzy on a Raspberry Pi 4 (4 GB), one for each rclcpp executor (plus the planned new EventsExecutor from @JM_ROS!). Zenoh was configured with a shared-memory size of 48 MB for all of these.
Did you look into Python clients as well? With our benchmark we noticed quite a stark performance issue, which sometimes even caused backpressure on the C++ publisher and thus on C++ clients with best-effort QoS.
Right now, ros2-performance is exclusive to rclcpp (as that is what we use at iRobot), so there are definitely gaps in our benchmarks for rclpy (and rclrs). I would love for the benchmark tool to be ported to those languages so we can exercise the other client libraries with these tests.
I think it would be quite useful for a future version of this suite to be run every so often in CI, so we can monitor the client library performance over time across each distro and detect performance regressions when they happen (instead of over the weekend). When we wrote this tooling, we were intending to do something similar for our rclcpp forks.
What I’m interested in is the throughput in a heterogeneous setup, e.g. a publisher with local subscribers plus a remote subscriber that has bandwidth limitations. That’s why I was thrilled about the repo, which seemed to feature a remote subscriber as well.
Probably this is the major difference between our setups, but it is the main reason why we use ROS: having a mobile robot and monitoring the system remotely (without disturbing it).
I created large pivot tables with my experiments, but the boiled down result is this:
With small data everything is fine, and ros2 is even faster than ros1, but with larger data (laser scans or images) the different middlewares fail, sometimes catastrophically, and sometimes even locally.
The local subscribers are never CPU- or memory-limited in my case, but somehow they get throttled (you would probably see this as publisher latency) when a remote subscriber joins, at least with my tested configurations.
We had setups where CPU usage mattered, but for the moment I would be happy if performance stayed within usable regimes on a quite potent (i9, 64 GB) PC, and optimize the “details” later.
FastRTPS and CycloneDDS break down to a level where the onboard control loop would no longer be operational, solely because of a remote subscriber. It’s like a quantum state that collapses as soon as you try to take a measurement.
Zenoh is not perfect in this regard either, but it is still 3 times faster than CycloneDDS. The remote side gets very little data with Zenoh, but in my scenario this is “only” for monitoring.
The only middleware that delivered in my tests was tcpros, which kept the local rate while the remote rate was limited only by the bandwidth (it was WiFi at 80 MB/s, so ~88% utilization of that link).
Does Zenoh have a memory leak? Or do you know why the RAM usage seems to rise?
I think it probably has something to do with queries_default_timeout. It was changed to 10m in this PR so that it would not time out in CI. I had tried a separate run before these benchmarks where that timeout was dropped to 60000, and that seemed to work, but it could be that I applied it wrong in the config for these benchmarks.
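For anyone wanting to experiment with that knob: the value lives in the Zenoh session config consumed by rmw_zenoh. A minimal sketch, assuming the key sits at the top level of the JSON5 config as in upstream Zenoh (verify the exact key location against the default rmw_zenoh session config for your version):

```json5
{
  // Zenoh session config fragment (JSON5, so comments are allowed).
  // Value is in milliseconds; 60000 = 1 minute, the shorter timeout tried above.
  "queries_default_timeout": 60000
}
```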
Was your talk public and recorded?
ROS By The Bay typically doesn’t record the talks, but I’m happy to share my slides! ros2-benchmark-container.pdf (3.3 MB)