ROS2 "State of the Events Executors" - Benchmark comparison between rclcpp::experimental::EventsExecutor and cm_executors::EventsCBGExecutor

As part of the upcoming ROS2 Lyrical Luth release, the client library working group has been planning to mainstream an EventsExecutor implementation as the new default executor in rclcpp. The current experimental implementation is limited by its inability to properly handle simulation time, unlike the EventsCBGExecutor implemented by @JM_ROS over at Cellumation, which properly handles sim time and also offers a multithreaded mode. As a first step towards mainstreaming an EventsExecutor implementation, we ran an extensive set of benchmarks built on top of iRobot’s ros2-performance framework. (Keep an eye out, as we are hoping to eventually open-source the full benchmark test suite!)

This post will serve as a deep dive into the performance characteristics of the two executors, as well as a jumping-off point for discussing the overall state of executors (and middleware implementations) in ROS2. (This is a cleaned-up rewrite of a GitHub gist where I originally collected all the benchmark info.)

Some notes about the benchmarks:

  • Benchmark environment:
    upstream ROS rolling docker container running on an x86 developer laptop under minimal load
    rclcpp rolling: b14af74a4c9b8683e72b15d61d0ed9121d883973
    cm_executors: 783a5e329ee8b04abfa3b3397532e979576a2b1f
    ros2_performance: 4528f43410922379b8da501630d9d938046e48e8
  • This suite of benchmarks was run at least 3 times per implementation, to ensure consistent results. For brevity’s sake, we’ll stick to one graph each for this analysis, but the full set of results will be made available elsewhere.
  • ipc_on = running with intra-process mode
  • In the latency tests, max latency signifies the highest single latency measurement taken for that message size. We didn’t do any outlier filtering on this dataset (aside from dropping the high latencies from the first few seconds), so this value is known to vary considerably more from run to run than the mean.
  • In the process of producing these benchmarks, we discovered a bug in how the clients/services CPU usage graphs were generated for single and multi process. Graphs were generated for each, but the underlying data represents just the single process benchmark, so we’ll only cover single process clients/services CPU usage.
  • There were a few tests we couldn’t run with the EventsCBGExecutor because of freezes or crashes, and so those tests were also omitted for the upstream EventsExecutor.
  • For a more 1:1 comparison between the two executors, the EventsCBGExecutor was fixed to use just one thread.
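
To make the max latency note above concrete, here is a minimal sketch of how a mean and an unfiltered max could be computed after discarding the warm-up period. It is not the ros2-performance implementation: the `(timestamp_s, latency_us)` sample layout, the `summarize_latency` name, and the 5 second warm-up window are all assumptions made for illustration, and the sample values are made up.

```python
from statistics import mean


def summarize_latency(samples, warmup_s=5.0):
    """Summarize latency samples after dropping the warm-up period.

    samples: list of (timestamp_s, latency_us) tuples (hypothetical layout).
    warmup_s: assumed warm-up window; the post only says "first few seconds".
    """
    if not samples:
        raise ValueError("no samples")
    start = min(t for t, _ in samples)
    # Keep only measurements taken after the warm-up window has elapsed.
    kept = [lat for t, lat in samples if t - start >= warmup_s]
    if not kept:
        raise ValueError("all samples fall inside the warm-up window")
    # No outlier filtering: max_us is the single worst measurement.
    return {"mean_us": mean(kept), "max_us": max(kept), "count": len(kept)}


# Made-up samples: two slow warm-up measurements, then steady state with one spike.
samples = [(0.0, 900.0), (1.0, 850.0), (6.0, 120.0), (7.0, 110.0), (8.0, 400.0)]
stats = summarize_latency(samples)
print(stats)
```

Because a single spike sets the max while barely moving the mean, the max is the noisier of the two numbers when comparing runs.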

Takeaways, tl;dr:

  • Despite some initial concerns about marginally higher CPU usage for the EventsCBGExecutor compared to the experimental EventsExecutor, there doesn’t appear to be much of a difference across the characteristics we tested, with the following exceptions:

    • EventsExecutor performed slightly better on the long running pub/sub CPU usage test.
    • EventsCBGExecutor performed slightly better on the long running actions CPU usage test.
  • Both executors demonstrate memory leaks in the longer running pub/sub and actions tests. After further investigation, the SingleThreadedExecutor and MultiThreadedExecutor also show a climb in memory for pub/sub, while actions remain stable for the SingleThreadedExecutor (except for rmw_zenoh).

  • As we step through the benchmarks, I’ll point out any differences between the executors as they appear.

CPU Usage - Pub/Sub - Single Process

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

We can see that the max y axis for the second graph (EventsCBGExecutor) is much higher, because CycloneDDS seemingly causes the test to consume far more CPU at larger message payloads. The results are otherwise highly comparable. This difference in CPU for CycloneDDS specifically was consistent across all runs of the benchmark suite.

CPU Usage - Pub/Sub - Multi Process

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

Interestingly, in multi-process mode the climb to ~4-5% of a core at larger payload sizes is now consistent for both executors when running CycloneDDS. Otherwise, both executors seem to put up similar results here.

CPU Usage - Services / Clients - Single Process

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

CPU Usage - Pub/Sub - Long Running Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

The usage pattern for both executors appears fairly similar, with the EventsExecutor averaging around 0.05 - 0.1% less CPU usage than EventsCBGExecutor in most runs.

CPU Usage - Services / Clients - Long Running Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

CPU Usage - Actions - Long Running Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

We again see a similar usage pattern between the two executors, but with the EventsCBGExecutor consistently maxing out at ~2% less CPU than the EventsExecutor and with a smoother looking graph.

Publisher Latency - Single Process

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

Subscriber Latency - Single Process

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

Huge differences in max latency aside, we see comparable results here between the two executor implementations for both pub and sub latency. The mean comparison demonstrates extremely similar results, including CycloneDDS’s extreme latency increases at higher payload sizes. These latency increases appear to set in at slightly smaller payloads with EventsCBGExecutor than with EventsExecutor.

Publisher Latency - Multi Process

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

Subscriber Latency - Multi Process

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

Publisher Latency - Long Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

Subscriber Latency - Long Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

Memory Scaling Comparison

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor

RAM Usage - Pub/Sub - Long Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor
rclcpp::SingleThreadedExecutor rclcpp::MultiThreadedExecutor

Not much difference between the two events executors. This appears to expose a slow-climbing memory leak on the client library side, either in both of these executor implementations or in some other part of the code. This leak appears consistent across all RMWs and across all runs of all four executors (single threaded, multi threaded, EventsExecutor, EventsCBGExecutor). Zenoh without intraprocess shows a much sharper increase in the first few minutes.
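
One way to put a number on a slow climb like this is to fit a straight line through the RAM samples and read off the slope. The sketch below is only an illustration of that idea, not how the graphs in this post were produced: the `(elapsed_s, rss_kb)` sample layout and the `growth_rate_kb_per_min` name are assumptions, and the numbers are synthetic rather than benchmark data.

```python
def growth_rate_kb_per_min(samples):
    """Least-squares slope of memory usage over time, in KB per minute.

    samples: list of (elapsed_s, rss_kb) tuples (hypothetical layout).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Ordinary least squares: slope = cov(t, m) / var(t).
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("all samples share the same timestamp")
    return cov / var * 60.0  # convert KB/s to KB/min


# Synthetic data: RSS grows by 120 KB every 60 seconds.
rss = [(0, 50000), (60, 50120), (120, 50240), (180, 50360)]
rate = growth_rate_kb_per_min(rss)
print(f"{rate:.1f} KB/min")
```

Fitting the first and second halves of a run separately would also show whether the growth levels out or keeps climbing.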

RAM Usage - Services/Clients - Long Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor
rclcpp::SingleThreadedExecutor rclcpp::MultiThreadedExecutor

Not much difference across the executors, aside from the multi-threaded executor exhibiting a much higher overall baseline in RAM usage. We again see RAM climbing for all four, but the growth appears to level out about 5 or so minutes into the tests.

RAM Usage - Actions - Long Test (10m)

rclcpp::experimental::EventsExecutor cm_executors::EventsCBGExecutor
rclcpp::SingleThreadedExecutor rclcpp::MultiThreadedExecutor

Both EventsExecutor implementations demonstrate significant memory leaks during the long running actions tests. The multi-threaded executor’s usage pattern looks similar to clients / services. In the SingleThreadedExecutor, rmw_zenoh appears to exhibit leaks unlike the other tested RMWs.


@skye.galaxy

thank you very much for sharing benchmark information :+1:

i do not have any objections about this, just checking. (since it becomes really hard for me to join, cz i got back to Tokyo…)

> the client library working group has been planning to mainstream an EventsExecutor implementation as the new default executor in rclcpp.

what is that supposed to mean by `default`?

basically Executors are the classes that user applications choose to run?

do you guys mean that,

is going to be replaced into `EventsExecutor`?

aside from the performance benchmark, this will be a huge behavior change for users who rely on the current queue management in the rmw implementation. probably we would want to make an announcement about that before the change…

tomoya,

Just to be clear, there are three work items here:

  1. Merging/replacing the current events executor implementation with Janosch’s.
  2. Promoting the events executor from the “experimental” namespace
  3. Switching the default “spin” executor to be the events executor.

For L-turtle, I definitely see 1 and 2 happening. (3) may take more work (which @skye.galaxy is building evidence for) to build enough confidence that the switch is worth it. I think that overall behavior and performance are better, but changing default behavior always warrants some caution.

Do you think it would make sense to have an AIPAC working group? Since working groups are effectively an extension of the PMC, we could set it up as a common time/location to discuss various topics.

This likely relates to this issue: Memory leak in publishing · Issue #796 · ros2/rmw_zenoh · GitHub
It was fixed a few weeks ago: fix: resolve memory leak when publishing with the default allocator by YuanYuYuan · Pull Request #797 · ros2/rmw_zenoh · GitHub
But not released yet.
We intend to bloom rmw_zenoh with this fix for all distros in the upcoming weeks, after bumping the Zenoh version to incorporate other improvements.


Ah, you are probably right and I should have put those pieces together when reading this. Good catch!

@mjcarroll thanks for the clarifications, now i see what is going to happen for next release!

maybe i should bring up a proposal :grin:

tomoya,