ROS2 "State of the Events Executors" - Benchmark comparison between rclcpp::experimental::EventsExecutor and cm_executors::EventsCBGExecutor

skye.galaxy · September 30, 2025, 12:17am

As part of the upcoming ROS2 Lyrical Luth release, the client library working group has been planning to mainstream an EventsExecutor implementation as the new default executor in rclcpp. The current experimental implementation cannot properly handle simulation time. The EventsCBGExecutor implemented by @JM_ROS over at Cellumation can, and it also offers a multithreaded mode. As a first step towards mainstreaming an EventsExecutor implementation, we ran an extensive set of benchmarks built on top of iRobot’s ros2-performance framework. (Keep an eye out, as we are hoping to eventually open-source the full benchmark test suite!)

This post will serve as a deep dive into the performance characteristics of the two executors, as well as a jumping-off point for discussing the overall state of executors (and middleware implementations) in ROS2. (This is a cleaned-up rewrite of a GitHub gist where I originally collected all the benchmark info.)

Some notes about the benchmarks:

- Benchmark environment:
  - upstream ROS rolling docker container running on an x86 developer laptop under minimal load
  - rclcpp rolling: b14af74a4c9b8683e72b15d61d0ed9121d883973
  - cm_executors: 783a5e329ee8b04abfa3b3397532e979576a2b1f
  - ros2_performance: 4528f43410922379b8da501630d9d938046e48e8
- This suite of benchmarks was run at least 3 times per implementation to check that the results were consistent. For brevity’s sake, we’ll stick to one graph each for this analysis, but the full set of results will be made available elsewhere.
- ipc_on = running with intra-process mode
- In the latency tests, max latency is the highest single latency measurement taken for that message size. We didn’t do any outlier filtering on this dataset (aside from dropping the high latencies from the first few seconds), so expect this value to vary noticeably from run to run.
- In the process of producing these benchmarks, we discovered a bug in how the clients/services CPU usage graphs are generated for single- and multi-process runs. Graphs were generated for each, but the underlying data represents just the single-process benchmark, so we’ll only cover single-process clients/services CPU usage.
- There were a few tests we couldn’t run with the EventsCBGExecutor because of freezes or crashes, so those tests were also omitted for the upstream EventsExecutor.
- For a more 1:1 comparison between the two executors, the EventsCBGExecutor was fixed to use just one thread.
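
To make the max-versus-mean distinction from the notes concrete, here is a minimal C++ sketch of how such a summary could be computed. It is not the ros2-performance implementation: the `LatencySample` and `LatencySummary` types, the `summarize` function, and the `warmup_sec` cutoff are all hypothetical names made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sample type: one latency measurement and when it was taken.
struct LatencySample {
  double t_sec;       // seconds since the start of the test
  double latency_us;  // measured latency in microseconds (non-negative)
};

// Hypothetical summary type for one message size.
struct LatencySummary {
  std::size_t count = 0;  // samples kept after the warm-up cut
  double mean_us = 0.0;
  double max_us = 0.0;
};

// Drop samples taken during the warm-up window, then report mean and max.
// No other outlier filtering is applied, so a single spike sets max_us.
LatencySummary summarize(const std::vector<LatencySample>& samples,
                         double warmup_sec) {
  assert(warmup_sec >= 0.0);
  LatencySummary s;
  double sum = 0.0;
  for (const auto& x : samples) {
    if (x.t_sec < warmup_sec) {
      continue;  // ignore start-up latencies
    }
    sum += x.latency_us;
    s.max_us = std::max(s.max_us, x.latency_us);
    ++s.count;
  }
  if (s.count > 0) {
    s.mean_us = sum / static_cast<double>(s.count);
  }
  return s;
}
```

Because the max is decided by one sample while the mean averages over all of them, a single scheduling hiccup on the laptop can move the max a lot while barely touching the mean.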

Takeaways, tl;dr:

Despite some initial concerns about marginally higher CPU usage for the EventsCBGExecutor compared to the experimental EventsExecutor, there doesn’t appear to be much of a difference across the characteristics we tested, with the following exceptions:
- EventsExecutor performed slightly better on the long running pub/sub CPU usage test.
- EventsCBGExecutor performed slightly better on the long running actions CPU usage test.
Both executors demonstrate memory leaks in the longer running pub / sub and actions tests. After further investigation, the SingleThreadedExecutor and MultiThreadedExecutor also show a climb in memory for pub/sub, while actions remain stable for the SingleThreadedExecutor (except for rmw_zenoh).
As we step through the benchmarks, I’ll point out any differences between the executors as they appear.

CPU Usage - Pub/Sub - Single Process

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

The max y axis for the second graph is much higher because CycloneDDS seemingly causes the test to consume far more CPU at higher message payloads with the EventsCBGExecutor; the results are otherwise highly comparable. This difference in CPU for CycloneDDS specifically was consistent across all runs of the benchmark suite.

CPU Usage - Pub Sub - Multi Process

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

Interestingly, in multi-process mode the climb to ~4-5% of a core at larger payload sizes is now consistent for both executors when running CycloneDDS. Otherwise, both executors seem to put up similar results here.

CPU Usage - Services / Clients - Single Process

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

CPU Usage - Pub/Sub - Long Running Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

The usage pattern for both executors appears fairly similar, with the EventsExecutor averaging around 0.05 - 0.1% less CPU usage than EventsCBGExecutor in most runs.

CPU Usage - Services / Clients - Long Running Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

CPU Usage - Actions - Long Running Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

We again see a similar usage pattern between the two executors, but with the EventsCBGExecutor consistently maxing out at ~2% less CPU than the EventsExecutor and with a smoother looking graph.

Publisher Latency - Single Process

[Graphs, two per executor, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

Subscriber Latency - Single Process

[Graphs, two per executor, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

Huge differences in max latency aside, pub and sub latency are comparable between the two executor implementations. The mean comparison shows extremely similar results, including CycloneDDS’s extreme latency increases at higher payload sizes. Those increases appear to set in at slightly smaller payloads with EventsCBGExecutor than with EventsExecutor.

Publisher Latency - Multi Process

[Graphs, two per executor, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

Subscriber Latency - Multi Process

[Graphs, two per executor, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

Publisher Latency - Long Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

Subscriber Latency - Long Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

Memory Scaling Comparison

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]

RAM Usage - Pub/Sub - Long Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]
[Graphs (RSS in KB), left to right: rclcpp::SingleThreadedExecutor, rclcpp::MultiThreadedExecutor]

Not much difference between the two events executors. This appears to expose a slow-climbing memory leak on the client library side, either in both of these executor implementations or in some other part of the code. The leak appears consistently across all RMWs and across all runs of all four executors (single threaded, multi threaded, EventsExecutor, EventsCBGExecutor). Zenoh without intra-process mode shows a much sharper increase in the first few minutes.
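
One rough way to put a number on a slow climb like this is to fit a straight line through the memory samples and look at the slope. The sketch below is not part of the benchmark suite: `ls_slope` is a hypothetical helper, and it assumes the samples are available as parallel vectors of elapsed seconds and RSS in KB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Least-squares slope of y over x, e.g. RSS in KB over elapsed seconds,
// which gives an average growth rate in KB per second.
// Returns 0 when the slope is undefined (fewer than two distinct x values).
double ls_slope(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size());
  const std::size_t n = x.size();
  if (n < 2) {
    return 0.0;
  }
  double mean_x = 0.0;
  double mean_y = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    mean_x += x[i];
    mean_y += y[i];
  }
  mean_x /= static_cast<double>(n);
  mean_y /= static_cast<double>(n);

  double sxy = 0.0;  // sum of (x - mean_x) * (y - mean_y)
  double sxx = 0.0;  // sum of (x - mean_x)^2
  for (std::size_t i = 0; i < n; ++i) {
    const double dx = x[i] - mean_x;
    sxy += dx * (y[i] - mean_y);
    sxx += dx * dx;
  }
  if (sxx == 0.0) {
    return 0.0;  // all samples share one timestamp
  }
  return sxy / sxx;
}
```

A slope close to zero over a 10 minute run would suggest stable usage. A slope that stays positive across repeated runs is consistent with a leak, although it cannot by itself distinguish a leak from other sources of growth such as allocator caching.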

RAM Usage - Services/Clients - Long Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]
[Graphs (RSS in KB), left to right: rclcpp::SingleThreadedExecutor, rclcpp::MultiThreadedExecutor]

Not much difference across the executors, apart from the multi-threaded executor exhibiting a much higher overall baseline in RAM usage. We again see RAM climbing for all four, but the growth appears to level off about 5 or so minutes into the tests.

RAM Usage - Actions - Long Test (10m)

[Graphs, left to right: rclcpp::experimental::EventsExecutor, cm_executors::EventsCBGExecutor]
[Graphs (RSS in KB), left to right: rclcpp::SingleThreadedExecutor, rclcpp::MultiThreadedExecutor]

Both EventsExecutor implementations demonstrate significant memory leaks during the long running actions tests. The multi-threaded executor’s usage pattern looks similar to clients / services. In the SingleThreadedExecutor, rmw_zenoh appears to exhibit leaks unlike the other tested RMWs.

tomoyafujita · September 30, 2025, 5:39am

@skye.galaxy

thank you very much for sharing benchmark information

i do not have any objections about this, just checking. (since it becomes really hard for me to join, cz i got back to Tokyo…)

> the client library working group has been planning to mainstream an EventsExecutor implementation as the new default executor in rclcpp.

what is that supposed to mean by `default`?

basically, Executors are classes that the user application chooses to run, right?

do you guys mean that,

github.com/ros2/rclcpp

rclcpp/src/rclcpp/executors.cpp

aa60fcf22


      
    // (excerpt begins partway through the preceding function)
      options.context = node_ptr->get_context();
      rclcpp::executors::SingleThreadedExecutor exec(options);
      exec.spin_node_some(node_ptr);
    }

    void
    rclcpp::spin(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr node_ptr)
    {
      rclcpp::ExecutorOptions options;
      options.context = node_ptr->get_context();
      rclcpp::executors::SingleThreadedExecutor exec(options);
      exec.add_node(node_ptr);
      exec.spin();
      exec.remove_node(node_ptr);
    }

    void
    rclcpp::spin(rclcpp::Node::SharedPtr node_ptr)
    {
      rclcpp::spin(node_ptr->get_node_base_interface());
    }

is going to be replaced with `EventsExecutor`?

aside from the performance benchmarks, this will be a huge behavior change for users who rely on the current queue management in the rmw implementation. we would probably want to make an announcement about that before the change…

tomoya,

mjcarroll · September 30, 2025, 12:53pm

Just to be clear, there are three work items here:

1. Merging/replacing the current events executor implementation with Janosch’s.
2. Promoting the events executor from the “experimental” namespace.
3. Switching the default “spin” executor to be the events executor.

For L-turtle, I definitely see 1 and 2 happening. (3) may take more work (which @skye.galaxy is building evidence for) to build enough confidence that the switch is worth it. I think that overall behavior and performance is better, but changing default behavior always warrants some caution.

mjcarroll · September 30, 2025, 12:54pm

Do you think it would make sense to have an AIPAC working group? Since working groups are effectively an extension of the PMC, we could set it up as a common time/location to discuss various topics.

JEnoch · September 30, 2025, 2:27pm

This likely relates to this issue: Memory leak in publishing (Issue #796, ros2/rmw_zenoh).
It was fixed a few weeks ago in "fix: resolve memory leak when publishing with the default allocator" by YuanYuYuan (Pull Request #797, ros2/rmw_zenoh), but the fix has not been released yet.
We intend to bloom rmw_zenoh with this fix for all distros in the upcoming weeks, after bumping the Zenoh version to incorporate other improvements.

mjcarroll · September 30, 2025, 2:56pm

Ah, you are probably right and I should have put those pieces together when reading this. Good catch!

tomoyafujita · October 1, 2025, 4:52am

@mjcarroll thanks for the clarifications, now i see what is going to happen for next release!

maybe i should bring up a proposal