Performance Characteristics: Subscription Callback Signatures, RMW Implementations, Intra-process Communication (IPC)

Hi all, I have been working on investigating explicit intra-process communication (IPC) for Nav2 nodes again this week, and I have some findings that I think are worth sharing here.

Minimal Example

To demonstrate the behavior of intra-process communication and subscription callback signatures, I created a minimal example based on the ROS 2 demos and uploaded it in my fork: demos/src/two_node_pipeline/two_node_pipeline.cpp at IPC · mini-1235/demos · GitHub

The setup is simple:

  • 1 publisher node
    • Publishes using a std::unique_ptr
  • 2 consumer nodes
    • Each consumer contains 3 subscriptions to the same topic
  • All nodes run in the same executor

The only variables being changed between runs are:

  1. use_intra_process_comms: enabled / disabled for different nodes
  2. Subscription callback type (SharedPtr vs ConstSharedPtr vs const Msg &)
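As a mental model of why the callback signature matters, here is a plain-C++ sketch. This is a hypothetical model, not rclcpp's actual implementation: when every subscription can take a shared immutable message (the ConstSharedPtr signature), one buffer can be shared with zero copies; when subscriptions need their own mutable message (the SharedPtr signature), n subscribers cost n-1 copies plus one move.

```cpp
#include <memory>

// Toy model (NOT rclcpp's implementation) of intra-process delivery cost.
// A subscription is "const-capable" if its callback takes
// std::shared_ptr<const Msg>; otherwise it is treated as needing its own
// mutable copy (in my measurements, const Msg & currently behaves like
// the mutable case as well).
struct Msg { int value; };

// Returns the number of message copies needed to deliver `msg` to
// `n_subs` subscriptions, of which `n_const` are const-capable.
int copies_for_delivery(std::unique_ptr<Msg> msg, int n_subs, int n_const) {
  if (n_const == n_subs) {
    // Everyone shares one immutable buffer: zero copies.
    std::shared_ptr<const Msg> shared = std::move(msg);
    (void)shared;
    return 0;
  }
  int n_mutable = n_subs - n_const;  // subscriptions wanting ownership
  // One mutable taker can be moved the original buffer; the remaining
  // n_mutable - 1 each need a copy. If const-capable subscriptions also
  // exist alongside them, they need one extra shared copy of the data.
  return (n_mutable - 1) + (n_const > 0 ? 1 : 0);
}
```

Under this model, 3 subscribers all using ConstSharedPtr cost 0 copies, while 3 subscribers using SharedPtr cost 2, which is the n-1 pattern in the results below.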

Result Summary

| Callback Type | Configuration | Copies |
|---|---|---|
| SharedPtr | IPC disabled for all nodes | 1 |
| SharedPtr | IPC enabled for all nodes | n-1 for n subscribers |
| SharedPtr | Producer intra, consumers mixed intra + inter | 1 (inter) + n-1 (intra) |
| SharedPtr | Producer inter, consumers intra | 1 |
| ConstSharedPtr | IPC disabled for all nodes | 1 |
| ConstSharedPtr | IPC enabled for all nodes | 0 |
| ConstSharedPtr | Producer intra, consumers mixed intra + inter | 1 |
| ConstSharedPtr | Producer inter, consumers intra | 1 |
| const Msg & | IPC disabled for all nodes | 1 |
| const Msg & | IPC enabled for all nodes | n-1 for n subscribers |
| const Msg & | Producer intra, consumers mixed intra + inter | 1 (inter) + n-1 (intra) |
| const Msg & | Producer inter, consumers intra | 1 |

As shown in the table above, SharedPtr, ConstSharedPtr, and const Msg & behave differently when intra-process communication is enabled: ConstSharedPtr can achieve zero-copy delivery, whereas SharedPtr and const Msg & may introduce additional copies. It is also worth noting that the SharedPtr signature is actually deprecated; see Deprecated subscription callback signatures are still compiling · Issue #2972 · ros2/rclcpp · GitHub
For reference, I will also put a link to the design document here.

Testing with Nav2

In addition to the minimal example, I also tested different combinations of IPC settings, RMW implementations, and subscription callback types in a full Nav2 setup.

Test Environment

OS: Ubuntu 24.04.3 LTS (x86_64)
Kernel: 6.8.0-88-generic
CPU: Intel i7-12700H (20 cores)
GPU: Intel Iris Xe + NVIDIA RTX 3050 Ti Mobile
Memory: 32 GB

Setup A: Full Simulation (With RViz and Gazebo enabled)

  • launch command: ros2 launch nav2_bringup tb4_simulation_launch.py
  • Baseline measurements correspond to the latest rolling commit at the time of testing.
  • The CPU Usage reported here refers specifically to the nav2_container process

CPU Usage Result

| Middleware | Configuration | CPU Usage (%) |
|---|---|---|
| Zenoh | Baseline | 4.5 |
| Zenoh | IPC enabled (SharedPtr used in most of the stack) | 6.0 |
| Zenoh | IPC + ConstSharedPtr | 5.8 |
| CycloneDDS | Baseline | 18.3 |
| CycloneDDS | IPC enabled (SharedPtr used in most of the stack) | 17.8 |
| CycloneDDS | IPC + ConstSharedPtr | 16.1 |
| FastDDS | Baseline | 6.8 |
| FastDDS | IPC enabled (SharedPtr used in most of the stack) | 7.8 |
| FastDDS | IPC + ConstSharedPtr | 7.8 |

As shown in the table, enabling intra-process communication leads to increased CPU usage for FastDDS and Zenoh, while CycloneDDS shows a reduction in CPU usage.

Setup B: Loopback Simulation (With RViz and Gazebo disabled)

  • launch command: ros2 launch nav2_bringup tb4_loopback_simulation_launch.py use_rviz:=false
  • Baseline measurements correspond to the latest rolling commit at the time of testing
  • The CPU Usage reported here refers specifically to the nav2_container process

CPU Usage Result

| Middleware | Configuration | CPU Usage (%) |
|---|---|---|
| Zenoh | Baseline | 0.8 |
| Zenoh | IPC enabled (SharedPtr used in most of the stack) | 0.8 |
| Zenoh | IPC + ConstSharedPtr | 0.8 |
| CycloneDDS | Baseline | 1.6 |
| CycloneDDS | IPC enabled (SharedPtr used in most of the stack) | 1.7 |
| CycloneDDS | IPC + ConstSharedPtr | 1.7 |
| FastDDS | Baseline | 1.2 |
| FastDDS | IPC enabled (SharedPtr used in most of the stack) | 1.2 |
| FastDDS | IPC + ConstSharedPtr | 1.2 |

Under the loopback simulation with RViz and Gazebo disabled, CPU usage remains effectively unchanged across different IPC and subscription callback settings for all three middleware implementations.

Discussion

Based on the first experiments above, I expected some performance gains when enabling intra-process communication and using ConstSharedPtr. However, this expectation was not met when testing in a full Nav2 setup; in the full simulation setup we even observed a performance drop with FastDDS and Zenoh.

Therefore, I would like to invite community members as well as the ros2 team to share their experiences and insights on these results.


Hi, very nice analysis work. One small question arises when reading the quoted statement: what exactly do you mean by "performance gains"? CPU usage reduction? From a latency perspective, using intra-process with zero copy should indeed be effective.

About the RViz enabled/disabled comparison, one neat trick is to compose RViz so that it can subscribe in zero-copy mode. I still have to check the subscription signatures used in the default plugins, but one can then also load custom plugins optimized for intra-process comms, or even better, with Type Adaptation support.

Hi!

Yes, exactly, I was referring to reduced CPU usage

I think this is still a WIP in rviz(?) https://github.com/ros2/rviz/pull/1140

I think what I want to discuss in this post can be summarized as follows:
Can we expect CPU usage reduction from enabling IPC? If yes, what are the required conditions?

  • RMW Implementation?
  • Subscription callback signatures?
  • Fully encapsulated setup? (i.e. no inter-process subscriptions such as rviz or rosbag)

Based on my tests in Nav2, I observe a CPU usage reduction when using CycloneDDS in Setup A, with an additional reduction when switching all callbacks to ConstSharedPtr. However, I do not see the same pattern when using other RMWs.

Well, I ended up statically composing RViz2 in the same process space on some of my setups. This PR is more related to having component semantics for RViz (in particular, it proposes to embed a component container inside the RViz2 main executable).


Hey @mini-1235, I haven’t forgotten about this. I took a little time to unplug and spend time with the family over the holidays and I will give this a look in the next few days.

Thanks for all of your work characterizing the behaviors and the excellent report!


Thanks for the update! No worries at all :smile:

Thanks for collecting the data and starting the conversation.

Some cursory thoughts:

  • As mentioned above, consider comparing latency numbers too as that’s where you’ll see a lot of the improvements, though if you don’t care about the latency as much then I suppose it may not matter, but it might help you confirm something else weird isn’t happening.
  • How did you collect the CPU utilization? Did you wait for some kind of steady state and then average over a window of time? Did you run the experiment multiple times for each case? The differences in CPU utilization are close enough that it's not clear to me whether they are within the noise of the measurements or not (no error bars). Is this on a soft/hard real-time OS or just an OS with a standard scheduler, etc.? Put another way, how have you convinced yourself that you're measuring the intra-process vs inter-process setting and not something else?
  • Related to clarifying how you’re collecting CPU statistics, looking at call graphs or flame charts or something before and after intra-process is turned on would be interesting, because it may be possible that inter-process comms (and copying messages) are not a significant source of CPU utilization in the first place.
  • How did you determine the number of copies? Just code review or did you instrument it somehow?
  • I would need to look again, but it may be that intra-process is entirely “extra” work on top of inter-process, i.e. if we never avoid sending messages to the rmw layer for inter-process comms (e.g. when there’s no non-intra-process subscribers), then the intra-process comms is just more work on top of inter-process comms which always occurs. Ideally, the rmw layers would use less resources when you publish to them with no inter-process subscriptions present, but I’m not 100% sure about that either (this could be impacted by what QoS setting you’re using too, e.g. durability transient-local requires the inter-process to do most of the work when publishing, even when there are currently no subscriptions, due to needing to keep stuff around for late joining subscriptions).
    • Follow up, I looked at the code and there’s definitely still logic to avoid inter-process when publishing under certain conditions (no inter-process subscriptions and not using transient-local durability), but I never got a chance to properly look at this pull request that added support for transient-local durability in intra-process ( Fix transient local IPC publish by jefferyyjhsu · Pull Request #2708 · ros2/rclcpp · GitHub ), so that might be something to have a look at, if we’re getting weird performance from intra-process.
  • When talking about this stuff, I’d recommend explicitly using the phrase “intra-process comms” rather than “IPC”, as everywhere else on the internet IPC means inter-process comms (see Inter-process communication - Wikipedia ), which is sure to be a source of confusion. If you must abbreviate it, use something like “intra-”, “intra-pc”, etc. In my experience, people constantly confuse inter-process and intra-process even in the context of ROS and because of that I always just type out intra-process and inter-process explicitly. It’s worth the effort imo.

I took a look at the example you used for the test. You seem to use Int32 as the message type. As the cost of copying this is basically zero, I would not expect any difference using zero copy.

Try using a big message (> 10 MB) then you should see a difference in CPU load.

Edit:

@wjwwood pointed out that the test was performed using the Nav2 stack. This leads to the question of whether the nav stack actually sends "big" messages at a high frequency. Judging from the results, this seems not to be the case.


Hi @JM_ROS @wjwwood, thanks for your reply. It’s been a pretty busy week for me, but I finally had some time to think through / test the suggestions you mentioned.

Sure, that makes sense. I am trying to update the post to replace "IPC" with "intra-process comms", but I am unable to find an edit button. If editing is still possible, please let me know and I'll be happy to update it.

Inspired by the demo in ros2_documentation, I am printing the address in each subscription callback.
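For concreteness, the address trick can be illustrated in plain C++ (a toy sketch, no rclcpp involved; the function names are made up for illustration): when delivery is zero-copy, every callback observes the same object address, while a copy shows up as a distinct address.

```cpp
#include <memory>

// Toy illustration (no rclcpp) of counting copies by comparing addresses:
// zero-copy delivery hands every callback the same buffer, so the
// addresses printed in each callback compare equal; a copy does not.
struct Msg { int value; };

// Shared (zero-copy) delivery: both "callbacks" observe the same address.
inline bool shared_delivery_same_address(const std::shared_ptr<const Msg> & msg) {
  const Msg * seen_by_first = msg.get();   // what callback 1 would print
  const Msg * seen_by_second = msg.get();  // what callback 2 would print
  return seen_by_first == seen_by_second;
}

// Copying delivery: the second "callback" receives its own buffer.
inline bool copied_delivery_same_address(const std::shared_ptr<const Msg> & msg) {
  auto copy = std::make_shared<const Msg>(*msg);  // simulated per-subscriber copy
  return copy.get() == msg.get();
}
```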

Take const std_msgs::msg::Int32 & msg as an example. It enters the condition here: rclcpp/rclcpp/include/rclcpp/any_subscription_callback.hpp at c85ff926d2ea6c3fb8b8075e28b93632a791fd5c · ros2/rclcpp · GitHub
Adding

    std::holds_alternative<ConstRefROSMessageCallback>(callback_variant_) ||
    std::holds_alternative<ConstRefWithInfoROSMessageCallback>(callback_variant_)

to use_take_shared_method() seems to match the behavior of std::shared_ptr<const T>.

To my knowledge, the largest messages in the Nav2 stack are the global and local costmaps. With the default settings, the global costmap is published at 1 Hz and the local costmap at 2 Hz, and their sizes are on the order of a few hundred kilobytes. We are also subscribing to sensor data from Gazebo. I checked those as well, they are published at around 10 Hz with message sizes on the order of tens of kilobytes.

I tried writing a quick demo that publishes larger messages, and I can clearly see a non-trivial reduction in CPU usage. When switching the QoS to transient local, I do not observe any difference between inter-process and intra-process communication. I think this also explains the results in Setup B, since the costmap topics are published with transient local durability.

Yes. I usually wait about 1–2 minutes for the system to reach a steady state and then compute the average CPU utilization. When preparing the report, I also repeated each experiment multiple times, and on different days. The error bars are approximately ±0.3%, and the experiments were conducted on an OS with a standard scheduler.

Finally, to help explain the behavior I observed in setup A and to move this discussion forward, I also collected some flamegraphs. I have uploaded the SVG files here for convenience: https://drive.google.com/drive/folders/1A75ddUFnMZlh8DB0X-_BsRWAM-ovSJ-i?usp=sharing

zenoh (Intra-process comms off)


zenoh (Intra-process comms on)


Differential flamegraphs (./diff off on)

From the flamegraphs and perf top, it seems to be related to rclcpp::experimental::IntraProcessManager::matches_any_publishers() (?)

cyclone (Intra-process comms off)

cyclone (Intra-process comms on)


Differential flamegraphs (./diff off on)

Similar to the zenoh results, this seems to point to the rclcpp::experimental::IntraProcessManager::matches_any_publishers() as well.

I hope this helps move the discussion forward. Looking forward to your thoughts!


`matches_any_publishers()` seems to be called on the subscription side to check whether a message is going to be delivered intra-process, and thus circumvent the "double delivery" issue. It has a for loop inside, and various "locks"…

Fast link for whoever is curious enough: rclcpp repo (snippet).

It would be interesting to understand what the hottest line inside that function is. For sure, on large systems, running such a for loop for each and every message in every subscriber may be a lot of work. On the other hand, I understand why it is needed and don't know how that piece of code could be done better.

Maybe a cache mechanism with a flag that is invalidated only at graph change (i.e. pubs added/removed from the graph)?


We could do an obvious and fast optimization: on a call to add_publisher, save the gid to a new gid vector, and remove it from the vector on a call to remove_publisher. This would get rid of all the weak pointer lock calls, which are somewhat expensive.

publishers_ is also a map, which is not the most cache-friendly structure to iterate, so using a vector here would give an additional speedup.
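A rough sketch of that idea (hypothetical names and types, not the actual rclcpp code): the gid vector is maintained only on graph changes, so the hot path becomes a linear scan over contiguous memory with no weak_ptr locking.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical cache of publisher gids, updated only when publishers are
// added or removed, so matches_any_publishers() avoids iterating a map of
// weak_ptrs and locking each entry on every message.
using Gid = std::array<uint8_t, 16>;

class PublisherGidCache {
public:
  // Called on graph change: record the new publisher's gid.
  void add_publisher(const Gid & gid) { gids_.push_back(gid); }

  // Called on graph change: erase the gid (erase-remove idiom).
  void remove_publisher(const Gid & gid) {
    gids_.erase(std::remove(gids_.begin(), gids_.end(), gid), gids_.end());
  }

  // Hot path: a linear scan over a contiguous vector, no weak_ptr::lock().
  bool matches_any_publishers(const Gid & gid) const {
    return std::find(gids_.begin(), gids_.end(), gid) != gids_.end();
  }

private:
  std::vector<Gid> gids_;
};
```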

2 Likes