Performance Characteristics: Subscription Callback Signatures, RMW Implementations, Intra-process Communication (IPC)

Hi all, I have been working on investigating explicit intra-process communication (IPC) for Nav2 nodes again this week, and I have some findings that I think are worth sharing here.

Minimal Example

To demonstrate the behavior of intra-process communication and subscription callback signatures, I created a minimal example based on the ROS 2 demos and uploaded it in my fork: demos/src/two_node_pipeline/two_node_pipeline.cpp at IPC · mini-1235/demos · GitHub

The setup is simple:

  • 1 publisher node
    • Publishes using a std::unique_ptr
  • 2 consumer nodes
    • Each consumer contains 3 subscriptions to the same topic
  • All nodes run in the same executor

The only variables being changed between runs are:

  1. use_intra_process_comms: enabled / disabled for different nodes
  2. Subscription callback type (SharedPtr vs ConstSharedPtr vs const Msg &)
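As a mental model of why the callback signature matters, here is a plain-C++ sketch. This is a hypothetical model, not rclcpp's actual implementation: when every subscription can take a shared immutable message (the ConstSharedPtr signature), one buffer can be shared with zero copies; when subscriptions need their own mutable message (the SharedPtr signature), n subscribers cost n-1 copies plus one move.

```cpp
#include <memory>

// Toy model (NOT rclcpp's implementation) of intra-process delivery cost.
// A subscription is "const-capable" if its callback takes
// std::shared_ptr<const Msg>; otherwise it is treated as needing its own
// mutable copy (in my measurements, const Msg & currently behaves like
// the mutable case as well).
struct Msg { int value; };

// Returns the number of message copies needed to deliver `msg` to
// `n_subs` subscriptions, of which `n_const` are const-capable.
int copies_for_delivery(std::unique_ptr<Msg> msg, int n_subs, int n_const) {
  if (n_const == n_subs) {
    // Everyone shares one immutable buffer: zero copies.
    std::shared_ptr<const Msg> shared = std::move(msg);
    (void)shared;
    return 0;
  }
  int n_mutable = n_subs - n_const;  // subscriptions wanting ownership
  // One mutable taker can be moved the original buffer; the remaining
  // n_mutable - 1 each need a copy. If const-capable subscriptions also
  // exist alongside them, they need one extra shared copy of the data.
  return (n_mutable - 1) + (n_const > 0 ? 1 : 0);
}
```

Under this model, 3 subscribers all using ConstSharedPtr cost 0 copies, while 3 subscribers using SharedPtr cost 2, which is the n-1 pattern in the results below.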

Result Summary

| Callback Type | Configuration | Copies |
|---|---|---|
| SharedPtr | IPC disabled for all nodes | 1 |
| SharedPtr | IPC enabled for all nodes | n-1 for n subscribers |
| SharedPtr | Producer intra, consumers mixed intra + inter | 1 (inter) + n-1 (intra) |
| SharedPtr | Producer inter, consumers intra | 1 |
| ConstSharedPtr | IPC disabled for all nodes | 1 |
| ConstSharedPtr | IPC enabled for all nodes | 0 |
| ConstSharedPtr | Producer intra, consumers mixed intra + inter | 1 |
| ConstSharedPtr | Producer inter, consumers intra | 1 |
| const Msg & | IPC disabled for all nodes | 1 |
| const Msg & | IPC enabled for all nodes | n-1 for n subscribers |
| const Msg & | Producer intra, consumers mixed intra + inter | 1 (inter) + n-1 (intra) |
| const Msg & | Producer inter, consumers intra | 1 |

As shown in the table above, SharedPtr, ConstSharedPtr, and const Msg & behave differently when intra-process communication is enabled: ConstSharedPtr can achieve zero-copy delivery, whereas SharedPtr and const Msg & may introduce additional copies. It is also worth noting that the SharedPtr signature is actually deprecated; see Deprecated subscription callback signatures are still compiling · Issue #2972 · ros2/rclcpp · GitHub
For reference, I will also put a link to the design document here.

Testing with Nav2

In addition to the minimal example, I also tested different combinations of IPC settings, RMW implementations, and subscription callback types in a full Nav2 setup.

Test Environment

OS: Ubuntu 24.04.3 LTS (x86_64)
Kernel: 6.8.0-88-generic
CPU: Intel i7-12700H (20 cores)
GPU: Intel Iris Xe + NVIDIA RTX 3050 Ti Mobile
Memory: 32 GB

Setup A: Full Simulation (With RViz and Gazebo enabled)

  • launch command: ros2 launch nav2_bringup tb4_simulation_launch.py
  • Baseline measurements correspond to the latest rolling commit at the time of testing.
  • The CPU Usage reported here refers specifically to the nav2_container process

CPU Usage Result

| Middleware | Configuration | CPU Usage (%) |
|---|---|---|
| Zenoh | Baseline | 4.5 |
| Zenoh | IPC enabled (SharedPtr used in most of the stack) | 6.0 |
| Zenoh | IPC + ConstSharedPtr | 5.8 |
| CycloneDDS | Baseline | 18.3 |
| CycloneDDS | IPC enabled (SharedPtr used in most of the stack) | 17.8 |
| CycloneDDS | IPC + ConstSharedPtr | 16.1 |
| FastDDS | Baseline | 6.8 |
| FastDDS | IPC enabled (SharedPtr used in most of the stack) | 7.8 |
| FastDDS | IPC + ConstSharedPtr | 7.8 |

As shown in the table, enabling intra-process communication leads to increased CPU usage for FastDDS and Zenoh, while CycloneDDS shows a reduction in CPU usage.

Setup B: Loopback Simulation (With RViz and Gazebo disabled)

  • launch command: ros2 launch nav2_bringup tb4_loopback_simulation_launch.py use_rviz:=false
  • Baseline measurements correspond to the latest rolling commit at the time of testing
  • The CPU Usage reported here refers specifically to the nav2_container process

CPU Usage Result

| Middleware | Configuration | CPU Usage (%) |
|---|---|---|
| Zenoh | Baseline | 0.8 |
| Zenoh | IPC enabled (SharedPtr used in most of the stack) | 0.8 |
| Zenoh | IPC + ConstSharedPtr | 0.8 |
| CycloneDDS | Baseline | 1.6 |
| CycloneDDS | IPC enabled (SharedPtr used in most of the stack) | 1.7 |
| CycloneDDS | IPC + ConstSharedPtr | 1.7 |
| FastDDS | Baseline | 1.2 |
| FastDDS | IPC enabled (SharedPtr used in most of the stack) | 1.2 |
| FastDDS | IPC + ConstSharedPtr | 1.2 |

Under the loopback simulation with RViz and Gazebo disabled, CPU usage remains effectively unchanged across different IPC and subscription callback settings for all three middleware implementations.

Discussion

Based on the first experiments above, I expected some performance gains when enabling intra-process communication and using ConstSharedPtr. However, this expectation was not met when testing in a full Nav2 setup; in the full simulation setup we even observed a performance drop with FastDDS and Zenoh.

Therefore, I would like to invite community members as well as the ros2 team to share their experiences and insights on these results.


Hi, very nice analysis work. One small question arises when reading the quoted statement: what exactly do you mean by "performance gains"? CPU usage reduction? From a latency perspective, using intra-process with zero copy should indeed be effective.

About the RViz enabled/disabled comparison, one neat trick is to compose RViz so that it can subscribe in zero-copy mode. I still have to check the subscription signatures used in the default plugins, but one can then also load custom plugins optimized for intra-process comms, or even better, with Type Adaptation support.

Hi!

Yes, exactly, I was referring to reduced CPU usage

I think this is still a WIP in rviz(?) https://github.com/ros2/rviz/pull/1140

I think what I want to discuss in this post can be summarized as follows:
Can we expect CPU usage reduction from enabling IPC? If yes, what are the required conditions?

  • RMW Implementation?
  • Subscription callback signatures?
  • Fully encapsulated setup? (i.e. no inter-process subscriptions such as rviz or rosbag)

Based on my tests in Nav2, I observe a CPU usage reduction when using CycloneDDS in Setup A, with an additional reduction when switching all callbacks to ConstSharedPtr. However, I do not see the same pattern when using other RMWs.

Well, I ended up statically composing RViz2 in the same process space on some of my setups. This PR is more related to having component semantics for RViz (in particular, it proposes to embed a component container inside the RViz2 main executable).


Hey @mini-1235, I haven’t forgotten about this. I took a little time to unplug and spend time with the family over the holidays and I will give this a look in the next few days.

Thanks for all of your work characterizing the behaviors and the excellent report!


Thanks for the update! No worries at all :smile:

Thanks for collecting the data and starting the conversation.

Some cursory thoughts:

  • As mentioned above, consider comparing latency numbers too as that’s where you’ll see a lot of the improvements, though if you don’t care about the latency as much then I suppose it may not matter, but it might help you confirm something else weird isn’t happening.
  • How did you collect the CPU utilization? Did you wait for some kind of steady state and then average over a window of time? Did you run the experiment multiple times for each case? The differences in CPU utilization are close enough that it's not clear to me whether they are within the noise of the measurements or not (no error bars). Is this on a soft/hard real-time OS or just an OS with a standard scheduler, etc.? Put another way, how have you convinced yourself that you're measuring the intra-process vs inter-process setting and not something else?
  • Related to clarifying how you’re collecting CPU statistics, looking at call graphs or flame charts or something before and after intra-process is turned on would be interesting, because it may be possible that inter-process comms (and copying messages) are not a significant source of CPU utilization in the first place.
  • How did you determine the number of copies? Just code review or did you instrument it somehow?
  • I would need to look again, but it may be that intra-process is entirely “extra” work on top of inter-process, i.e. if we never avoid sending messages to the rmw layer for inter-process comms (e.g. when there’s no non-intra-process subscribers), then the intra-process comms is just more work on top of inter-process comms which always occurs. Ideally, the rmw layers would use less resources when you publish to them with no inter-process subscriptions present, but I’m not 100% sure about that either (this could be impacted by what QoS setting you’re using too, e.g. durability transient-local requires the inter-process to do most of the work when publishing, even when there are currently no subscriptions, due to needing to keep stuff around for late joining subscriptions).
    • Follow up, I looked at the code and there’s definitely still logic to avoid inter-process when publishing under certain conditions (no inter-process subscriptions and not using transient-local durability), but I never got a chance to properly look at this pull request that added support for transient-local durability in intra-process ( Fix transient local IPC publish by jefferyyjhsu · Pull Request #2708 · ros2/rclcpp · GitHub ), so that might be something to have a look at, if we’re getting weird performance from intra-process.
  • When talking about this stuff, I’d recommend explicitly using the phrase “intra-process comms” rather than “IPC”, as everywhere else on the internet IPC means inter-process comms (see Inter-process communication - Wikipedia ), which is sure to be a source of confusion. If you must abbreviate it, use something like “intra-”, “intra-pc”, etc. In my experience, people constantly confuse inter-process and intra-process even in the context of ROS and because of that I always just type out intra-process and inter-process explicitly. It’s worth the effort imo.

I took a look at the example you used for the test. You seem to use Int32 as the message type. As the cost of copying this is basically zero, I would not expect any difference using zero copy.

Try using a big message (> 10 MB) then you should see a difference in CPU load.

Edit:

@wjwwood pointed out that the test was performed using the Nav2 stack. This leads to the question of whether the nav stack actually sends "big" messages at a high frequency. Judging from the results, this seems not to be the case.


Hi @JM_ROS @wjwwood, thanks for your reply. It’s been a pretty busy week for me, but I finally had some time to think through / test the suggestions you mentioned.

Sure, that makes sense. I am trying to update the post to replace "IPC" with "intra-process comms", but I am unable to find an edit button. If editing is still possible, please let me know and I'll be happy to update it.

Inspired by the demo in ros2_documentation, I am printing the address in each subscription callback.
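For concreteness, the address trick can be illustrated in plain C++ (a toy sketch, no rclcpp involved; the function names are made up for illustration): when delivery is zero-copy, every callback observes the same object address, while a copy shows up as a distinct address.

```cpp
#include <memory>

// Toy illustration (no rclcpp) of counting copies by comparing addresses:
// zero-copy delivery hands every callback the same buffer, so the
// addresses printed in each callback compare equal; a copy does not.
struct Msg { int value; };

// Shared (zero-copy) delivery: both "callbacks" observe the same address.
inline bool shared_delivery_same_address(const std::shared_ptr<const Msg> & msg) {
  const Msg * seen_by_first = msg.get();   // what callback 1 would print
  const Msg * seen_by_second = msg.get();  // what callback 2 would print
  return seen_by_first == seen_by_second;
}

// Copying delivery: the second "callback" receives its own buffer.
inline bool copied_delivery_same_address(const std::shared_ptr<const Msg> & msg) {
  auto copy = std::make_shared<const Msg>(*msg);  // simulated per-subscriber copy
  return copy.get() == msg.get();
}
```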

Take const std_msgs::msg::Int32 & msg as an example. It enters the condition here: rclcpp/rclcpp/include/rclcpp/any_subscription_callback.hpp at c85ff926d2ea6c3fb8b8075e28b93632a791fd5c · ros2/rclcpp · GitHub
Adding

    std::holds_alternative<ConstRefROSMessageCallback>(callback_variant_) ||
    std::holds_alternative<ConstRefWithInfoROSMessageCallback>(callback_variant_)

to use_take_shared_method() seems to match the behavior of std::shared_ptr<const T>.

To my knowledge, the largest messages in the Nav2 stack are the global and local costmaps. With the default settings, the global costmap is published at 1 Hz and the local costmap at 2 Hz, and their sizes are on the order of a few hundred kilobytes. We are also subscribing to sensor data from Gazebo. I checked those as well, they are published at around 10 Hz with message sizes on the order of tens of kilobytes.

I tried writing a quick demo that publishes larger messages, and I can clearly see a non-trivial reduction in CPU usage. When switching the QoS to transient local, I do not observe any difference between inter-process and intra-process communication. I think this also explains the results in Setup B, since the costmap topics are published with transient local durability.

Yes. I usually wait about 1–2 minutes for the system to reach a steady state and then compute the average CPU utilization. When preparing the report, I also repeated each experiment multiple times, and on different days. The error bars are approximately ±0.3%, and the experiments were conducted on an OS with a standard scheduler.

Finally, to help explain the behavior I observed in setup A and to move this discussion forward, I also collected some flamegraphs. I have uploaded the SVG files here for convenience: https://drive.google.com/drive/folders/1A75ddUFnMZlh8DB0X-_BsRWAM-ovSJ-i?usp=sharing

zenoh (Intra-process comms off)


zenoh (Intra-process comms on)


Differential flamegraphs (./diff off on)

From the flamegraphs and perf top, it seems to be related to rclcpp::experimental::IntraProcessManager::matches_any_publishers() (?)

cyclone (Intra-process comms off)

cyclone (Intra-process comms on)


Differential flamegraphs (./diff off on)

Similar to the zenoh results, this seems to point to the rclcpp::experimental::IntraProcessManager::matches_any_publishers() as well.

I hope this helps move the discussion forward. Looking forward to your thoughts!


`matches_any_publishers()` seems to be called on the subscription side to check whether a message is going to be delivered intra-process, and thus circumvent the "double delivery" issue. It has a for loop inside, and various "locks"…

Fast link for whoever is curious enough: rclcpp repo (snippet).

It would be interesting to understand what the hottest line inside that function is. For sure, on large systems, running such a for loop for each and every message in every subscriber may be a lot of work. On the other hand, I understand why it is needed and don't know how that piece of code could be done better.

Maybe a cache mechanism with a flag that is invalidated only at graph change (i.e. pubs added/removed from the graph)?


We could do an obvious and fast optimization: on a call to add_publisher, save the gid to a new gid vector, and remove it from the vector on a call to remove_publisher. This would get rid of all the weak pointer lock calls, which are somewhat expensive.

publishers_ is also a map, which is not the most cache-friendly structure to iterate, so using a vector here would give an additional speedup.
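A rough sketch of that idea (hypothetical names and types, not the actual rclcpp code): the gid vector is maintained only on graph changes, so the hot path becomes a linear scan over contiguous memory with no weak_ptr locking.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical cache of publisher gids, updated only when publishers are
// added or removed, so matches_any_publishers() avoids iterating a map of
// weak_ptrs and locking each entry on every message.
using Gid = std::array<uint8_t, 16>;

class PublisherGidCache {
public:
  // Called on graph change: record the new publisher's gid.
  void add_publisher(const Gid & gid) { gids_.push_back(gid); }

  // Called on graph change: erase the gid (erase-remove idiom).
  void remove_publisher(const Gid & gid) {
    gids_.erase(std::remove(gids_.begin(), gids_.end(), gid), gids_.end());
  }

  // Hot path: a linear scan over a contiguous vector, no weak_ptr::lock().
  bool matches_any_publishers(const Gid & gid) const {
    return std::find(gids_.begin(), gids_.end(), gid) != gids_.end();
  }

private:
  std::vector<Gid> gids_;
};
```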

2 Likes