Agnocast & Callback Isolated Executor: True Zero-Copy IPC and Middleware-Transparent Scheduling for ROS 2

We are excited to introduce Agnocast and the Callback Isolated Executor — two technologies that dramatically improve the performance of ROS 2 systems through true zero-copy IPC and middleware-transparent scheduling. Both are now being fully adopted across Autoware, the open-source autonomous driving platform powering a wide range of commercial deployments, including robotaxis, autonomous buses, and cargo vehicles worldwide. For details, please refer to:

Agnocast: True Zero-Copy IPC for All Message Types

Agnocast is an rclcpp-compatible true zero-copy IPC middleware for ROS 2 that supports all ROS message types, including message structs already generated by rosidl. Existing solutions support only static-sized messages, falling back to serialization and copying for unsized messages like PointCloud2. Agnocast removes this limitation with no serialization overhead. The API is rclcpp-compatible, so migrating a topic typically requires changing only a few lines per node.

Callback Isolated Executor: Middleware-Transparent Scheduling

Callback Isolated Executor (CIE) is a drop-in replacement for the standard ROS 2 executor that establishes a persistent one-to-one mapping between each callback and a dedicated OS thread. Standard executors multiplex callbacks onto shared threads above the OS scheduler, creating a nested scheduling problem that prevents applying OS-level guarantees to individual callbacks. CIE removes this middleware layer, enabling per-callback OS scheduling control for both performance and safety.

Cooperation between Agnocast & CIE

Agnocast and CIE can be used independently, but we recommend combining them to maximize performance. Because Agnocast uses its own messaging mechanism, it requires a dedicated executor; this functionality is included in the CIE shipped with Agnocast. Of course, we also offer standalone CIE packages.

Proven in Autonomous Vehicle Deployments

We demonstrate the impact of applying Agnocast and CIE to a large-scale ROS 2 autonomous driving system below.

Agnocast
The figure below plots communication latency from message publication to subscriber callback invocation across different message sizes for Agnocast and the compared IPC mechanisms. Agnocast maintains consistent communication latency regardless of message size, even for unsized message types like PointCloud2.

CIE
The figure below shows how much the response times of five critical paths improved when optimized with CIE, compared with running under Linux’s default scheduler, CFS, without any special tuning (gray line). These results were obtained from measurements taken while the robotaxi was operating on public roads, with the horizontal axis representing elapsed time in seconds. The measured worst-case response time improved by about 5x, demonstrating a significant enhancement in real-time performance.

Get Started

All packages have been submitted to the official ROS 2 Build Farm and are awaiting distribution. Until then, the current distribution model is as follows:

The following components are available through our PPA.

sudo apt install agnocast-heaphook-v2.2.0 agnocast-kmod-v2.2.0

The core Agnocast library (including CIE) currently requires a source build.

For detailed setup instructions and sample applications, please refer to the README in the repository.

Learn More

Agnocast

Callback Isolated Executor


Thanks for sharing! This looks very promising!
I have mainly one question about CPU load comparison between middlewares. Given the same conditions, how is the CPU consumption of the nodes affected (or improved) with Agnocast compared to other implementations? One could expect that it improves since zero copy implies fewer operations, but since CIE has one real thread per callback, that could increase the CPU usage (?) Do you have any numbers on that?


@AtsushiYano Can you please explain a few things for me?

  1. Agnocast is a separate communications system, not going through the RMW layer. This makes it incompatible with all the existing ROS tooling, including tools such as rosbag2 and rviz, and also means it won’t benefit from improvements to the RMW layer - such as the upcoming memory architecture agnostic improvements that will allow ROS messages to be views onto GPU buffers. Why did you ignore the RMW layer?
  2. You state that CIE “removes this middleware layer”. Does this mean that it is removing the ROS executor infrastructure and effectively allowing nodes to execute directly in OS processes, rather than through the executor abstraction?
  3. Your LinkedIn post says that “It has been officially decided that these middleware will be adopted across the entire Autoware project, replacing the existing ROS 2 stack.” Does this mean that Autoware is no longer using ROS?

@gbiggs
Thank you for your thoughtful questions and for taking the time to engage with our work!

ROS 2 enables the easy construction of robotics systems across a very wide range of environments. We position Agnocast as a library for users who want to further push performance tuning to its limits in specific environments. In other words, users who encounter performance bottlenecks in systems built on the existing ROS 2 stack now have the option to migrate part or all of their middleware stack to Agnocast. Since the user interface is designed to closely mirror rclcpp, migration requires minimal effort. At this stage, Agnocast only supports C++ clients on Linux, though we may expand platform support in the future.

  1. Agnocast is a separate communications system, not going through the RMW layer. This makes it incompatible with all the existing ROS tooling, including tools such as rosbag2 and rviz, and also means it won’t benefit from improvements to the RMW layer - such as the upcoming memory architecture agnostic improvements that will allow ROS messages to be views onto GPU buffers. Why did you ignore the RMW layer?

For details on why Agnocast is not integrated as an RMW implementation, please refer to this. Compatibility with the existing ROS 2 ecosystem is ensured through the Bridge feature. This feature mediates between Agnocast’s communication world and the RMW-based communication world, enabling seamless interoperation between them. From the user’s perspective, the experience remains the same as working with a standard ROS 2 system — users can transparently interact with ROS 2 applications that do not use Agnocast, including tools such as rviz and rosbag, without any additional effort.

  2. You state that CIE “removes this middleware layer”. Does this mean that it is removing the ROS executor infrastructure and effectively allowing nodes to execute directly in OS processes, rather than through the executor abstraction?

This statement does not imply the removal of the ROS executor infrastructure. Rather, it refers to eliminating scheduling behavior from the middleware layer in order to resolve issues related to nested scheduling. Specifically, CIE achieves semantic equivalence between OS-level scheduling and executor-level scheduling by constraining the executor to manage only a single callback. For the most accessible explanation of the nested scheduling problem and how CIE addresses it, please refer to our paper.

  3. Your LinkedIn post says that “It has been officially decided that these middleware will be adopted across the entire Autoware project, replacing the existing ROS 2 stack.” Does this mean that Autoware is no longer using ROS?

To be precise, the announcement means that Autoware will provide users with the option to fully adopt Agnocast, rather than forcing an immediate migration. As described in the autoware_agnocast_wrapper README, building Autoware with the default build options continues to produce a system based on the conventional ROS 2 stack. By enabling the Agnocast build option, users can build Autoware with Agnocast as its communication foundation. TIER IV, the primary contributor to Autoware, has decided to build its autonomous driving systems with the Agnocast-enabled option going forward. It is also important to note that even with the Agnocast-enabled build, Autoware continues to depend on the existing RMW communication stack for inter-host communication and for interacting with tools such as rviz and rosbag that rely on RMW-based communication.


@charlielito

Thanks for the great question!

RMW/DDS layer’s CPU overhead
In typical ROS 2 applications, the RMW layer (in our environment, CycloneDDS) consumes a non-trivial amount of CPU. Most significantly, communication goes through serialization/deserialization and multiple data copies via socket buffers even for intra-host scenarios. Additionally, the DDS implementation spawns multiple background threads that are periodically active for protocol maintenance — heartbeat transmission, receive polling, garbage collection, thread liveness monitoring, etc. These are necessary functions for DDS as a distributed middleware over unreliable networks, but they add constant CPU overhead, especially when aggregated across multiple DDS participants in a system.
(Note: we use “DDS” below to refer to the RMW implementation for brevity, though this applies to the RMW layer in general.)

How Agnocast addresses this
Agnocast replaces topic communication with true zero-copy via shared memory, eliminating the serialization/deserialization and copy overhead — the dominant source of CPU consumption, especially for large messages like point clouds. This works with your existing rclcpp::Node and should already provide a significant CPU reduction. For further optimization, switching to agnocast::Node bypasses the RMW layer entirely — no DDS participant is created and none of the DDS background threads are spawned — eliminating the remaining runtime overhead as well.

Reference measurement: Autoware’s shape_estimation node
We did some rough profiling using Intel VTune on Autoware’s shape_estimation node (with agnocast::Node) and observed around 80% CPU time reduction — the RMW/DDS runtime functions that had dominated the baseline profile were no longer present. This node performs relatively lightweight geometric fitting compared to the size of its input point cloud data, so the DDS overhead was particularly prominent in this case. The improvement will vary by node depending on the ratio of communication overhead to application logic, but we would expect a meaningful reduction in general. We are planning to run more systematic measurements soon and will share the results once available.

On CIE’s thread-per-callback model
Regarding your concern about CIE potentially increasing CPU usage — the impact is minimal. CIE’s callback threads are purely reactive: they block on epoll_wait and only wake up when an event is delivered to their file descriptor, consuming no CPU while sleeping. This is fundamentally different from DDS background threads, which wake up periodically for protocol maintenance regardless of application activity. Simply having more threads does not translate to higher CPU usage as long as they are not actively running.


Nice performance gains!

Are you implying that Autoware uses zero packages from the existing ROS ecosystem (except external ones like rosbag & rviz)? Because they won’t have Agnocast implementations, I guess.

We are happily using some autoware packages (mainly lidar drivers and lidar filtering). Can we still use these in the future or do we need to resort to using (partly) Agnocast?
Right now they work with any middleware of our choosing.

I am wondering: the path chosen was to implement agnocast as an API that is visible to the end user.

Why did you not implement an rmw backend using agnocast?

I had a brief look at the CIE implementation and noticed that it still uses the internal waitset approach. As far as I can see, you are still maintaining a global waitset for all callback groups, or am I missing something and you are only building the waitset per thread from the callback group?

I guess as you are on Humble the events callback infrastructure is not in place yet and you can’t use it. Note that especially for high-frequency applications the events approach shows a big reduction in CPU usage and improves latency across the board. Did you do any measurements where you compared the events executor to the CIE using a newer ROS version?

P.S.: I find the definition of ROS2 mainline in callback_isolated_executor/docs/comparison_with_other_executors.md at main · autowarefoundation/callback_isolated_executor · GitHub rather sketchy…
