Update on ROS native buffers

Hello ROS community,

As you may have heard, NVIDIA has been working on proposing and prototyping a mechanism to add support for native buffer types to ROS2, allowing ROS2 to natively support APIs that use accelerated buffers like CUDA or Torch tensors efficiently. We briefly touched on this in a previous Discourse post. Since then, many design discussions in the SIG PAI, as well as prototyping on our side, have taken place to turn that outline into a full-fledged proposal and prototype.

Below is a rundown of our current status, as well as an outlook of where the work is heading. We are looking forward to discussions and feedback on the proposal.

Native Buffers in ROS 2

Problem statement

Modern robots use advanced, high-resolution sensors to perceive their environment. Whether it’s cameras, LIDARs, time-of-flight sensors or tactile sensor arrays, data rates to be processed are ever-increasing.

Processing of those data streams has for the most part moved onto accelerated hardware that can exploit the parallel nature of the data. Whether that is GPUs, DSPs, NPUs/TPUs, ASICs or other approaches, those hardware engines share some common properties:

  • They are inherently parallel, and as such well suited to processing many small samples at the same time
  • They are dedicated hardware with dedicated interfaces and often dedicated memory

The second property of dedicated memory regions is problematic in ROS2, as the framework currently does not have a way to handle non-CPU memory.

Consider for example the sensor_msgs/PointCloud2 message, which stores data like this:

uint8[] data         # Actual point data, size is (row_step*height)

A similar approach is used by sensor_msgs/Image. In rclcpp, this will map to a member like

std::vector<uint8_t> data;

This is problematic for large pieces of data that are never going to be touched by the CPU. It forces the data to be present in CPU memory whenever the framework handles it, in particular for message transport, and every time it crosses a node boundary.

For truly efficient, fully accelerated pipelines, this is undesirable. In cases where there are one or more hardware engines handling the data, it is preferable for the data to stay resident in the accelerator, and never be copied into CPU memory unless a node specifically requests to do so.

We are therefore proposing to add the notion of pluggable memory backends to ROS2 by introducing a concept of buffers that share a common API, but are implemented with vendor-specific plugins to allow efficient storage and transport with vendor-native, optimized facilities.

Specifically, we are proposing to map uint8[] in rosidl to a custom buffer type in rclcpp that behaves like a std::vector<uint8_t> when used from CPU code, but otherwise automatically keeps the data resident in the vendor’s accelerator memory. This buffer type is also integrated with rmw to allow the backend to move the buffer between nodes using vendor-specific side channels, allowing transparent zero-copy transport of the data if implemented by the vendor.
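To make this more concrete, here is a minimal, self-contained sketch of how such a buffer type could behave. The BackendBuffer interface, its methods and the lazy materialization strategy are illustrative assumptions, not the actual proposed API; only the idea of a std::vector-compatible facade over a vendor-owned impl pointer comes from the proposal.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical vendor-side interface (illustrative, not the proposed API).
struct BackendBuffer {
  virtual ~BackendBuffer() = default;
  virtual std::size_t size_bytes() const = 0;
  // Copy the accelerator-resident data into CPU memory on demand.
  virtual std::vector<uint8_t> to_cpu() const = 0;
};

// Minimal sketch of a std::vector-compatible buffer facade.
template <typename T>
class Buffer {
public:
  Buffer() = default;
  explicit Buffer(std::shared_ptr<BackendBuffer> impl) : impl_(std::move(impl)) {}

  std::size_t size() const {
    return impl_ ? impl_->size_bytes() / sizeof(T) : cpu_data_.size();
  }

  // Any CPU-style element access lazily copies the data onto the CPU.
  T & operator[](std::size_t pos) { return materialize()[pos]; }

private:
  std::vector<T> & materialize() {
    if (impl_) {
      std::vector<uint8_t> raw = impl_->to_cpu();
      const T * begin = reinterpret_cast<const T *>(raw.data());
      cpu_data_.assign(begin, begin + raw.size() / sizeof(T));
      impl_.reset();  // the data now lives on the CPU
    }
    return cpu_data_;
  }

  std::vector<T> cpu_data_;              // valid once materialized on the CPU
  std::shared_ptr<BackendBuffer> impl_;  // valid while a backend owns the data
};
```

Existing code that only ever calls size() or operator[] would keep working unchanged, while accelerated code would go through the backend handle and never trigger the copy.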

Architecture overview

Message encoding

The following diagram shows the overview of a message containing a uint8[] array, and how it is mapped to C++, and then serialized:

It shows the following parts, which we will discuss in more detail later:

  • Declaration of a buffer using uint8[] in a message definition as before
  • Mapping onto a custom buffer type in rclcpp, called Buffer<T> here
  • The internals of the Buffer<T> type, in particular its std::vector<T>-compatible interface, as well as a pointer to a vendor-specific implementation
  • A vendor-specific backend providing serialization, as well as custom APIs

  • The message being encoded into a vendor-specific buffer descriptor message, which is serialized in place of the raw byte array in the message

Choice of uint8[] as trigger

It is worth explaining the choice of uint8[] as the trigger for generating Buffer<T> instances. An alternative approach would have been to add a new Buffer type to the IDL, and to translate that into Buffer<T>. However, this would not only break compatibility of the IDL, but also force the introduction of a sensor_msgs/PointCloud3 and similar data types, fracturing the message ecosystem further.

We believe the cost of maintaining a std::vector compatible interface and the slight loss of semantics is outweighed by the benefit of being drop-in compatible with both existing messages and existing code bases.

Integration with rclcpp (and rclpy and rclrs)

rclcpp exposes all uint8[] fields as rosidl_runtime_cpp::Buffer<T> members in their respective generated C++ structs.

rosidl_runtime_cpp::Buffer<T> has an interface fully compatible with std::vector<T>, including size(), operator[](size_type pos) and so on. If any of the std::vector<T> APIs is used, the data is copied onto the CPU as necessary, and all members work as expected. This maintains full compatibility with existing code: any code that expects a std::vector<T> in the message can use the corresponding fields as such without any code changes.

In order to access the underlying hardware buffers, the vendor-specific APIs are used. Suppose a vendor backend named vendor_buffer_backend exists; such a backend would usually contain a static method to convert a buffer to the native type. Our hypothetical vendor backend may then be used as follows:

void topic_callback(const msg::MessageWithTensor & input_msg) {
  vendor_native_handle input_h = vendor_buffer_backend::from_buffer(input_msg.data);

  msg::MessageWithTensor output_msg =     
    vendor_buffer_backend::allocate<msg::MessageWithTensor>();

  vendor_native_handle output_h = 
    vendor_buffer_backend::from_buffer(output_msg.data);

  output_h = input_h.some_operation();

  publisher_.publish(output_msg);
}

This code snippet does the following:

First, it extracts the native buffer handle from the message using a static method provided by the vendor backend. Vendors are free to expose this functionality through any interface they choose, but are encouraged to provide a static method for ease of use.

It then allocates the output message to be published using another vendor-specific interface. Note that this allocation creates an empty buffer: it only sets up the relationship between output_msg.data and the vendor_buffer_backend by creating an instance of the backend buffer and registering it in the impl field of the rosidl_runtime_cpp::Buffer<T> class.

The native handle from the output message is also extracted, so it can be used with the native interfaces provided.

Afterwards, it performs some native operations on the input data, and assigns the result of that operation to the output data. Note that this is happening on the vendor native data types, but since the handles are linked to the buffers, the results show up in the output message without additional code.

Finally, the output message is published like any other ROS2 message. rmw then takes care of vendor-specific serialization; see the following sections for details of that process.

This design keeps any vendor-specific code completely out of rclcpp. All that rclcpp sees and links against is the generic rosidl_runtime_cpp::Buffer<T> class, which has no direct ties to any specific vendor. Hence there is no need for rclcpp to even know about all vendor backends that exist.

It also allows vendors to provide specific interfaces for their respective platforms, allowing them to implement allocation and handling schemes particular to their underlying systems.

A similar type would exist for rclpy and rclrs. We anticipate that both will be easier to implement thanks to the duck typing facilities in rclpy and the traits-based object system in rclrs, respectively, which make it much simpler to build drop-in compatible types.

Backends as plugins

Backends are implemented as plugins using ROS’s pluginlib. On startup, each rmw instance scans for available backend-compatible plugins on the system, and registers them through pluginlib.

A standard implementation of a backend using CPU memory to offer std::vector<T> compatibility is provided by default through the ROS2 distribution, to ensure that there is always a CPU implementation available.

Additional vendor-specific plugins are implemented by the respective hardware vendors. For example, NVIDIA would implement and provide a CUDA backend, while AMD might implement and provide a ROCm backend.

Backends can either be distributed as individual packages, or be pre-installed on the target hardware. As an example, the NVIDIA Jetson systems would likely have a CUDA backend pre-installed as part of their system image.

Instances of rosidl_runtime_cpp::Buffer<T> are tied to a particular backend at allocation time, as illustrated in the section above.
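As an illustration of the registration step, the startup scan could populate a simple name-to-factory registry; in the actual proposal this role would be filled by pluginlib’s class loader, and all class and method names below are invented for this sketch.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical backend interface; in the real proposal this would be the
// pluginlib base class that vendor plugins implement.
struct BufferBackend {
  virtual ~BufferBackend() = default;
  virtual std::string name() const = 0;
};

// Simple registry standing in for pluginlib's ClassLoader: rmw would
// populate it at startup by scanning the installed plugin descriptions.
class BackendRegistry {
public:
  using Factory = std::function<std::unique_ptr<BufferBackend>()>;

  void register_backend(const std::string & name, Factory factory) {
    factories_[name] = std::move(factory);
  }

  bool available(const std::string & name) const {
    return factories_.count(name) != 0;
  }

  std::unique_ptr<BufferBackend> create(const std::string & name) const {
    auto it = factories_.find(name);
    return it != factories_.end() ? it->second() : nullptr;
  }

private:
  std::map<std::string, Factory> factories_;
};

// The default CPU backend shipped with the ROS2 distribution.
struct CpuBackend : BufferBackend {
  std::string name() const override { return "cpu"; }
};
```

A vendor package would add its own entry (say, a "cuda" backend) without rclcpp or rmw needing to know about it at compile time.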

Integration with rmw

rmw implementations can choose to integrate with vendor backends to provide accelerated transports. Implementations that do not integrate with backends, including all existing ones, automatically fall back to converting all data to CPU memory, and will continue working without any changes.

An rmw implementation that chooses to integrate with vendor backends does the following. At graph startup, when publishers and subscribers are created, each endpoint shares a list of installed backends, alongside vendor-specific data to establish any required side channels, and establishes dedicated channels for passing backend-enabled messages based on 4 different data points:

  • The message type for determining if it contains any buffer-typed fields
  • The list of backends supported by the current endpoint
  • The list of backends supported by the associated endpoint on the other side
  • The distance between the two endpoints (same process, different process, across a network etc.)

rmw can choose any mechanism it wants to perform this task, since this step happens entirely within the currently loaded rmw implementation. Side channel creation is hidden entirely inside the vendor plugins, and is not visible to rmw.

When publishing a message type that contains buffer-typed fields, if the publisher and the subscriber(s) share a supported backend, and that backend implements a serialization method matching the distance to the subscriber(s), the backend can use its custom serialization method instead of serializing the payload of the buffer bytewise.
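A sketch of what such a matching decision could look like, assuming everything is keyed by backend name; the types and the helper function are illustrative only, not part of the proposed rmw interface.

```cpp
#include <set>
#include <string>
#include <utility>

// Distance between two endpoints, as enumerated above.
enum class Distance { SameProcess, SameHost, Network };

// Hypothetical helper: pick a backend that both endpoints support and that
// ships a serializer for the given distance; otherwise fall back to the
// bytewise CPU path, which is always available.
std::string select_transport(
  bool message_has_buffers,
  const std::set<std::string> & local_backends,
  const std::set<std::string> & remote_backends,
  Distance distance,
  const std::set<std::pair<std::string, Distance>> & serializers)
{
  if (message_has_buffers) {
    for (const std::string & backend : local_backends) {
      if (remote_backends.count(backend) != 0 &&
          serializers.count({backend, distance}) != 0) {
        return backend;  // custom, possibly zero-copy serialization
      }
    }
  }
  return "cpu";  // bytewise fallback
}
```

Note how the same publisher can end up on the custom path for a same-host subscriber while falling back to bytewise serialization for a networked one.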

The backend is then free to serialize into a ROS message type of its choice. This backend-custom message type is called a descriptor. It should contain all information the backend needs to deserialize the message at the subscriber side, and reconstruct the buffer. This descriptor message may contain pointer values, virtual memory handles, IPC handles or even the raw payload if the backend chooses to send that data through rmw.

The descriptor message can be inspected as usual if desired since it is just a normal ROS2 message, but deserializing requires the matching backend. However, since the publisher knows the backends available to the subscriber(s), it is guaranteed that a subscriber only receives a descriptor message if it is able to deserialize it.
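As a purely hypothetical example of what a descriptor could carry, a CUDA backend using the CUDA IPC mechanism might define something along these lines (the message name and all fields are invented for illustration):

```
# cuda_buffer_backend/msg/CudaIpcDescriptor.msg (hypothetical)
string device_uuid    # GPU the allocation lives on
uint8[64] ipc_handle  # opaque cudaIpcMemHandle_t for the allocation
uint64 offset         # offset of the buffer within the allocation
uint64 size           # payload size in bytes
```

A subscriber with the matching backend would open the IPC handle and reconstruct the buffer without the payload ever passing through CPU memory.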

Integration with rosidl

While the above sections show the implications visible in rclcpp, the bulk of the changes necessary to make this happen go into rosidl. It is rosidl that generates the C++ message structures, and hence rosidl that maps uint8[] to the Buffer type instead of std::vector.

Layering semantics on top

Having only a buffer is not very useful, as most robotics data has higher-level semantics: images, tensors, point clouds and so on.

However, all of those data types ultimately map to one or more large, contiguous regions of memory, in CPU or accelerator memory.

We also observe that a healthy ecosystem of higher-level abstractions already exists: PCL for point clouds, Torch for tensor handling, and so on. Hence, we propose not to replicate those ecosystems in ROS, but instead to let them bridge into ROS and use the buffer abstraction as their backend for storage and transport.

As a demonstration of this, we are providing a Torch backend that allows linking (Py)Torch tensors to the ROS buffers. This allows users to use the rich ecosystem of Torch to perform tensor operations, while relying on the ROS buffers to provide accelerator-native storage and zero-copy transport between nodes, even across processes and chips if supported by the backend.

The Torch backend does not provide a raw buffer type itself, but relies on vendors implementing backends for their platforms (CUDA, ROCm, TPUs etc.). The Torch backend then depends on the vendor-specific backends and provides the binding of the low-level buffers to the Torch tensors. The coupling between the Torch backend and the hardware vendor buffer types is loose: it is not visible from the node’s code, but is established after the fact.

From a developer’s perspective, all of this is hidden. All a developer writing a node does is interact with a Torch buffer, which maps to the correct backend available on the current hardware automatically. An example of such code could look like this:

void topic_callback(const msg::MessageWithTensor & input_msg) {
  // extract tensor from input message
  torch::Tensor input_tensor =
    torch_backend::from_buffer(input_msg.tensor);

  // allocate output message
  msg::MessageWithTensor output_msg =
    torch_backend::allocate<msg::MessageWithTensor>();

  // get handle to allocated output tensor
  torch::Tensor output_tensor =
    torch_backend::from_buffer(output_msg.tensor);

  // perform some torch operations
  output_tensor = torch::abs(input_tensor);

  // publish message as usual
  publisher_.publish(output_msg);
}

Note how this code segment uses Torch-native datatypes (torch::Tensor) and performs Torch-native operations on the tensors (in this case, torch::abs). There is no mention of any hardware backend in the code.

By keeping the coupling loose, this node can run unmodified on NVIDIA, AMD, TPU or even CPU hardware, with the framework, in this case Torch, mapped to the correct hardware and receiving locally available acceleration for free.

Prior work

NITROS

https://docs.nvidia.com/learning/physical-ai/getting-started-with-isaac-ros/latest/an-introduction-to-ai-based-robot-development-with-isaac-ros/05-what-is-nitros.html

NITROS is NVIDIA’s implementation of a similar design based on type negotiation. It is specific to NVIDIA and not broadly compatible, nor is it currently possible to layer hardware-agnostic frameworks like Torch on top.

AgnoCast

https://github.com/tier4/agnocast

AgnoCast creates a zero-copy regime for CPU data. However, it is limited to CPU data, and does not have a plugin architecture for accelerator memory regions. It also requires kernel modifications, which some may find intrusive.

Future work

NVIDIA has been working on this proposal, alongside a prototype implementation that implements support for the mechanisms described above. We are working on CPU, CUDA and Torch backends, as well as integration with the Zenoh rmw implementation.

The prototype will move into a branch on the respective ROS repositories in the next two weeks, and continue development into a full-fledged implementation in public.

In parallel, a dedicated working group tasked with formalizing this effort is being formed, with the goal of reaching consensus on the design, and getting the required changes into ROS2 Lyrical.


Hi Karsten,

Thank you for sharing this comprehensive proposal and for the significant work you’re putting into prototyping the native buffer support for ROS2. It’s great to see NVIDIA taking the initiative to address the critical performance bottlenecks that roboticists face when working with accelerated hardware.

I think that one of the key strengths of your solution is the ability to send messages that mix different memory types within a single message structure and in an almost transparent way. In your Image example I understand that the metadata (header, height, width) is in CPU memory, while the raw image is in accelerator memory. This is a significant improvement over current approaches that force the full message content to be in the same memory space before the middleware can send it.

However, I see a potential limitation in the current design when a publisher needs to send data to both local and remote subscribers simultaneously.

In your architecture, the backend serialization mechanism differs depending on whether or not the subscriber is in the same memory domain. A scenario with both local and remote subscribers on the same topic seems problematic, unless the system performs two separate serializations (one with descriptors for the local subscriber, one with actual data for the remote subscriber) and manages two separate publications (potentially on different topics). This would likely fragment the topic space and complicate graph construction.

I believe there’s a more elegant solution possible when using Zenoh as the RMW implementation. Zenoh has native support for shared memory transport, but more importantly, its design allows payloads to be composed of multiple slices from different memory regions (CPU memory, GPU memory, etc.).

Here’s how this could work:

The metadata in CPU memory is serialized in a heap slice, while the data in accelerator memory is converted (without copy) into a SHM slice. The payload to be published is composed by those 2 slices. Then Zenoh transparently handles the sending of the message to the subscribers:

  • for local subscribers in the same SHM domain, the heap slice is sent over the network, while the SHM slice remains in SHM.

  • for remote subscribers, both slices are sent over the network, possibly with a copy from the SHM slice into an IO buffer.

All the subscribers will receive a similar payload which can be seen as a set of slices (via an iterator) or as a single bytes buffer.

This approach maintains the zero-copy benefits for local accelerated pipelines while gracefully degrading to necessary copies only when crossing network boundaries, all transparently to the application developer.

Given that you’re already working on integrating this with rmw_zenoh, we (ZettaScale) would be very interested in collaborating on implementing this multi-slice approach. Zenoh’s architecture is particularly well-suited to realize the full vision of your proposal while addressing the mixed subscriber scenario. Would you be open to discussing this further?

Looking forward to your thoughts and to the formation of the working group.


Hi Julien,

Thanks for taking the time to go through the proposal.

You’re absolutely right that there are more nuances to communicating with mixed peers.

Before getting into that, let me point out that the landscape of “shared memory” is a bit more complex. On the NVIDIA side alone, we have

  • the legacy CUDA IPC API exchanging memory handles
  • the newer VMM API that is more akin to SHM, and also allows some network-level exchange
  • the GPUDirect RDMA API, allowing for DMA access over PCIe and some network fabrics
  • the NvStreams / NvSci API that allows communication between different IPs, and across VMs
  • lastly, there’s UCX, which attempts to unify the above and also includes other vendors’ protocols (ROCm etc.)

Each of those has nontrivial semantics, and the connection topologies they support are not easy to detect. So between any two peers, it’s not easy to decide which of those protocols are applicable, and which one to choose. For example, even in the VMM / SHM case, we may have processes on the same host but with visibility into different GPUs. So blindly sending VMM handles won’t work, unless we know that the peers a) are on the same host and b) can actually utilize the handle.

Going back to your question: for the specific case of mixed subscribers with arbitrary connections between them, our proposal is to let the vendor plugins decide which transport to use. When a new subscriber, or more precisely a new endpoint, connects, a callback is invoked, and the current endpoint computes a group ID for that endpoint. RMW then uses the group ID to bundle subscribers reachable by the same transport, and serializes only once per group instead of once per subscriber.

Since the knowledge of who is compatible with what is enclosed in the vendor plugin, vendors can use their knowledge of the system to make close to optimal decisions on which transport to use for any given constellation.
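The bundling step could be sketched as follows, where the group ID is whatever opaque value the vendor plugin computed for each endpoint; the types and helpers are illustrative, not the actual interface:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// An endpoint together with the group ID the vendor plugin assigned to it.
// Endpoints sharing a group ID are reachable through the same transport.
struct Endpoint {
  std::string name;
  int group_id;
};

// Bucket subscribers by group so each group is serialized only once.
std::map<int, std::vector<Endpoint>> bucket_by_group(
  const std::vector<Endpoint> & subscribers)
{
  std::map<int, std::vector<Endpoint>> groups;
  for (const Endpoint & sub : subscribers) {
    groups[sub.group_id].push_back(sub);
  }
  return groups;
}

// Number of serializations needed: one per group, not one per subscriber.
std::size_t serializations_needed(const std::vector<Endpoint> & subscribers)
{
  return bucket_by_group(subscribers).size();
}
```

With, say, two same-host subscribers and one networked subscriber, this yields two serializations rather than three.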

That being said, it is completely legal for an RMW implementation to use its own mechanisms for this, for example if no vendor plugin is loaded. Since the existence of the buffer object is known to RMW, and it has access to the vendor backends, any RMW implementation can choose to attempt inlining the memory. So if Zenoh supports slices of different memory types, and it cannot find a vendor transport plugin to perform custom serialization, it might very well attempt to do this on its own.

However, in terms of ecosystem, I believe it is easier for everyone to put this responsibility on the respective hardware vendors.

For one, it makes life much easier for RMW implementers: they don’t have to worry about every single combination of vendor memories and their transports, but can delegate this entirely to the vendors.

It also makes life easier for the vendors, since they only have to implement their respective plugins once. They can build out an arbitrarily complex set of rules and dispatches to select the most appropriate transport for any given endpoint pairing, and it just works automatically in any RMW implementation that implements the delegation logic.

So overall, this separation of concerns reduces the load on the ecosystem and allows each participant to focus on their respective area of expertise.

But again, it is ultimately up to the RMW implementers to do what they think is best. They can choose to do nothing and ignore the whole buffer world, and just keep serializing as they did before. They can choose to implement the vendor backend logic, and delegate the logic. They can choose to implement their own logic to handle the buffers, and ignore the plugins. Or they can choose to mix and match any of those schemes. As long as they adhere to the RMW interface, they will still be a compliant RMW implementation.

All we do is to incentivize RMW implementers to use the plugins by proposing a significant gain for relatively little cost.


Thanks, Karsten, for this very good idea.

I also think it’s a problem that ROS2 cannot natively handle non-CPU memory.

I used to do:

  1. write a new ROS2 message which includes the handle instead of data[]
  2. send this new ROS2 message between nodes
  3. map the memory via the handle so that multiple accelerators can access it

but this requires many customized operations and only works in certain processes.

As for your solution, “establishes dedicated channels for passing backend-enabled messages based on 4 different data points” — does that mean additional ROS2 topics are needed to confirm the backend before communication?

And have you already created a GitHub repo for this? I am also interested in contributing.

In the case that a buffer is actually a torch::Tensor (and nothing else, right?), how do you prevent users from interpreting it differently, e.g., if we had an OpenCV object/buffer through a high-level OpenCV backend similar to the Torch backend? You can definitely detect when the wrong backend is used, but this creates friction.

The ROS way to solve this is, of course, to “encode” this higher-level semantic in the message, e.g., by creating a torch_msgs/msg/Tensor message with a single uint8[] data field. In the example above, msg::MessageWithTensor would have a torch_msgs/msg/Tensor tensor field, and you’d instead do this:

void topic_callback(const msg::MessageWithTensor & input_msg) {
  torch::Tensor input_tensor =
    torch_backend::from_buffer(input_msg.tensor.data);
  // ...
}

Then it’s clear to any node that subscribes to this topic that this buffer is a Torch tensor. Does this make sense?

There are two parts to the answer.

One is that the buffer class has a type field, which is essentially just a string. For a Torch buffer, it would literally say “torch”, and for a CUDA buffer, it would say “cuda”. For a Torch buffer dependent on a CUDA buffer, there are two buffers, one of type “torch” and the other of type “cuda”. There are internal checks to make sure that everything is compatible, and as a user, one usually doesn’t have to worry about that.
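As an illustration, such a check could be as simple as comparing type tags; the struct and function here are invented for the sketch and are not the actual internal API.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical view of a buffer's metadata: each buffer carries a type tag
// such as "torch" or "cuda" identifying the backend that owns it.
struct BufferInfo {
  std::string type;
};

// A backend would refuse to bind a buffer carrying a foreign type tag.
void check_backend_type(const BufferInfo & buffer, const std::string & expected)
{
  if (buffer.type != expected) {
    throw std::runtime_error(
      "buffer of type '" + buffer.type + "' used with '" + expected + "' backend");
  }
}
```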

The other is that yes, as you suggest, I would also expect that over time an ecosystem evolves where “typing” is also done at a higher level. The taxonomy for that is really up to the community. A torch_msgs/msg/Tensor as you suggest would be one approach. Another would be a std_msgs/Tensor that could map to any framework’s tensor, including Torch, JAX or some new framework that pops up six months from now. Either approach would be conceivable. Since they are tensors, one could imagine a high-level conversion between them, preserving semantics while going from one framework to another. This is however out of scope of our current proposal, and something that could be built on top later.
