Native rcl::tensor type

We propose introducing the concept of a tensor as a natively supported type in ROS 2 Lyrical Luth. Below is a sketch of how this would work, shared for initial feedback before we write a proper REP for review.

Abstract

Tensors are a fundamental data structure often used to represent multi-modal information for deep neural networks (DNNs) at the core of policy-driven robots. We introduce rcl::tensor as a native type in rcl: a container for memory that can optionally be externally managed. This type would be supported through all client libraries (rclcpp, rclpy, …), the ROS IDL (rosidl), and all RMW implementations. It enables tensor_msgs ROS messages, modeled on sensor_msgs, that use tensor in place of uint8[]. The default implementation of rcl::tensor operations for creation/destruction and manipulation will be available on all tiers of supported platforms. By installing an optional package and setting an environment variable, a platform-optimized implementation of the rcl::tensor operations can then be swapped in at runtime to take advantage of accelerator-managed memory and compute. Through adoption of rcl::tensor in developer code and ROS messages, we can enable seamless platform-specific acceleration, determined at runtime, without any recompilation or redeployment.

Motivation

ROS 2 should be accelerator-aware but accelerator-agnostic, like other popular frameworks such as PyTorch or NumPy. This enables package developers who conform to ROS 2 standards to gain platform-specific optimizations for free ("optimal where possible, compatible where necessary").

Background

AI robots and policy-driven physical agents rely on accelerated deep neural network (DNN) model inference through tensors. Tensors are a fundamental data structure for representing multi-dimensional data, from scalars (rank 0), vectors (rank 1), and matrices (rank 2) to batches of multi-channel matrices (rank 4). They can encode all data flowing through such graphs, including images, text, joint positions, poses, trajectories, IMU readings, and more.
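
To make rank and memory layout concrete, here is a minimal, self-contained C++ sketch of the shape and row-major strides of a rank-4 image batch; the NCHW layout is chosen purely for illustration.

#include <array>
#include <cstddef>
#include <iostream>

int main()
{
    // A rank-4 tensor: a batch of 32 RGB images at 224x224 (NCHW layout).
    std::array<std::size_t, 4> shape = {32, 3, 224, 224};

    // Row-major strides: elements to skip to advance one step along each
    // dimension (the innermost dimension varies fastest).
    std::array<std::size_t, 4> strides{};
    std::size_t running = 1;
    for (int i = 3; i >= 0; --i) {
        strides[i] = running;
        running *= shape[i];
    }

    std::cout << "total elements: " << running << "\n";  // prints 4816896
}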

Performing inference with these DNN model policies requires the tensors to reside in accelerator memory. ROS messages, however, expect their payloads to reside in main memory, with field types such as uint8[] or multi-dimensional arrays. These payloads must therefore be copied from main memory to accelerator memory and then copied back to main memory after processing in order to populate a new ROS message to publish. This quickly becomes the primary bottleneck for policy inference. Type adaptation in rclcpp provides a partial solution, but it requires all participating packages to have accelerator-specific dependencies and only applies within the client library; RMW implementations, for example, cannot take advantage of accelerator-optimized memory.

Additionally, without a canonical tensor type in ROS 2, a patchwork of different tensor libraries across ROS packages causes impedance mismatches with popular deep learning frameworks such as PyTorch.

Requirements

  • Provide a native way to represent tensors across all interfaces from client libraries through RMW implementations.
  • Make available a set of common operations on tensors that can be used by all interfaces.
  • Enable accelerated implementations of common tensor operations when available at runtime.
  • Enable accelerator memory management backing these tensors when available at runtime.
  • Optimize flow of tensors for deep neural network (DNN) model inference to avoid unnecessary memory copies.
  • Allow for backwards compatibility with all non-accelerated platforms.

Rough Sketch

#include <cstddef>
#include <vector>

namespace rcl
{
struct tensor
{
    std::vector<size_t> shape;   // extent of each dimension
    std::vector<size_t> strides; // stride of each dimension, in elements
    size_t rank;                 // number of dimensions

    union {
        void * data;   // pointer to the data when held in main memory
        size_t handle; // token stored by rcl::tensor for externally managed memory
    };
    size_t byte_size;    // size of the data in bytes

    data_type_enum type; // the element data type
};
}  // namespace rcl

Core Tensor APIs

Inline APIs available on all platforms in core ROS 2 rcl.

Creation

Create a new tensor from main memory.

  • rcl_tensor_create_copy_from_bytes(const void *data_ptr, size_t byte_size, data_type_enum type)
  • rcl_tensor_wrap_bytes(void *data_ptr, size_t byte_size, data_type_enum type)
  • rcl_tensor_create_copy_from(const struct rcl::tensor & tensor)
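
A hedged usage sketch of these creation functions follows; the rcl_tensor_t handle type and the RCL_TENSOR_FLOAT32 enumerator are assumed names, since the proposal does not yet pin down return types or the data type enum.

// Hypothetical usage; rcl_tensor_t and RCL_TENSOR_FLOAT32 are assumed names.
float joint_positions[7] = {0.0f, 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f};

// Copy the bytes into storage owned by the active tensor implementation
// (main memory by default, accelerator memory if one is swapped in).
rcl_tensor_t owned = rcl_tensor_create_copy_from_bytes(
    joint_positions, sizeof(joint_positions), RCL_TENSOR_FLOAT32);

// Wrap the caller's buffer without copying; the caller retains ownership.
rcl_tensor_t view = rcl_tensor_wrap_bytes(
    joint_positions, sizeof(joint_positions), RCL_TENSOR_FLOAT32);

// Deep-copy an existing tensor through the same implementation.
rcl_tensor_t clone = rcl_tensor_create_copy_from(owned);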

Common operations

Manipulations performed on tensors that can optionally be accelerated. The more complete these APIs are, the less fragmented the ecosystem will be, but the higher the burden on implementers. These should be modeled after the PyTorch tensor API and existing C tensor libraries such as libXM, or C++ libraries like xtensor.

  • reshape()
  • squeeze()
  • normalize()
  • fill()
  • zero()
  • …
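
A hedged sketch of chaining such operations; only the operation names come from the list above, while the argument conventions (and the image/scratch tensors) are assumptions for illustration.

// 'image' and 'scratch' are rcl_tensor_t values created as sketched earlier;
// all signatures below are assumptions.
size_t nchw[4] = {1, 3, 224, 224};
rcl_tensor_reshape(&image, nchw, 4);       // reinterpret layout, ideally no copy
rcl_tensor_normalize(&image, 0.5f, 0.5f);  // e.g., (x - mean) / stddev per element
rcl_tensor_zero(&scratch);                 // shorthand for fill(&scratch, 0)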

Managed access

Provide a way to access elements individually in parallel.

  • rcl_tensor_apply(<functor on each element with index>)
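
A hedged sketch of the calling convention this could take; the per-element (value, flat index) signature is an assumption, and a capture-free lambda is used so it can decay to a plain function pointer.

// The active implementation may run the functor in parallel, e.g., as a
// GPU kernel under rcl_tensor_cuda.
rcl_tensor_apply(&image, [](float & element, size_t /*index*/) {
    element /= 255.0f;  // e.g., scale 8-bit pixel values into [0, 1]
});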

Direct access

Retrieve the underlying data in main memory; this may involve movement of data.

  • void* rcl_tensor_materialized_data()
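
For example (hedged; the argument and return conventions are assumptions):

// Under the default implementation this is effectively a pointer lookup; under
// rcl_tensor_cuda it implies a device-to-host copy, so call it sparingly.
const float * host_view =
    static_cast<const float *>(rcl_tensor_materialized_data(&tensor));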

Other Conveniences

  1. rcl functions to check which tensor implementation is active.
  2. tensor_msgs::Image to mirror sensor_msgs::Image to enable smooth migration to the tensor type in common ROS messages. An alternative is to add a "union" field in sensor_msgs::Image alongside the uint8[] data field.
  3. cv_bridge API to convert between cv::Mat and tensor_msgs::Image.
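
To sketch convenience (3): neither tensor_msgs nor such a cv_bridge helper exists yet, so tensor_msgs::msg::Image, toTensorImageMsg, and toCvMatFromTensor below are hypothetical names used only for illustration.

#include <opencv2/imgcodecs.hpp>

// Hypothetical API: convert a cv::Mat into a tensor-backed image message.
cv::Mat frame = cv::imread("frame.png");
tensor_msgs::msg::Image msg = cv_bridge::toTensorImageMsg(frame);

// Hypothetical inverse, mirroring today's cv_bridge conversions.
cv::Mat round_trip = cv_bridge::toCvMatFromTensor(msg);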

Platform-specific tensor implementation

Without loss of generality, suppose we have an implementation of tensor that uses an accelerated library, such as rcl_tensor_cuda for CUDA. This package provides shared libraries that implement all of the core tensor APIs. Setting the environment variable RCL_TENSOR_IMPLEMENTATION=rcl_tensor_cuda enables loading rcl_tensor_cuda at runtime without rebuilding any other packages. Unlike the native implementation, rcl_tensor_cuda copies the input buffer into a CUDA buffer and uses CUDA to perform operations on that buffer.

It also provides new APIs for creating a tensor from a CUDA buffer, for checking whether the rcl_tensor_cuda implementation is active, and for accessing the CUDA buffer backing a tensor, available to any other package libraries that link to rcl_tensor_cuda directly. An RMW implementation linked against rcl_tensor_cuda would query the CUDA buffer backing a tensor and use optimized transport paths to handle it, while a general RMW implementation could just call rcl_tensor_materialized_data and transport the main-memory payload as normal.
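
A hedged sketch of the accelerator-aware versus portable paths this enables; rcl_tensor_cuda_is_active and rcl_tensor_cuda_get_buffer are assumed names for the new APIs described above, and the two run_* calls stand in for hypothetical downstream code.

if (rcl_tensor_cuda_is_active()) {
    // Requires linking against rcl_tensor_cuda directly.
    void * device_ptr = rcl_tensor_cuda_get_buffer(&tensor);
    run_cuda_inference(device_ptr);
} else {
    // Portable fallback: force the payload into main memory.
    void * host_ptr = rcl_tensor_materialized_data(&tensor);
    run_cpu_inference(host_ptr);
}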

Simple Examples

Example #1: rcl::tensor with "accelerator-aware" subscriber

Node A publishes a ROS message containing an rcl::tensor created from main-memory bytes and sends it to a topic Node B subscribes to. Node B happens to be written to first check whether the rcl::tensor is backed by externally managed memory AND whether rcl_tensor_cuda is active (indicating it is backed by CUDA). Node B has a direct dependency on rcl_tensor_cuda in order to perform this check.

Alternatively, Node B could have been written with no dependency on any rcl::tensor implementation, simply retrieving the bytes from the rcl::tensor and ignoring the externally managed memory flag altogether, which would force a copy back from accelerator memory in Scenario 2.

MyMsg.msg
---------
std_msgs/Header header
tensor payload

Scenario 1: RCL_TENSOR_IMPLEMENTATION = <none>
----------------------------------------------

┌─────────────────┐    ROS Message    ┌─────────────────┐
│   Node A        │ ────────────────► │   Node B        │
│                 │                   │                 │
│ ┌─────────────┐ │                   │ ┌─────────────┐ │
│ │Create Tensor│ │                   │ │Receive MyMsg│ │
│ │in MyMsg     │ │                   │ │             │ │
│ └─────────────┘ │                   │ └─────────────┘ │
│         │       │                   │         │       │
│         ▼       │                   │         ▼       │
│ ┌─────────────┐ │                   │ ┌─────────────┐ │
│ │Publish      │ │                   │ │Check if     │ │
│ │MyMsg        │ │                   │ │Externally   │ │
│ └─────────────┘ │                   │ │Managed      │ │
└─────────────────┘                   │ └─────────────┘ │
                                      │         │       │
                                      │         ▼       │
                                      │ ┌─────────────┐ │
                                      │ │Copy         │ │
                                      │ │to Accel Mem │ │
                                      │ └─────────────┘ │
                                      │         │       │
                                      │         ▼       │
                                      │ ┌─────────────┐ │
                                      │ │Process on   │ │
                                      │ │Accelerator  │ │
                                      │ └─────────────┘ │
                                      └─────────────────┘

Scenario 2: RCL_TENSOR_IMPLEMENTATION = rcl_tensor_cuda
--------------------------------------------------------

┌─────────────────┐    ROS Message    ┌─────────────────┐
│   Node A        │ ────────────────► │   Node B        │
│                 │                   │                 │
│ ┌─────────────┐ │                   │ ┌─────────────┐ │
│ │Create Tensor│ │                   │ │Receive MyMsg│ │
│ │in MyMsg     │ │                   │ │             │ │
│ └─────────────┘ │                   │ └─────────────┘ │
│         │       │                   │         │       │
│         ▼       │                   │         ▼       │
│ ┌─────────────┐ │                   │ ┌─────────────┐ │
│ │Publish MyMsg│ │                   │ │Check if     │ │
│ └─────────────┘ │                   │ │Externally   │ │
└─────────────────┘                   │ │Managed      │ │
                                      │ └─────────────┘ │
                                      │         │       │
                                      │         ▼       │
                                      │ ┌─────────────┐ │
                                      │ │Process on   │ │
                                      │ │Accelerator  │ │
                                      │ └─────────────┘ │
                                      └─────────────────┘

In Scenario 2, the same tensor function call in Node A creates a tensor backed by accelerator memory instead. This allows Node B, which checks for an rcl_tensor_cuda-managed tensor, to skip the extra copy.
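
To make Node B's branching concrete, here is a hedged sketch of its callback; my_pkg::msg::MyMsg, rcl_tensor_is_externally_managed, and the CUDA accessors are assumed names, and the process_* calls stand in for application code.

void on_my_msg(const my_pkg::msg::MyMsg & msg)
{
    if (rcl_tensor_is_externally_managed(&msg.payload) &&
        rcl_tensor_cuda_is_active())
    {
        // Scenario 2: the payload already lives in CUDA memory; no extra copy.
        process_on_accelerator(rcl_tensor_cuda_get_buffer(&msg.payload));
    } else {
        // Scenario 1: materialize to main memory, then upload and process.
        process_after_upload(rcl_tensor_materialized_data(&msg.payload));
    }
}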

Example #2: CPU versus accelerated implementations

SCENARIO 1: RCL_TENSOR_IMPLEMENTATION = <none> (CPU/Main Memory Path)
========================================================================

┌─────────────────────────────────────────────────────────────────────────────┐
│                              CPU/Main Memory Path                           │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Create    │    │  Normalize  │    │   Reshape   │    │ Materialize │
│   Tensor    │───►│  Operation  │───►│  Operation  │───►│    Bytes    │
│  [CPU Mem]  │    │   [CPU]     │    │   [CPU]     │    │  [CPU Mem]  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
        │                   │                   │                   │
        ▼                   ▼                   ▼                   ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Allocate    │    │ CPU-based   │    │ CPU-based   │    │ Return      │
│ main memory │    │ normalize   │    │ reshape     │    │ pointer to  │
│ for tensor  │    │ computation │    │ computation │    │ byte array  │
│ data        │    │ on CPU      │    │ on CPU      │    │ in main mem │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

Memory Layout:
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Main Memory                                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Tensor    │  │  Normalized │  │  Reshaped   │  │ Materialized│         │
│  │   Data      │  │   Tensor    │  │   Tensor    │  │    Bytes    │         │
│  │  [CPU]      │  │   [CPU]     │  │   [CPU]     │  │   [CPU]     │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────────────────────┘

SCENARIO 2: RCL_TENSOR_IMPLEMENTATION = rcl_tensor_cuda (GPU/CUDA Path)
=======================================================================

┌─────────────────────────────────────────────────────────────────────────────┐
│                              GPU/CUDA Path                                  │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Create    │    │  Normalize  │    │   Reshape   │    │ Materialize │
│   Tensor    │───►│  Operation  │───►│  Operation  │───►│    Bytes    │
│  [GPU Mem]  │    │   [CUDA]    │    │   [CUDA]    │    │  [CPU Mem]  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
        │                   │                   │                   │
        ▼                   ▼                   ▼                   ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Allocate    │    │ CUDA kernel │    │ CUDA kernel │    │ Copy from   │
│ GPU memory  │    │for normalize│    │ for reshape │    │ GPU to CPU  │
│ for tensor  │    │ computation │    │ computation │    │ memory      │
│ data        │    │ on GPU      │    │ on GPU      │    │ (cudaMemcpy)│
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

Memory Layout:
┌─────────────────────────────────────────────────────────────────────────────┐
│                              GPU Memory                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                          │
│  │   Tensor    │  │  Normalized │  │  Reshaped   │                          │
│  │   Data      │  │   Tensor    │  │   Tensor    │                          │
│  │  [GPU]      │  │   [GPU]     │  │   [GPU]     │                          │
│  └─────────────┘  └─────────────┘  └─────────────┘                          │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Main Memory                                    │
│  ┌─────────────┐                                                            │
│  │ Materialized│                                                            │
│  │    Bytes    │                                                            │
│  │   [CPU]     │                                                            │
│  └─────────────┘                                                            │
└─────────────────────────────────────────────────────────────────────────────┘

IMPLEMENTATION NOTES
===================

• Environment variable RCL_TENSOR_IMPLEMENTATION controls which path is taken
• Same API calls work in both scenarios (transparent to user code)
• GPU path requires CUDA runtime and rcl_tensor_cuda package
• Memory management handled automatically by implementation
• Backward compatibility maintained for CPU-only systems

Discussion Questions

  1. Should we constrain tensor creation functions to use memory allocators instead? rcl::tensor implementations would then need to provide custom memory allocators for externally managed memory, for example.

  2. Do we allow CPU-backed and externally managed tensors to mix in one runtime? What creation pattern would allow precompiled packages to "pick up" accelerated memory dynamically at runtime by default, but also explicitly opt out of it for specific tensors?

  3. Do we need to expose the concepts of "streams" and "devices" through the rcl::tensor API, or can they be kept under the abstraction layer? They are generic concepts but may constrain the underlying implementation too strongly. However, exposing them would let developers express stronger intent about how they want their code to be executed in an accelerator-agnostic manner.

  4. Which common tensor operations should we support? The more we choose, the higher the burden on rcl::tensor implementations, but the more standardized and less fragmented our ROS 2 developer base. For example, we do not want fragmentation where packages begin to depend on rcl_tensor_cuda directly and thus fall back to CPU-only under rcl_tensor_opencl (without loss of generality).

  5. Should tensors have a multi-block interface from the get-go? Assuming one memory address seems problematic for rank-4 tensors, for example (e.g., sets of images from multiple cameras).

  6. Should the ROS 2 canonical implementation of rcl::tensor be inline or based on an existing, open source library? If so, which one?

Summary

  • tensor as a native type in rcl, made available through all client libraries, the ROS IDL, and all RMW implementations, like string arrays or uint8[].
    • tensor_msgs::Image is sensor_msgs::Image but with tensor payload instead of uint8[].
    • Add cv_bridge functions to create tensor_msgs::Image from cv::Mat to spur adoption.
  • Implementations for tensor lifecycle and manipulation can be dynamically swapped at runtime with a package and an environment variable.
    • Data for tensors can then be optionally stored in externally managed memory, eliminating need for type adaptation in rclcpp.
    • Operations on tensors can then be optionally implemented with accelerated libraries.

Thanks for posting this (and on a Friday evening, no less :rocket:). I haven't had a chance to fully digest the proposal, but overall it does seem like a gap that we have in the ROS ecosystem for these types of applications.

Looking forward to giving it a deeper read and giving feedback.


Short feedback: I don't know if making this part of rcl is a good idea, as rcl is C. Perhaps putting this in a new package rcl_tensor that uses C++ internally but provides a C API makes more sense.

Some time ago there were discussions about type adaptation in ROS 2. Isn't that a framework that would basically allow everything proposed here?


I don't see the point of making tensor an rcl primitive. Your example implementation could just be a ROS message:

and the several utils you propose could just be part of a new library (e.g. ros2_tensor/ros2_tensor_vendor_cuda).

There is, for example, no rcl::image or rcl::pointcloud, and yet that does not prevent you from creating a cv::Mat or pcl::PointCloud from Image and PointCloud2 messages.

If you want to avoid unnecessary copies, you can always leverage intra-process communication to have your ROS modules re-use the same memory.

I understand one key feature you would like to have is the ability to share accelerator memory. For example, node A would upload a tensor to GPU 0, pass its handle to node B, which would then trigger some processing, and maybe a node C would collect the results. AFAIK, sharing accelerator memory is not as easy as passing a pointer: you would also have to share the whole accelerator context, which is likely not thread safe, so you would also need to implement a synchronization mechanism between all your nodes. But even if that is technically possible, what is the point? Since RAM is so much easier to share, it would be way simpler for node A to send input data to node B using intra-process communication, have node B upload the data, trigger the processing, and collect the results, then share them with node C, again using intra-process communication.


I definitely believe it's vital to bring up the concept of tensor integration with all of the ML stuff becoming more prominent in the industry. But I do have a lot of doubts about the need to integrate rcl::tensor into ROS' core libraries. As mentioned:

That seems to be the most justifiable reason not to integrate directly, just as there are no data containers such as rcl::vector or rcl::span. Why increase bloat, especially for those who will not need to use it?

That being said, it would be very beneficial to the entire community to have a separate library with things like tensors in mind, allowing easier integration of neural networks. This is amplified by the fact that we could use (for the C++ side) TypeAdaptations, which provide a big performance boost. This would work well for external interfaces which use PyTorch, LibTorch or another already very optimized library.

Certainly seems necessary to give users that option, as it's been the standard for many other extended types and could be hardware-specific too.

Yes, here's the original PR & corresponding REP 2007


How does isaac_ros_nitros fit into this picture? GitHub - NVIDIA-ISAAC-ROS/isaac_ros_nitros: NVIDIA Isaac Transport for ROS package for hardware-acceleration friendly movement of messages


Thanks, everyone, for taking a look and sharing your thoughts!

Adding this type to rcl with a C API rather than C++ will certainly be more problematic, but is there a better way to make the type available across RMW implementations and client libraries? Can we add a C++ library with C interfaces into rcl?

There was a suggestion to use custom IDL generation to map uint8[] in a ROS message to a memory handle in the runtime struct which seems compelling.

Type adaptation definitely gets us 90% of the way there from a technical perspective. NVIDIA helped fund improvements to type adaptation in ROS 2 Humble, and we used it as the basis of our NITROS to do exactly as described. However, the RMW implementation cannot access the runtime struct before it is converted to a ROS message, so this remains locked firmly into intra-process communication in rclcpp.

There is no rcl::image, true, but rcl::tensor is a fundamental data type by itself (a natural extension of the multi-dimensional array), and we do seem to have uint8[] and array-of-string types in rcl already. The construction, manipulation, and flow of tensors from sensors to DNN inference is the pattern we need to optimize for. Using a separate library for tensors is workable, but that would just lead to fragmentation, just as type adaptation can.

The goal as I see it is not just to solve the straightforward "zero-copy" technical challenge here. As mentioned, NITROS already does this, but only for packages specifically developed for NVIDIA platforms. Instead, what if ROS 2, like PyTorch and others, provided an interface and abstraction layer complete enough that a developer could build their packages against it and still benefit from accelerated optimizations available on the platform at runtime?

rcl::tensor is a proposal to help us along that path with tangible benefits, but open to other ideas of course.


This is an interesting proposal. I'm in favor of improving rosidl messaging, but I do have some questions and concerns (or just some ramblings I thought I'd share).

  1. I take it this is laying the ground for an rmw implementation that can pull off NITROS-like transport optimizations for GPU tensors (type negotiation, memory handles, etc.). Let's assume that can be done across processes, accounting for contexts and lifetimes. What does integration with other rmw implementations look like? Not all data is tensor data, and a ROS 2 system may span multiple machines. Does the path lead to more generated typesupport code for every relevant language times every implementation that wants the boost? I ask because tensor data is not the only form of data susceptible to transport optimization (someone mentioned images; I'd add video). If we are heading that way, I wonder (and I have been wondering for a while now) if it is time for some standardization of in-memory layouts and wire formats for rosidl messages. It'd help code reuse for sure (and a number of other things).

  2. While I do see the value of a unified interface for tensors, adding yet another library for tensor manipulation seems unwarranted. There are plenty of tensor libraries out there already. Some are well supported and widely adopted. Is it really necessary to couple behavior with data? Couldn't it be just some standard form of tensor data in messages? Sure, allocation and deallocation will need special handling, but otherwise users can do as they please with it. They can use xtensor::adapt, or torch::from_blob, or feed it to a tensorflow::Tensor as a TensorBuffer, or whatever makes sense for them. Tensor metadata can help pull up the right backend.


Any plans to create a PR in ros-infrastructure/rep (ROS Enhancement Proposals)?