We propose introducing the concept of a tensor as a natively supported type in ROS 2 Lyrical Luth. Below is a sketch of how this would work, shared for initial feedback before we write a proper REP for review.
Abstract
Tensors are a fundamental data structure often used to represent multi-modal information for the deep neural networks (DNNs) at the core of policy-driven robots. We introduce rcl::tensor as a native type in rcl: a container for memory that can optionally be externally managed. This type would be supported through all client libraries (rclcpp, rclpy, …), the ROS IDL (rosidl), and all RMW implementations. This enables tensor_msgs ROS messages, based on sensor_msgs, that use tensor instead of uint8[]. The default implementation of rcl::tensor operations for creation/destruction and manipulation will be available on all tiers of supported platforms. With an optional package and an environment variable, a platform-optimized implementation of rcl::tensor operations can then be swapped in at runtime to take advantage of accelerator-managed memory and compute. Through adoption of rcl::tensor in developer code and ROS messages, we can enable seamless platform-specific acceleration determined at runtime without any recompilation or redeployment.
Motivation
ROS 2 should be accelerator-aware but accelerator-agnostic, like other popular frameworks such as PyTorch or NumPy. This enables package developers who conform to ROS 2 standards to gain platform-specific optimizations for free ("optimal where possible, compatible where necessary").
Background
AI robots and policy-driven physical agents rely on accelerated deep neural network (DNN) model inference through tensors. Tensors are a fundamental data structure for representing multi-dimensional data, from scalars (rank 0), vectors (rank 1), and matrices (rank 2) to batches of multi-channel matrices (rank 4). These can be used to encode all data flowing through such graphs, including images, text, joint positions, poses, trajectories, IMU readings, and more.
Performing inference with these DNN model policies requires the tensors to reside in accelerator memory. ROS messages, however, expect their payloads to reside in main memory with field types such as uint8[] or multi-dimensional arrays. Payloads must therefore be copied from main memory to accelerator memory, and then copied back to main memory after processing in order to populate a new ROS message to publish. This quickly becomes the primary bottleneck for policy inference. Type adaptation in rclcpp provides a partial solution, but it requires all participating packages to take accelerator-specific dependencies, and it only applies within the client library, so RMW implementations cannot, for example, take advantage of accelerator-optimized memory.
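For context, here is a rough sketch of the round trip a typical accelerated node performs today. The run_inference function and the output_size parameter are placeholders; the CUDA calls are illustrative of the host/device copies the proposal aims to eliminate.

  #include <cstddef>
  #include <cuda_runtime.h>
  #include <sensor_msgs/msg/image.hpp>

  // Placeholder for DNN execution on the accelerator.
  void run_inference(const void * device_in, void * device_out);

  // Today's pattern: the payload arrives in main memory, is copied to the
  // accelerator for inference, and is copied back to fill a new ROS message.
  sensor_msgs::msg::Image process(const sensor_msgs::msg::Image & msg, size_t output_size)
  {
    void * device_in = nullptr;
    void * device_out = nullptr;
    cudaMalloc(&device_in, msg.data.size());
    cudaMalloc(&device_out, output_size);

    // Host -> device copy before inference.
    cudaMemcpy(device_in, msg.data.data(), msg.data.size(), cudaMemcpyHostToDevice);
    run_inference(device_in, device_out);

    // Device -> host copy just to populate the outgoing message.
    sensor_msgs::msg::Image out;
    out.data.resize(output_size);
    cudaMemcpy(out.data.data(), device_out, output_size, cudaMemcpyDeviceToHost);

    cudaFree(device_in);
    cudaFree(device_out);
    return out;
  }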
Additionally, without a canonical tensor type in ROS 2, a patchwork of different tensor libraries across various ROS packages is causing impedance mismatches with popular deep learning frameworks including PyTorch.
Requirements
- Provide a native way to represent tensors across all interfaces from client libraries through RMW implementations.
- Make available a set of common operations on tensors that can be used by all interfaces.
- Enable accelerated implementations of common tensor operations when available at runtime.
- Enable accelerator memory management backing these tensors when available at runtime.
- Optimize flow of tensors for deep neural network (DNN) model inference to avoid unnecessary memory copies.
- Allow for backwards compatibility with all non-accelerated platforms.
Rough Sketch
struct rcl::tensor
{
  std::vector<size_t> shape;    // shape of the tensor
  std::vector<size_t> strides;  // strides of the tensor
  size_t rank;                  // number of dimensions
  union {
    void * data;    // pointer to the data in main memory
    size_t handle;  // token stored by rcl::tensor for externally managed memory
  };
  size_t byte_size;     // size of the data in bytes
  data_type_enum type;  // the data type
};
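For illustration only, here is how these fields might describe a dense, row-major 480x640x3 uint8 image held in main memory. The strides-in-elements convention and the UINT8 enumerator name are assumptions, not part of the sketch above.

  // Hypothetical field values for a dense, row-major 480x640x3 uint8 image:
  //   shape     = {480, 640, 3}
  //   strides   = {640 * 3, 3, 1}   // assuming strides are counted in elements
  //   rank      = 3
  //   data      = pointer to the image bytes in main memory (union member in use)
  //   byte_size = 480 * 640 * 3
  //   type      = data_type_enum::UINT8   // assumed enumerator name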
Core Tensor APIs
Inline APIs available on all platforms in core ROS 2 rcl.
Creation
Create a new tensor from main memory.
rcl_tensor_create_copy_from_bytes(const void * data_ptr, size_t byte_size, data_type_enum type)
rcl_tensor_wrap_bytes(void * data_ptr, size_t size, data_type_enum type)
rcl_tensor_create_copy_from(const struct rcl::tensor & tensor)
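A hedged usage sketch of these creation functions, assuming they return an rcl::tensor by value; the return convention and the UINT8 enumerator are assumptions on top of the signatures above.

  #include <cstdint>
  #include <vector>

  std::vector<uint8_t> image(480 * 640 * 3, 0);  // some main-memory payload

  // Copying creation: the tensor owns its own copy of the bytes.
  rcl::tensor copied = rcl_tensor_create_copy_from_bytes(
    image.data(), image.size(), data_type_enum::UINT8);

  // Wrapping creation: the tensor refers to the caller's buffer without copying,
  // so `image` must outlive the tensor.
  rcl::tensor wrapped = rcl_tensor_wrap_bytes(
    image.data(), image.size(), data_type_enum::UINT8);

  // Deep copy from an existing tensor.
  rcl::tensor duplicate = rcl_tensor_create_copy_from(copied);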
Common operations
Manipulations performed on tensors that can be optionally accelerated. The more complete these APIs are, the less fragmented the ecosystem will be, but the higher the burden on implementers. These should be modeled after the PyTorch tensor API and existing C tensor libraries such as libXM, or C++ libraries like xtensor.
- reshape()
- squeeze()
- normalize()
- fill()
- zero()
- …
Managed access
Provide a way to access elements individually in parallel.
rcl_tensor_apply(<functor on each element with index>)
Direct access
Retrieve the underlying data in main memory; this may involve movement of data.
void* rcl_tensor_materialized_data()
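Continuing the creation sketch above, a node might chain a couple of common operations and then materialize the result. The rcl_tensor_reshape/rcl_tensor_normalize spellings and the tensor parameter of rcl_tensor_materialized_data are assumptions; the proposal only names the operations.

  // Hedged sketch: manipulate a tensor, then retrieve main-memory bytes.
  rcl_tensor_reshape(copied, {1, 480, 640, 3});  // treat the image as an NHWC batch of one
  rcl_tensor_normalize(copied, 0.0f, 1.0f);      // scale values into [0, 1]

  // With the default implementation this returns the existing main-memory buffer;
  // an accelerated implementation may need a device-to-host copy here.
  void * bytes = rcl_tensor_materialized_data(copied);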
Other Conveniences
- rcl functions to check which tensor implementation is active.
- tensor_msgs::Image to mirror sensor_msgs::Image, to enable smooth migration to using the tensor type in common ROS messages (a sketch of such a message follows this list). An alternative is to add a "union" field in sensor_msgs::Image with the uint8[] data field.
- cv_bridge API to convert between cv::Mat and tensor_msgs::Image.
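To make the tensor_msgs::Image idea concrete, here is a hedged sketch of what the message could look like. The field set simply mirrors sensor_msgs/Image with the data field retyped; the final layout would be settled in the REP.

  # tensor_msgs/Image.msg (sketch only; mirrors sensor_msgs/Image)
  std_msgs/Header header
  uint32 height
  uint32 width
  string encoding
  uint8 is_bigendian
  uint32 step        # may become redundant once strides live in the tensor
  tensor data        # was: uint8[] data in sensor_msgs/Image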
Platform-specific tensor implementation
Without loss of generality, suppose we have an implementation of tensor that uses an accelerated library, such as rcl_tensor_cuda for CUDA. This package provides shared libraries that implement all of the core tensor APIs. Setting the environment variable RCL_TENSOR_IMPLEMENTATION=rcl_tensor_cuda enables loading rcl_tensor_cuda at runtime without rebuilding any other packages. Unlike the native implementation, rcl_tensor_cuda copies the input buffer into a CUDA buffer and uses CUDA to perform operations on that CUDA buffer.
It also provides new APIs for creating a tensor from a CUDA buffer, for checking whether the rcl_tensor_cuda implementation is active, and for accessing the CUDA buffer backing a tensor, available to any other packages that link to rcl_tensor_cuda directly. An RMW implementation linked against rcl_tensor_cuda would query the CUDA buffer backing a tensor and use optimized transport paths to handle it, while a general RMW implementation could simply call rcl_tensor_materialized_data and transport the main-memory payload as normal.
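As a hedged sketch of the publisher side, a node that already holds a CUDA buffer might hand it to a tensor directly. The rcl_tensor_cuda_is_active and rcl_tensor_cuda_create_from_buffer spellings are placeholders for the "check whether active" and "create from a CUDA buffer" APIs described above; only the capability comes from this proposal.

  #include <cstddef>
  #include <cstdint>
  #include <cuda_runtime.h>
  #include <vector>

  // Hypothetical: wrap an existing CUDA allocation in an rcl::tensor without a copy,
  // but only when the CUDA implementation has been selected at runtime.
  rcl::tensor tensor_from_device_buffer(void * device_ptr, size_t byte_size)
  {
    if (rcl_tensor_cuda_is_active()) {
      // Zero-copy: the tensor records a handle to the externally managed CUDA buffer.
      return rcl_tensor_cuda_create_from_buffer(device_ptr, byte_size, data_type_enum::UINT8);
    }
    // Fallback: bring the data back to main memory and use the core creation API.
    std::vector<uint8_t> host(byte_size);
    cudaMemcpy(host.data(), device_ptr, byte_size, cudaMemcpyDeviceToHost);
    return rcl_tensor_create_copy_from_bytes(host.data(), byte_size, data_type_enum::UINT8);
  }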
Simple Examples
Example #1: rcl::tensor with "accelerator-aware" subscriber
Node A publishes a ROS message with an rcl::tensor created from main memory bytes and sends it to a topic that Node B subscribes to. Node B happens to be written to first check whether the rcl::tensor is backed by externally managed memory AND whether rcl_tensor_cuda is active (indicating the tensor is backed by CUDA memory). Node B has a direct dependency on rcl_tensor_cuda in order to perform this check (a sketch of this callback logic follows the diagrams below).
Alternatively, Node B could have been written with no dependency on any rcl::tensor implementation, simply retrieving the bytes from the rcl::tensor and ignoring the externally managed memory flag altogether, which would have forced a copy back from accelerator memory in Scenario 2.
MyMsg.msg
---------
std_msgs/Header header
tensor payload
Scenario 1: RCL_TENSOR_IMPLEMENTATION = <none>
----------------------------------------------
  Node A:
    Create Tensor in MyMsg
        |
        v
    Publish MyMsg
        |
        |  ROS Message
        v
  Node B:
    Receive MyMsg
        |
        v
    Check if Externally Managed
        |
        v
    Copy to Accel Mem
        |
        v
    Process on Accelerator
Scenario 2: RCL_TENSOR_IMPLEMENTATION = rcl_tensor_cuda
--------------------------------------------------------
  Node A:
    Create Tensor in MyMsg
        |
        v
    Publish MyMsg
        |
        |  ROS Message
        v
  Node B:
    Receive MyMsg
        |
        v
    Check if Externally Managed
        |
        v
    Process on Accelerator
In Scenario 2, the same tensor function call in Node A creates a tensor backed by accelerator memory instead. This allows Node B, which was checking for an rcl_tensor_cuda-managed tensor, to skip the extra copy.
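For concreteness, here is a hedged sketch of the Node B callback logic described in this example. The rcl_tensor_is_externally_managed, rcl_tensor_cuda_is_active, and rcl_tensor_cuda_get_buffer names, the my_pkg package, and the processing helpers are placeholders; rcl_tensor_materialized_data is shown taking the tensor as a parameter, which is also an assumption.

  // Node B: accelerator-aware subscription callback (sketch only).
  void on_my_msg(const my_pkg::msg::MyMsg & msg)
  {
    const rcl::tensor & t = msg.payload;

    if (rcl_tensor_is_externally_managed(t) && rcl_tensor_cuda_is_active()) {
      // Scenario 2: the payload already lives in CUDA memory; no extra copy needed.
      process_on_accelerator(rcl_tensor_cuda_get_buffer(t), t.byte_size);
    } else {
      // Scenario 1: copy the main-memory bytes into accelerator memory first.
      void * device_ptr = copy_to_accel_mem(rcl_tensor_materialized_data(t), t.byte_size);
      process_on_accelerator(device_ptr, t.byte_size);
    }
  }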
Example #2: CPU versus accelerated implementations
SCENARIO 1: RCL_TENSOR_IMPLEMENTATION = <none> (CPU/Main Memory Path)
========================================================================
  CPU/Main Memory Path:

    Create Tensor [CPU Mem]       - allocate main memory for tensor data
        |
        v
    Normalize Operation [CPU]     - CPU-based normalize computation
        |
        v
    Reshape Operation [CPU]       - CPU-based reshape computation
        |
        v
    Materialize Bytes [CPU Mem]   - return pointer to byte array in main memory

  Memory Layout:
    Main Memory: Tensor Data [CPU], Normalized Tensor [CPU], Reshaped Tensor [CPU],
                 Materialized Bytes [CPU]
SCENARIO 2: RCL_TENSOR_IMPLEMENTATION = rcl_tensor_cuda (GPU/CUDA Path)
=======================================================================
  GPU/CUDA Path:

    Create Tensor [GPU Mem]       - allocate GPU memory for tensor data
        |
        v
    Normalize Operation [CUDA]    - CUDA kernel for normalize computation on GPU
        |
        v
    Reshape Operation [CUDA]      - CUDA kernel for reshape computation on GPU
        |
        v
    Materialize Bytes [CPU Mem]   - copy from GPU to CPU memory (cudaMemcpy)

  Memory Layout:
    GPU Memory:  Tensor Data [GPU], Normalized Tensor [GPU], Reshaped Tensor [GPU]
        |
        v
    Main Memory: Materialized Bytes [CPU]
IMPLEMENTATION NOTES
===================
- Environment variable RCL_TENSOR_IMPLEMENTATION controls which path is taken
- Same API calls work in both scenarios (transparent to user code)
- GPU path requires the CUDA runtime and the rcl_tensor_cuda package
- Memory management handled automatically by the implementation
- Backward compatibility maintained for CPU-only systems
Discussion Questions
- Should we constrain tensor creation functions to using memory allocators instead? rcl::tensor implementations would need to provide custom memory allocators for externally managed memory, for example.
- Do we allow for mixed runtimes of CPU-backed and externally managed tensors in one runtime? What creation pattern would allow precompiled packages to "pick up" accelerated memory dynamically at runtime by default, but also explicitly opt out of it for specific tensors?
- Do we need to expose the concepts of "streams" and "devices" through the rcl::tensor API, or can they be kept under the abstraction layer? They are generic concepts but may too strongly prescribe the underlying implementation. However, exposing them would let developers express stronger intent about how they want their code executed in an accelerator-agnostic manner.
- What common tensor operations should we keep as supported? The more we choose, the higher the burden on rcl::tensor implementations, but the more standardized and less fragmented our ROS 2 developer base. For example, we do not want fragmentation where packages begin to depend on rcl_tensor_cuda and thus fall back only to CPU for rcl_tensor_opencl (without loss of generality).
- Should tensors have a multi-block interface from the get-go? Assuming one memory address seems problematic for rank 4 tensors, for example (e.g., sets of images from multiple cameras).
- Should the ROS 2 canonical implementation of rcl::tensor be inline or based on an existing, open source library? If so, which one?
Summary
- tensor as a native type in rcl, made available through all client libraries, the ROS IDL, and all RMW implementations, like string array or uint8[].
- tensor_msgs::Image is sensor_msgs::Image but with a tensor payload instead of uint8[].
- Add cv_bridge functions to create tensor_msgs::Image from cv::Mat to spur adoption.
- Implementations for tensor lifecycle and manipulation can be dynamically swapped at runtime with a package and an environment variable.
- Data for tensors can then be optionally stored in externally managed memory, eliminating the need for type adaptation in rclcpp.
- Operations on tensors can then be optionally implemented with accelerated libraries.