Is it time to discuss rosidl?

TL;DR: I am of the opinion that some of the performance bottlenecks we still see in ROS 2 can be traced back to rosidl design. There are benefits to language-specific runtime representations and vendor-specific serialization formats, but efficiency is not one of them. Other designs may be better suited to the kind of data streams that are common in robotics. In that sense, Ekumen-OS/flatros2 may be an interesting conversation starter.


Howdy! I don’t post often on ROS Discourse but I thought this may be worthwhile. The Physical AI rebranding of robotics is drawing attention and resources, and in that whirlwind I keep seeing new libraries and frameworks showcasing performance figures that seemingly obliterate those of ROS 2 (like dora-rs/dora-benchmark, but there are others). Why? The total amount of engineering resources invested by this community in ROS 2 far exceeds that of any other new project, and yet I still find myself second-guessing ros2 topic hz output. Well, I’ve been around and about for a good while now, and I have a hypothesis.

rosidl is one of the oldest corners of ROS 2. C and C++ message generators were first released with Alpha 1, each with its own runtime representation: simple enough, language-specific. The first few DDS-based middlewares (like OpenSplice) had their own vendor-specific IDL to comply with, for interoperability and full feature support, and so type support code and internal runtime representations had to be generated. A decade later, ROS 2 is still bound by this design.

Zero-copy transports cannot cross the language boundary because there’s no common runtime representation, and because in-memory layouts are nonlocal (generated structs point to heap-allocated strings and sequences), even for the same language their scope of application is extremely narrow (so narrow not even standard messages qualify). Middlewares (de)serialize messages to vendor-specific formats that keep them traceable within their domain and afford us features like keyed topics, whether or not that makes sense for a given form of data. Repeated (de)serialization of images and pointclouds (and other forms of multi-dimensional data) certainly does not help speed.
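To make the “nonlocal layout” point concrete, here is a simplified stand-in for a rosidl-generated C++ struct (not the actual generated code):

```cpp
// Simplified stand-in for a generated message struct: the message is NOT one
// contiguous block of memory, because strings and sequences own heap storage
// behind pointers.
#include <cstdint>
#include <string>
#include <vector>

struct Image {
  std::string frame_id;            // points into this process's heap
  std::uint32_t height;
  std::uint32_t width;
  std::vector<std::uint8_t> data;  // points into this process's heap
};
// Handing &msg to another process is meaningless: the embedded pointers are
// only valid in the sender's address space, hence (de)serialization on every
// hop across a process or language boundary.
```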

I honestly didn’t know if there was a way out of this. Some of these shortcomings cannot be fixed out of tree. So I started Ekumen-OS/flatros2 with some Ekumen colleagues as an experiment. It turns out there are lots of pitfalls and limitations but there is a way. Ekumen-OS/flatros2 is NOT it, however. A true solution (I believe) must be a core solution, and Ekumen-OS/flatros2 is just an exercise in what that future solution may look like, i.e. messages as language-specific views into contiguous memory layouts, bounded on write, unbounded on read. The choice of flatbuffers and iceoryx2 was informed but arbitrary. Both have interesting properties nonetheless.

Hope this helps kickstart a discussion. There’s no fundamental reason why ROS 2 cannot perform near I/O limits. And who knows, maybe there’s enough momentum to sort out message compatibility too while we are at it (and I’d very much appreciate backward- and forward-compatible bags).


Thanks @hidmic, and always good to see you around 🙂

I’m going to put this on the PMC agenda for next week, but would be happy to see any other input here before we discuss it in person.

It turns out there are lots of pitfalls and limitations but there is a way.

Can you expand on the pitfalls/limitations here?


I really like this thinking.


Glad to see this work out in the open, what good timing 😉!

From our perspective flatbuffers is also very compelling on the RTOS side (used with zenoh-pico). I personally have already played with a “hacky” rosidl_adapter that limits the buffer size of a string for embedded use, mapping it to a [ubyte:64] struct (for static allocation of a given FBS with flatcc). I’m excited to see where this goes; I do think there are a lot of potential performance gains being left on the table when comparing where we are currently at with other smaller/newer projects.

I’m also quite curious about this. However, it would have to be proven that the serdes savings are not bought with runtime inefficiencies (i.e. losing vectorization when not using language/library-native memory layouts).

Hey @hidmic! Thanks for bringing it up.

So, what’s the proposal? If I understand you correctly, your theory is that rosidl is responsible for a significant part of the gap between ROS 2 and other frameworks, which seems plausible to me, but what part should we change?

Are the C/C++ data structures generated from the type definitions the problem that needs to be addressed? (e.g. rosidl/rosidl_generator_cpp/resource/msg__struct.hpp.em at 30877f52f5f19902bedb89f67bf4bafb2c6eae12 · ros2/rosidl · GitHub for C++?)

Or is it the need to do zero-copy?

Of course these things are related, but I’m trying to understand what we could concretely do to improve this. I think zero-copy between processes using the same language is possible (https://docs.ros.org/en/humble/How-To-Guides/Configure-ZeroCopy-loaned-messages.html) using loaned messages, and even with the standard rosidl C++ structs if you’re using plain old data (no strings or sequences). To do this between different languages you would need something like flatbuffers.

And if you want to use something like flatbuffers or capnproto or the like, then I think you could do that by adding additional rosidl generators. We have an “official” one for Python that provides simple Python objects similar to the ones in ROS 1 based on the message definition; you could have additional ones for C++ and Python that present different data structures (think #include "sensor_msgs/msg_flatbuffer/image.hpp" rather than #include "sensor_msgs/msg/image.hpp"). That would allow users to use these other data structures for any user-defined message type. But if you want it to be very efficient then the middleware needs to understand these new types, otherwise you’re limited to copying from your preferred user-facing type to the type the middleware understands, or serializing it to the wire format that the middleware understands, neither of which is particularly efficient nor lends itself to zero-copy.

But even in those cases you’d still have the rosidl pipeline (type definitions → machine-readable type definition → language- or serialization-library-specific code).

It looks like you’ve avoided the need for new rosidl_generator_flatbuffer-like packages in your flatros2 PoC by using some reflection (a la flatros2/flatros2/include/flatros2/message.hpp at 8a8ad51ffbe363c8e4d8909b548de075d4b26ceb · Ekumen-OS/flatros2 · GitHub), and you’re using rosidl_typesupport_introspection_cpp to handle support in the middleware, which is nice because it’s mostly above the rosidl/rmw level. However, to gain more optimizations, or for a marshaling library like protobuf or arrow (or even to use flatbuffers better), you’d probably want some build-time step, which is where a rosidl_generator_cpp_X/rosidl_runtime_cpp_X/rosidl_typesupport_cpp_X-like set of packages would come in. With that in mind, I guess I don’t know which parts of rosidl need to change, because it seems like, at least in theory, it should be possible to solve these performance problems.

So is the proposal just to build some of the packages I described above, or is it to change the “rosidl pipeline” somehow? Or is the conversation more about changing the defaults in some of these cases, in addition to building the alternatives in the first place?

Maybe the answer is just making what we have better? For example, I believe (someone correct me if I’m wrong) the dora-rs benchmarks are comparing against rclpy? If that’s the case then we could possibly make rclpy’s story better by having a rosidl_typesupport_XYZ_py set of packages, so the middlewares could handle PyObject * directly from our users’ Python code. Right now we have to convert from the PyObject * to our C-style struct for the message before handing it to the middleware via rcl_publish()/rcl_take*(), which is very inefficient, especially for large data structures like images and pointclouds, even though we’ve tried to improve the performance there using optimizations and things like numpy.

I hope so too, and there’s a REP for how to do this; it just needs resources. And I personally believe the strategy in what was proposed as REP-2011 is a “yes, and” for the idea of alternative serialization libraries, as I believe it complements features that people already use to evolve types (like optional fields), in most libraries I’ve studied at least.


Curious to see what you and others think.


Off the top of my head (and I may be forgetting about a few things):

  • C and C++ runtime representations map message members straight to struct members. That makes it hard to evolve them. In particular, it is really hard to decouple storage from interface in a backwards-compatible way. The flatros2 experiment approximates the C++ runtime representation using lvalue reference members and operator overloads (see the sketch after this list). C is too far gone. Python is a lot more forgiving, with data descriptors and all.
  • Loaning APIs all the way down to the rmw layer seem to have been designed for POD messages. They assume (and require) that message types fully define their in-memory layout, and there’s little margin to defer that to runtime – largely because there’s also the assumption that message type support information is invariant. flatros2 solves this with dynamic typesupport and pre-serialized message prototypes, but those have a memory footprint that may be undesirable.
  • Ownership over loans is not fully sorted out (which we knew already, see Honor the user holding onto shared_ptrs during subscription callbacks · Issue #2401 · ros2/rclcpp · GitHub ). We also lack suitable idioms to ensure users don’t go about exhausting middleware resources by holding onto loans. The flatros2 experiment invalidates views for this.
  • rclpy has no support for loaning whatsoever.
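To illustrate the reference-member trick mentioned in the first bullet, here is a minimal sketch (not the actual flatros2 code; the type, field offsets, and bounds are made up) of a view that keeps familiar member-access syntax while the storage lives in one contiguous, loanable buffer:

```cpp
// Hypothetical message "view": lvalue reference members bind to locations
// inside a single contiguous buffer, so `view.height = 240;` writes straight
// into (potentially shared) memory while member syntax stays familiar.
#include <cstddef>
#include <cstdint>
#include <span>

struct ImageView {
  std::uint32_t & height;
  std::uint32_t & width;
  std::span<std::uint8_t> data;  // bounded at allocation time

  // Assumes `buffer` is suitably aligned and large enough for this layout.
  ImageView(std::span<std::byte> buffer, std::size_t data_size)
  : height(*reinterpret_cast<std::uint32_t *>(buffer.data())),
    width(*reinterpret_cast<std::uint32_t *>(buffer.data() + 4)),
    data(reinterpret_cast<std::uint8_t *>(buffer.data() + 8), data_size) {}
};
```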

Also, I have yet to find a zero-copy transport with support for multi-stage inter-process pipelines, though it appears to be on the roadmap for iceoryx2. Think: process A allocates shared resource M and passes it to process B, process B modifies resource M and passes it to process C, and so on. It’s a common pattern that right now cannot escape (de)allocations and copies.

That needs attention, yes. I’d think that as long as memory layout and access patterns are appropriate for modern processor architectures we need not sacrifice performance. The flatros2 experiment already shows how one could go about numeric array views, using numpy.frombuffer (here) and std::span<T, N> (here). Numpy can already accommodate multi-dimensional data. For C++, there’s std::mdspan (or kokkos::mdspan if the jump to C++23 turns out to be a bit much).
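For instance, here is a minimal C++23 sketch of the kind of multi-dimensional view this enables (the dimensions are illustrative):

```cpp
// Viewing one contiguous BGR8 image buffer as a 3-D array with std::mdspan
// (C++23): no copy, just a pointer plus extents over the same memory, so
// element access stays amenable to vectorization.
#include <cstddef>
#include <cstdint>
#include <mdspan>
#include <vector>

int main() {
  constexpr std::size_t height = 240, width = 320, channels = 3;  // QVGA
  std::vector<std::uint8_t> buffer(height * width * channels);

  std::mdspan image(buffer.data(), height, width, channels);
  image[10, 20, 0] = 255;  // blue channel of pixel (row 10, column 20)
}
```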

Are the C/C++ data structures generated from the type definitions the problem that needs to be addressed?

In part. Getter / setter APIs would have been easier to evolve.

Or is it the need to do zero-copy?

There’s some of this too. Many robotics applications and stacks are confined to a single host. Zero-copy transport can take you a long way in that case, and we know that: intra-process communication in rclcpp is today’s usual answer to performance bottlenecks.

I think zero-copy between processes using the same language is possible using loaned messages and even with the standard rosidl C++ structs if you’re using plain old data.
And if you want to use something like flatbuffers or capnproto or the like, then I think you could do that by adding additional rosidl generators.

That’s all true, but the point I’m trying to make is that those are not really viable. IMHO ROS 2’s biggest advantage is its ecosystem and community. If I switch to a middleware that puts heavy restrictions on the messages I can use (your message has a header with a string frame id? no zero-copy for you), or I simply change the messaging format, I lose that advantage.

It looks like you’ve avoided the need for new rosidl_generator_flatbuffer-like packages in your flatros2 PoC by using some reflection

That’s right. flatros2 actually uses flexbuffers, but this was just a shortcut to keep things “simple”.
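For reference, this is roughly what makes flexbuffers a convenient shortcut: it is schema-less, so no generated code is needed. A minimal sketch with the flexbuffers C++ API (the field names are made up for the example):

```cpp
// Schema-less encoding with flexbuffers: self-describing buffers, no codegen,
// at the cost of storing field names (and some lookup overhead) in the buffer.
#include <cstdint>
#include <vector>
#include "flatbuffers/flexbuffers.h"

int main() {
  flexbuffers::Builder fbb;
  fbb.Map([&]() {
    fbb.UInt("height", 240);
    fbb.UInt("width", 320);
    fbb.String("encoding", "bgr8");
  });
  fbb.Finish();

  // Read back without any schema.
  const std::vector<uint8_t> & buf = fbb.GetBuffer();
  auto map = flexbuffers::GetRoot(buf).AsMap();
  auto height = map["height"].AsUInt32();  // 240
  (void)height;
}
```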

you’d probably want some build-time step

Absolutely. For proper integration I’d use true flatbuffers and compile .fbs schemas adapted from rosidl interface files.

Maybe the answer is just making what we have better?

I think there’s a separate discussion to be had about rclpy.

So, what’s the proposal?

That’s a good question and the toughest one to answer. I’ll preface with some of the premises I’m working with:

  • Messaging improvements are rosidl improvements. There’s just too much code out there that depends on rosidl-generated code to move away from it, and because of that, an out-of-tree alternative will either get marginal attention and go to waste, or show promise and fragment the community.
  • rosidl changes must be backwards compatible (or at least approximately so), for the same reasons laid out above.

So with that in mind, there are two parts to this: (a) the (re)design of the messaging system, and (b) the backwards-compatible implementation and rollout.

For (a), flatros2 hints at one possible design: messages encode the structure of the data, but it’s not until instantiation (or reception) that data materializes. A camera driver doesn’t publish some image, it publishes a QVGA image with BGR8 encoding, so it has all the information it needs to bound what’s unbounded in sensor_msgs/msg/Image. An image viewer doesn’t need to know that; it only needs to know that it is subscribing to a topic with sensor_msgs/msg/Image structure. This notion helps separate message interface (view) from data (buffer), which can be managed by the middleware if enough information makes its way down there.
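In code terms, a minimal sketch of that idea (the ImageSpec type and the layout arithmetic are hypothetical):

```cpp
// "Bounded on write": at instantiation the publisher has everything it needs
// to size the message buffer exactly; subscribers need none of this.
#include <cstddef>
#include <cstdint>

struct ImageSpec {  // hypothetical per-publisher bounds
  std::uint32_t height;
  std::uint32_t width;
  std::uint32_t bytes_per_pixel;
  std::size_t frame_id_capacity;
};

constexpr std::size_t required_buffer_size(const ImageSpec & spec) {
  return 2 * sizeof(std::uint32_t)                  // height, width
       + spec.frame_id_capacity                     // bounded string storage
       + std::size_t{spec.height} * spec.width * spec.bytes_per_pixel;
}

// A QVGA BGR8 camera driver would allocate exactly this much, once:
static_assert(required_buffer_size({240, 320, 3, 64}) == 8 + 64 + 240 * 320 * 3);
```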

For (b) I only have some of the pieces of the puzzle, so bear with me (a hypothetical usage sketch follows the list):

  • Build next-gen C, C++, Python (and Rust?) rosidl generators
    • Messages are views over contiguous buffers
      • Views are ~API compatible with current structs
    • Message instantiation takes additional input (how?)
      • Need to allocate the right buffer size
    • Message reception just wraps the incoming buffer
    • Built and installed alongside current generators
      • #include "sensor_msgs/future/msg/image.hpp"?
  • Build next-gen rosidl type support for middlewares
    • Transport buffer as blob, optionally duplicate specifics (e.g. key fields)
    • Should there be a standard buffer format for middlewares to rely on?
    • Built and installed for next-gen generators only
  • Extend rmw/rcl/rcl* APIs to take additional input for loaning
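And to make the shape of those bullets concrete, a purely hypothetical usage sketch (the future include path, the ImageBounds type, and the bounds-taking borrow_loaned_message() overload are all invented for illustration; none of this exists today):

```cpp
// Hypothetical only: illustrates "messages are views over contiguous buffers"
// plus "message instantiation takes additional input" from the list above.
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/future/msg/image.hpp"  // invented include path

void publish_qvga(rclcpp::Publisher<sensor_msgs::future::msg::Image> & pub)
{
  // The driver knows its bounds up front, so the middleware could loan a
  // buffer of exactly the right size (invented extension of the loaning API).
  sensor_msgs::future::msg::ImageBounds bounds{
    /*height=*/240, /*width=*/320, /*bytes_per_pixel=*/3,
    /*frame_id_capacity=*/64};
  auto loaned = pub.borrow_loaned_message(bounds);
  loaned.get().encoding = "bgr8";  // the view writes into the loaned buffer
  pub.publish(std::move(loaned));
}
```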

And then a long tail of deprecation cycles for “future” to become current.


FWIW I’m not going to solve this in a Discourse post, even if I tried. There are multiple REPs in that last bullet list alone. This is only feasible as a community effort.

I had not seen that REP ❤️. I’ll give it a read.


That’s fair. I’m going to reply to several things, just to add to the memoranda here, but I agree this will need a champion and REPs (or their equivalent).


Hmm, perhaps, but I’m also not totally convinced. For instance, one of the main issues is with how strings and sequences are presented, and if you went with std::string and std::vector (as we did, and keep in mind that things like string_view and span didn’t exist at the time), even method-based access wouldn’t have made changing the underlying storage or ownership easier. At the time we thought it made more sense to use the existing containers in the STL rather than rolling our own, and I think that was probably the right call at the time. Even now, I would try to fit what we want into one of the standard approaches unless that was absolutely impossible.

I don’t disagree, but zero-copy brings with it several constraints/issues that may not be ideal for every situation (maybe even most situations), e.g. dealing with exhausted shared resources (in slow-consumer, fast-producer situations). Also, it’s just my hunch (someone may convince me otherwise), but I do not believe that most robotics frameworks actually use zero-copy, except maybe in real-time control related spaces, and that’s actually fine for most situations. In my opinion, unless you need zero-copy for extreme performance or real-time related issues, making a copy (or serializing and deserializing) is a good solution much of the time.

I can see why you think the current options are not really viable, since they have a lot of restrictions, but I don’t think those restrictions are impossible to overcome in the current system. For example, if we had a new C++ rosidl generator that used std::string_view and/or std::pmr::string, then we could probably allocate space for the strings and sequences from a shared memory segment using a custom allocator. The flexbuffers you’re using are kind of similar. Then if we added a new typesupport package that combined the new C++ generator with each middleware (cyclone/fastdds/zenoh), we could probably support zero-copy with variable-length strings and sequences, which would then allow the standard messages to work. Now, you’d still have to set upper bounds on the available shared resources.
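A minimal sketch of that std::pmr idea (the ImagePmr struct is hypothetical; a static arena stands in for the shared-memory segment a custom allocator would target):

```cpp
// Hypothetical pmr-backed message: strings and sequences allocate from a
// caller-supplied memory resource, which could wrap a shared-memory segment.
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <string>
#include <vector>

struct ImagePmr {  // stand-in for what a new C++ generator might emit
  std::pmr::string frame_id;
  std::pmr::vector<std::uint8_t> data;

  explicit ImagePmr(std::pmr::memory_resource * mr) : frame_id(mr), data(mr) {}
};

int main() {
  // In-process arena standing in for shared memory.
  alignas(std::max_align_t) static std::byte segment[1 << 20];
  std::pmr::monotonic_buffer_resource arena(segment, sizeof(segment));

  ImagePmr msg(&arena);
  msg.frame_id = "camera_optical_frame";  // allocated inside `segment`
  msg.data.resize(320 * 240 * 3);         // ditto
}
```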

Also, related to that, we often talked about a way to set artificial limits on specific unbounded message fields on a per-application basis, e.g. for your real-time controller you could say that the transforms.header.frame_id field from the tf2_msgs/msg/TFMessage message is limited to a string of 1024 characters, such that if some other ROS node sent a longer value for that field, then the message would be discarded by your controller node. I don’t know if it ever got properly described, but I remember we discussed it while working on Zero Copy via Loaned Messages (I think @mjcarroll worked on that one too). With something like that you could potentially support zero-copy as well, since you could at least predict the maximum size of the data structures. But that would still require a new rosidl generator and require both sides to use that data structure, loaned messages, and perhaps other restrictions. And I don’t see us getting to a place where zero-copy is always on for all (or even most) situations between processes on the same host.

In any case, most of what I’ve described above requires a new rosidl generator, which I believe you think (correct me if I’m wrong) means fragmentation in the community, but I don’t think that’s necessarily the case. Now, I can see how this could lead to fragmentation due to some nodes using the features and some not using them (e.g. I want to use a laser scanner driver, but it uses normal messages and so can’t do zero-copy, so I fork it and change it to use the method-based messages and loaned messages). But from a different perspective, we can introduce new rosidl generators, and folks can use them to get access to different features and/or different serialization libraries and still not cause fragmentation, in the sense that existing tools and nodes will work with them, just maybe not as efficiently as would be possible if you were to rewrite the tools/nodes to use the new thing. Meanwhile, PRs to update the tools/nodes to use the new message structs would be lower risk because they would still work with non-updated nodes (e.g. if you were to update rviz to use “future” message structures, nodes that publish the old style will still work with rviz after the change). So, I’m just not convinced that it will cause fragmentation in the ecosystem, but I could be wrong.

This is actually very similar to a related feature that folks at nvidia (@HemalShahNV) have been talking about, which is to have an “rcl tensor”: in part, a version of standard messages that pass around handles to data in the GPU rather than the data itself, ideally done in a way that doesn’t fragment the ecosystem or eliminate the benefits of the approach.

And I think what I’ve been talking about here (add a new rosidl generator, demonstrate its value, maybe eventually make it the default via tick-tock) is in part what you’re talking about for “part (b)” as well. The only thing I’m not sure about is why this new data structure API must be a view onto a contiguous buffer. And if we decide that’s necessary for some reason, then maybe using an existing approach to that, like flatbuffers, might make more sense than defining our own API. There is value in having a ROS-specific API for interacting with data, to protect ourselves from vendor lock-in, but there’s always a cost associated with that abstraction. For our current member-based structures that means copies or (de)serialization directly to our structure, and for a method-based API it might require the same under the hood, or perhaps storing some of the fields twice (once in the original form or serialization buffer and a second time as the requested format in the getter) and keeping them in sync, etc. There’s something really nice about the simplicity of the member-based structure we have now, in that once you have a copy of it, there’s very little magic behind it, making it easier to understand and predict.

This isn’t something we necessarily have control over if we want to use DDS most efficiently. For other middlewares like zenoh this might make sense, but essentially this is what the DDS middlewares are doing themselves anyway (transporting binary data and sending some of it in duplicate or hashed).

Do you mean the in-memory layout of the data structures or the layout of the serialized data? For the latter, we have that, and it’s CDR right now. But that could be changed, because we’ve tried to be careful not to assume CDR anywhere (e.g. we store this information in rosbags).