Design discussion: “discoverable not serviceable” ROS interfaces for lifecycle nodes

Hi all,

We’d like to start a design discussion about a lifecycle-related gap that shows up in Nav2 but seems fundamentally ROS 2 / rclcpp-level.

Context / links

What we want (high-level)

For lifecycle nodes, it’s very useful to have all interfaces created and visible while the node is Inactive so tooling and other nodes can introspect what exists (topics/services/actions/etc.). But while Inactive, those interfaces should be not serviceable — i.e., they should not actually “do work” or execute callbacks until the node transitions to Active.

This is straightforward for some interface types:

  • Publishers / clients can exist early and simply refuse to do work when asked (check a flag → return / no-op / error).

But it’s harder for:

  • Subscriptions and services, because once they’re created they can become serviceable immediately (executor dispatch / callbacks), unless there is a mechanism to say “no” before dispatch and (importantly) before consuming from the middleware queue.
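The "check a flag" pattern mentioned above for publishers can be sketched roughly as follows. This is a standard-library-only model, not rclcpp code: `GatedPublisher` and its `sink` callback are hypothetical names introduced here for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a lifecycle-aware publisher; not an rclcpp type.
class GatedPublisher {
public:
  // `sink` models handing the message to the middleware.
  explicit GatedPublisher(std::function<void(const std::string&)> sink)
  : sink_(std::move(sink)) {
    assert(sink_ && "sink must be callable");
  }

  void activate() { active_.store(true); }
  void deactivate() { active_.store(false); }

  // Returns true if the message was handed to the sink, false if refused.
  bool publish(const std::string& msg) {
    if (!active_.load()) {
      return false;  // Inactive: refuse to do work (no-op).
    }
    sink_(msg);
    return true;
  }

private:
  std::function<void(const std::string&)> sink_;
  std::atomic<bool> active_{false};  // Entities start out inactive.
};
```

Whether a refused call should be a silent no-op, an error return, or an exception is a policy choice; the sketch picks a boolean return only to keep it short.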

The core gap

We want an interface that is:

  • Discoverable (exists on the graph / matched / visible to introspection) in Inactive
  • Not serviceable (no callbacks executed / no work performed) until the node is Active

The tricky part is that “not serviceable” should ideally mean more than “drop in the callback”:

  • If a subscription still takes samples from the RMW queue while Inactive (even if we drop them afterward), we may permanently lose data that is not re-sent later (esp. for Transient Local / latched-like topics).
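To make the difference concrete, here is a small model contrasting "take and drop in the callback" with "do not take while inactive". `SampleQueue` is a hypothetical stand-in for the middleware-side queue, and both functions are illustrative names, not RMW or rclcpp APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of a middleware-side sample queue; not an RMW API.
using SampleQueue = std::deque<std::string>;

// Policy 1: take every sample while inactive and discard it afterward.
// The queue ends up empty, so nothing is left to process after activation.
std::size_t take_and_drop(SampleQueue& queue) {
  std::size_t dropped = 0;
  while (!queue.empty()) {
    queue.pop_front();
    ++dropped;
  }
  assert(queue.empty());
  return dropped;
}

// Policy 2: the gate sits before the take, so an inactive node leaves the
// queue untouched and the samples are still there once it becomes active.
std::vector<std::string> drain_when_active(SampleQueue& queue, bool active) {
  std::vector<std::string> delivered;
  if (!active) {
    return delivered;  // Nothing taken, nothing lost.
  }
  while (!queue.empty()) {
    delivered.push_back(queue.front());
    queue.pop_front();
  }
  return delivered;
}
```

With a single sent-once sample such as a map, policy 1 leaves nothing to deliver on activation, while policy 2 delivers it as soon as the gate opens.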

LifecycleEntity state model (concise proposal)

This is a rough state/semantics model that seems to match user expectations, plus feasibility notes.

Unconfigured (not configured yet)

  • Ideal semantics: not discoverable (should not appear in ros2 topic list / ros2 service list).
  • Reality today: hard to guarantee, because creating an rclcpp interface usually creates the underlying RMW entity (e.g., DDS DataReader/DataWriter), which becomes discoverable immediately.
  • Implication: “unconfigured == not discoverable” may require future rclcpp↔rmw interfaces/state control, so it’s likely a longer-term enhancement.

Inactive (discoverable but not serviceable) — the missing piece

  • Desired semantics: discoverable/matched on the graph, but no work is performed.
  • For subscriptions, “not serviceable” should ideally include not consuming from the RMW queue while inactive (so data is preserved per QoS and can be processed after activation).
  • Possible approaches:
    1. rclcpp-level gating before take/dispatch: when data-ready triggers, skip take() and callback dispatch until Active.
    2. rmw-level paused state: entity is discoverable but rmw does not deliver/allow take until Active.

Active (normal behavior)

  • Take from the RMW queue and dispatch callbacks as usual.
  • On activation, there may already be queued samples (QoS depth/history/transient-local), so activation may result in immediate processing of multiple callbacks.
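The activation burst can be illustrated with a keep-last buffer of fixed depth. This is only a model under the assumption that the middleware retains up to `depth` samples while nothing is taken; `KeepLastBuffer` is a hypothetical name, and real QoS behavior depends on the RMW implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical keep-last history buffer standing in for the middleware queue.
class KeepLastBuffer {
public:
  explicit KeepLastBuffer(std::size_t depth) : depth_(depth) {
    assert(depth_ > 0 && "history depth must be positive");
  }

  // Called when a sample arrives, regardless of lifecycle state.
  // When the buffer is full, the oldest sample is evicted.
  void push(int sample) {
    if (samples_.size() == depth_) {
      samples_.pop_front();
    }
    samples_.push_back(sample);
  }

  // Called on activation: dispatches every retained sample in arrival order
  // and returns how many callbacks were invoked.
  std::size_t drain(const std::function<void(int)>& callback) {
    std::size_t dispatched = 0;
    while (!samples_.empty()) {
      callback(samples_.front());
      samples_.pop_front();
      ++dispatched;
    }
    return dispatched;
  }

private:
  std::size_t depth_;
  std::deque<int> samples_;
};
```

In this model the size of the burst on activation is bounded by the history depth, so a node that was inactive for a long time still sees at most `depth` callbacks, all carrying the most recent samples.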

Looking for feedback / prior art

We have a preliminary plan around the approaches above (especially “don’t take while inactive” vs an rmw-level paused state), and would really appreciate suggestions or references to prior work.

In particular:

  • Do DDS implementations (or Zenoh / rmw_zenoh) already have an internal notion of managed states for endpoints (discoverable but paused / not serviced) that could be mapped cleanly onto ROS 2 lifecycle entities?
  • If not, where do folks think the cleanest abstraction should live (rclcpp vs rmw), and what pitfalls should we watch for?

Thanks in advance for any design suggestions or pointers to existing discussions/issues.

I get the feeling that what you really want here is a predefined model of your node.

E.g., you want a guarantee that your node will only provide these services/topics and only consume this or that data. Armed with model data, you can do a lot of nice things, like checking whether you have unconnected consumers or computing the startup order of your nodes.
A while back someone posted a model-to-node generation approach; I think it was someone from Fraunhofer (but my memory is hazy there…).

On a more practical level: we ran into the same problem and solved it for our own lifecycle system like this:
Every publisher, subscriber, etc. is instantiated during configure. We use specially subclassed entities (publishers, subscribers, and services).

If something in the business logic is publishing or calling a service while not active, an exception is thrown, as it is a bug.

All data received before the move to active is discarded. Note that if the connection is transient local, we buffer the last message and dispatch it directly after the move to active. Service calls are rejected if not active.

All entities are registered at the lifecycle and are automatically ‘armed’ and ‘disarmed’ during the transitions from inactive to active and back.
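A rough sketch of the arm/disarm scheme described above, for the subscriber side only. All names here (`ManagedEntity`, `ManagedSubscriber`, `LifecycleRegistry`) are hypothetical and standard-library-only; this is not the poster's actual code, just one way the described behavior could look.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface for anything the lifecycle can arm or disarm.
class ManagedEntity {
public:
  virtual ~ManagedEntity() = default;
  virtual void arm() = 0;
  virtual void disarm() = 0;
};

// Hypothetical subscriber wrapper: discards data while disarmed, except that
// a transient-local subscriber keeps the latest message for replay on arm().
class ManagedSubscriber : public ManagedEntity {
public:
  ManagedSubscriber(bool transient_local,
                    std::function<void(const std::string&)> callback)
  : transient_local_(transient_local), callback_(std::move(callback)) {
    assert(callback_ && "callback must be callable");
  }

  // Models a message arriving from the middleware.
  void on_message(const std::string& msg) {
    if (armed_) {
      callback_(msg);
    } else if (transient_local_) {
      latest_ = msg;  // Keep only the newest message.
      has_latest_ = true;
    }
    // Otherwise: disarmed and not transient local, so the message is dropped.
  }

  void arm() override {
    armed_ = true;
    if (has_latest_) {
      has_latest_ = false;
      callback_(latest_);  // Replay the buffered message right after arming.
    }
  }

  void disarm() override { armed_ = false; }

private:
  bool transient_local_;
  std::function<void(const std::string&)> callback_;
  bool armed_ = false;
  bool has_latest_ = false;
  std::string latest_;
};

// Hypothetical registry driven by the activate/deactivate transitions.
// It stores non-owning pointers, so entities must outlive the registry.
class LifecycleRegistry {
public:
  void add(ManagedEntity& entity) { entities_.push_back(&entity); }
  void on_activate() { for (ManagedEntity* e : entities_) e->arm(); }
  void on_deactivate() { for (ManagedEntity* e : entities_) e->disarm(); }

private:
  std::vector<ManagedEntity*> entities_;
};
```

Note that in this scheme the data is still taken from the middleware while disarmed; the wrapper decides what to keep, which is why the transient-local case needs its own buffer.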

All of this is a high-level concept, and I don’t see it as related to the rmw layer at all. Also note that, depending on your application, you will most likely run into corner cases where you will need to break the rules and allow processing of some data in any state.

Couldn’t this be technically solved by content-filtered topics? I’m not sure if it’s possible to update the content filter during runtime, though, and I’m also not sure how it behaves with latched topics…

Even if it could, this is a general-purpose problem that needs a solution IMO. I think what we’re asking for here is a version of create_subscription with a subscription option / bool to indicate that the subscription should be discoverable but not yet accepting anything from the waitset or equivalent. I think the need for lifecycle-enabled subscriptions / services warrants this.

The reason Actions don’t need this is that Actions can simply reject a goal request and tell the client it was outright rejected. Services* and Subscriptions don’t have that ability: they simply process data and can’t tell the client/publisher that they were inactive and to try again later, or ask ROS 2 to resend a transient local topic (i.e. map) once ready. We end up dropping sent-once data as a result. Timers, I imagine, have the same issue.

Callback-triggering interfaces need (I think) some way of being created on the middleware without yet being processed, if we want lifecycle-driven interfaces that meet the Lifecycle/Managed Node Design intent of having the allocations happen in the Configure transition. That is the cleanest solution that I can see, and it creates an important capability with additional use cases. Otherwise, we would have to create all subscriptions/services in the Activate transition, which I don’t think anyone wants.

* (Generically; if you add a ‘success’ bool in the response, sure, but then you’d have to add error codes galore to know whether it’s a server error or an activation error, etc.)

I think you can implement what you want with a custom waitable.

Just don’t process the data from your inner entities as long as the node is inactive, and what you will get is:

  • Discoverable entities
  • All data after creation is buffered in the rmw layer

Is this the behavior you want to achieve? Or did I miss something?
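As a rough illustration of the waitable idea, here is a simplified model in which an executor asks `is_ready()` and only then calls `execute()`. The class and its signatures are hypothetical and deliberately simplified; they are not those of rclcpp::Waitable, and `enqueue()` merely stands in for the rmw layer buffering data.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical, simplified waitable; signatures are illustrative only.
class LifecycleGatedWaitable {
public:
  explicit LifecycleGatedWaitable(std::function<void(const std::string&)> callback)
  : callback_(std::move(callback)) {
    assert(callback_ && "callback must be callable");
  }

  void set_active(bool active) { active_ = active; }

  // Stand-in for the middleware buffering a sample for the inner entity.
  void enqueue(const std::string& msg) { pending_.push_back(msg); }

  // Reports "not ready" while inactive, so nothing is taken or dispatched.
  bool is_ready() const { return active_ && !pending_.empty(); }

  // Takes one sample and dispatches it; call only when is_ready() is true.
  void execute() {
    std::string msg = std::move(pending_.front());
    pending_.pop_front();
    callback_(msg);
  }

  std::size_t pending() const { return pending_.size(); }

private:
  std::function<void(const std::string&)> callback_;
  std::deque<std::string> pending_;
  bool active_ = false;
};

// Hypothetical executor pass: services the waitable until it is not ready.
inline std::size_t spin_some(LifecycleGatedWaitable& waitable) {
  std::size_t executed = 0;
  while (waitable.is_ready()) {
    waitable.execute();
    ++executed;
  }
  return executed;
}
```

One thing that would probably need checking in a real implementation: if the inner entity stays in the wait set with unread data, the wait may return immediately every time, so the gating might also have to keep the entity out of the wait set while inactive.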