How to Implement End-to-End Tracing in ROS 2 (Nav2) with OpenTelemetry for Pub/Sub Workflows?

I’m working on implementing end-to-end tracing for robotic behaviors using OpenTelemetry (OTel) in ROS 2. My goal is to trace:

  1. High-level requests (e.g., “move to location”) across components to analyze latency

  2. Control commands (e.g., teleop) through the entire pipeline to motors

Current Progress:

  • Successfully wrapped ROS 2 Service and Action servers to generate OTel traces

  • Basic request/response flows are visible in tracing systems

Challenges with Nav2:

  • Nav2 heavily uses pub/sub patterns where traditional instrumentation falls short

  • Difficult to maintain context propagation across:

    • Multiple subscribers processing the same message

    • Chained topic processing (output of one node becomes input to another)

    • Asynchronous publisher/subscriber relationships

Questions:

  1. Are there established patterns for OTel context propagation in ROS 2 pub/sub systems?

  2. How should we handle fan-out scenarios (1 publisher → N subscribers)?

  3. Any Nav2-specific considerations for tracing (e.g., lifecycle nodes, behavior trees)?

  4. Alternative approaches besides OTel that maintain compatibility with observability tools?

I’ve worked on something very similar using the built-in ROS 2 LTTng tracing instrumentation (on Linux). In case you haven’t seen it yet, take a look at the preprint “Message Flow Analysis with Complex Causal Links for Distributed ROS 2 Systems.”

That paper doesn’t include any concept of high-level requests like your Nav2/control examples, but one could be added on top with extra processing, provided the user defines what counts as a “request.” For instance, the internal tool I mentioned in my ROSCon 2023 talk (slides, video) has a feature that lets users define the start and end of their processing pipeline (e.g., a specific publisher to a specific subscription), creates graphs for that processing pipeline, and then extracts and plots end-to-end latencies over time.
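To make the “define a start and an end, then extract latencies” idea concrete, here is a minimal sketch in plain Python. It is not the internal tool or the paper’s analysis. The `Event` record, the `chain_id` field (standing in for whatever causal link the trace analysis recovers), and the topic names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t_ns: int      # timestamp in nanoseconds
    kind: str      # "pub" (publication) or "sub" (subscription callback)
    topic: str
    chain_id: int  # hypothetical ID linking causally related messages


def end_to_end_latencies(events, start_topic, end_topic):
    """Return {chain_id: latency_ns} from the first publication on
    start_topic to the last subscription callback on end_topic."""
    starts, ends = {}, {}
    for ev in events:
        if ev.kind == "pub" and ev.topic == start_topic:
            starts.setdefault(ev.chain_id, ev.t_ns)  # keep the first start
        elif ev.kind == "sub" and ev.topic == end_topic:
            ends[ev.chain_id] = ev.t_ns              # keep the last end
    # Chains that never reach the end subscription are left out.
    return {cid: ends[cid] - t0 for cid, t0 in starts.items() if cid in ends}


# Illustrative events only; not real trace data.
events = [
    Event(1_000, "pub", "/goal_pose", 1),
    Event(1_400, "sub", "/goal_pose", 1),
    Event(2_100, "pub", "/plan", 1),
    Event(2_600, "sub", "/plan", 1),
    Event(3_000, "pub", "/cmd_vel", 1),
    Event(3_300, "sub", "/cmd_vel", 1),
    Event(5_000, "pub", "/goal_pose", 2),  # chain 2 never reaches /cmd_vel
]
latencies = end_to_end_latencies(events, "/goal_pose", "/cmd_vel")
```

The hard part in practice is recovering the causal link that `chain_id` stands in for. The sketch assumes that step has already been done.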

Note that the work presented in the paper above doesn’t support services, but supporting them would be more or less an extension of the current pub->sub logic, i.e., client->server->client.

That depends on your use case. What would this mean for a “move to location” request? For example, maybe you need to build the whole pub->sub or request<->reply graph and consider the request to be complete only when the last subscription receives the message, or when the last relevant service reply is received.
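As a rough sketch of the “complete only when the last subscription receives the message” definition: build the message-flow graph, find the leaf nodes reachable from the request’s origin, and take the latest reception time among them. The node names, the edge list, and the timestamps below are made up for illustration. In a real system the graph would come from trace data.

```python
from collections import defaultdict


def completion_time_ns(edges, recv_time_ns, root):
    """Latest reception time among leaves reachable from root.

    edges: (parent, child) links in the message-flow graph.
    recv_time_ns: {node: timestamp} for nodes that have received a message.
    Returns None while any reachable leaf has not received its message.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    leaves, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if children[node]:
            stack.extend(children[node])  # fan-out: follow every branch
        else:
            leaves.append(node)           # nothing downstream of this node

    if any(leaf not in recv_time_ns for leaf in leaves):
        return None                       # request still in flight
    return max(recv_time_ns[leaf] for leaf in leaves)


# Hypothetical graph: one goal publication fanning out to several subscriptions.
edges = [
    ("goal_pub", "planner_sub"),
    ("planner_sub", "controller_sub"),
    ("planner_sub", "viz_sub"),
    ("controller_sub", "motor_sub"),
]
recv = {"planner_sub": 1_400, "controller_sub": 2_600,
        "viz_sub": 2_900, "motor_sub": 3_300}
done = completion_time_ns(edges, recv, "goal_pub")
```

If only some branches matter for the request (for example, ignoring a visualization subscriber), the same function works on a graph pruned to the relevant edges.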