How to Implement End-to-End Tracing in ROS 2 (Nav2) with OpenTelemetry for Pub/Sub Workflows?

lcmasdf · August 4, 2025, 6:44pm

I’m working on implementing end-to-end tracing for robotic behaviors using OpenTelemetry (OTel) in ROS 2. My goal is to trace:

- High-level requests (e.g., “move to location”) across components to analyze latency
- Control commands (e.g., teleop) through the entire pipeline to motors

Current Progress:

- Successfully wrapped ROS 2 Service and Action servers to generate OTel traces
- Basic request/response flows are visible in tracing systems

Challenges with Nav2:

- Nav2 heavily uses pub/sub patterns where traditional instrumentation falls short
- Difficult to maintain context propagation across:
  - Multiple subscribers processing the same message
  - Chained topic processing (output of one node becomes input to another)
  - Asynchronous publisher/subscriber relationships
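
To make the propagation problem concrete: one pattern sometimes used is to carry the trace context inside the message itself, using the W3C Trace Context `traceparent` string that OpenTelemetry propagators emit. Below is a minimal, standard-library-only sketch of that idea. `TracedMessage`, `on_message`, and the node names are hypothetical and are not ROS 2 or OTel APIs; in a real system the field would have to live in a custom message definition and the spans would be created through the OTel SDK.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TracedMessage:
    """Hypothetical message type: payload plus a W3C traceparent string."""
    payload: str
    traceparent: str


def new_traceparent() -> str:
    """Start a new trace: version 00, random trace and span ids, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace id and flags, but generate a fresh span id for the child."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def parse(traceparent: str) -> Tuple[str, str]:
    """Return (trace_id, span_id) from a traceparent string."""
    _version, trace_id, span_id, _flags = traceparent.split("-")
    return trace_id, span_id


def on_message(msg: TracedMessage, node: str, spans: List[Dict[str, str]]) -> str:
    """Hypothetical subscriber callback: record a child span for this node and
    return the context to attach to anything the node publishes in response."""
    trace_id, parent_id = parse(msg.traceparent)
    ctx = child_traceparent(msg.traceparent)
    spans.append({
        "node": node,
        "trace_id": trace_id,
        "span_id": parse(ctx)[1],
        "parent_id": parent_id,
    })
    return ctx


spans: List[Dict[str, str]] = []
goal = TracedMessage("move to location", new_traceparent())

# Fan-out: two subscribers process the same message, giving two sibling spans.
planner_ctx = on_message(goal, "planner", spans)
on_message(goal, "logger", spans)

# Chained topics: the planner republishes with its own span as the parent.
path = TracedMessage("path", planner_ctx)
on_message(path, "controller", spans)
```

In this sketch every span shares one trace id, the two fan-out subscribers appear as siblings under the publisher's span, and the chained subscriber appears as a child of the node that republished.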

Questions:

- Are there established patterns for OTel context propagation in ROS 2 pub/sub systems?
- How should we handle fan-out scenarios (1 publisher → N subscribers)?
- Any Nav2-specific considerations for tracing (e.g., lifecycle nodes, behavior trees)?
- Alternative approaches besides OTel that maintain compatibility with observability tools?

christophebedard · August 4, 2025, 11:26pm

I’ve worked on something very similar using the built-in ROS 2 LTTng tracing instrumentation (on Linux). In case you haven’t seen it yet, take a look at [preprint] Message Flow Analysis with Complex Causal Links for Distributed ROS 2 Systems

That paper doesn’t include any concept of high-level requests like your Nav2/control examples, but it could be added on top, with extra processing, assuming that the user defines their “request.” For instance, the internal tool I mentioned in my ROSCon 2023 talk (slides, video) has a feature that lets users define the start and end of their processing pipeline (e.g., specific publisher to specific subscription), creates graphs for that processing pipeline, and then extracts & plots end-to-end latencies over time.
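
As a rough illustration of the "define the start and end of the pipeline" idea, here is a minimal sketch that pairs start and end events and produces a latency series over time. The `Event` record, the `request_id` field, and the topic names are assumptions made for this example; they are not the ros2_tracing event format, and a real analysis would first have to recover the causal links between messages, which is what the paper covers.

```python
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical, already-correlated trace event."""
    timestamp_ns: int
    kind: str        # "publish" or "receive"
    topic: str
    request_id: int  # assumed to be known; real traces need causal linking


def end_to_end_latencies(
    events: List[Event],
    start: Tuple[str, str],
    end: Tuple[str, str],
) -> List[Tuple[int, int]]:
    """Return (start_timestamp_ns, latency_ns) for each request that went from
    the user-defined start point to the user-defined end point."""
    started: Dict[int, int] = {}
    series: List[Tuple[int, int]] = []
    for ev in sorted(events, key=lambda e: e.timestamp_ns):
        if (ev.kind, ev.topic) == start:
            started[ev.request_id] = ev.timestamp_ns
        elif (ev.kind, ev.topic) == end and ev.request_id in started:
            t0 = started.pop(ev.request_id)
            series.append((t0, ev.timestamp_ns - t0))
    return series


# Illustrative data only: timestamps and topics are made up.
events = [
    Event(1_000, "publish", "/goal_pose", 1),
    Event(1_400, "receive", "/plan", 1),      # intermediate hop, ignored
    Event(2_500, "receive", "/cmd_vel", 1),
    Event(5_000, "publish", "/goal_pose", 2),
    Event(7_200, "receive", "/cmd_vel", 2),
]
latencies = end_to_end_latencies(
    events,
    start=("publish", "/goal_pose"),
    end=("receive", "/cmd_vel"),
)
```

The resulting list of `(start time, latency)` pairs is the kind of series that can then be plotted over time.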

Note that services aren’t supported by the work presented in the paper above, but supporting them would be more or less an extension of the current pub->sub logic, i.e., client->server->client.

As for fan-out (1 publisher → N subscribers), that depends on your use case. What would it mean for a “move to location” request? For example, maybe you need to build the whole pub->sub or request<->reply graph and consider the request to be complete only when the last subscription receives the message, or when the last relevant service reply is received.
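
One way to read the "complete only when the last subscription receives the message" option is sketched below: walk the pub->sub graph from the request, find the endpoints with nothing downstream, and take the latest receive time among them. The graph layout, node names, and timestamps are hypothetical and only illustrate the definition; this is not how the tool mentioned above is implemented.

```python
from typing import Dict, List, Optional


def leaf_endpoints(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Return the endpoints reachable from root that have no downstream links."""
    leaves: List[str] = []
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = graph.get(node, [])
        if not children:
            leaves.append(node)
        stack.extend(children)
    return leaves


def completion_latency_ns(
    graph: Dict[str, List[str]],
    root: str,
    start_ns: int,
    receive_ns: Dict[str, int],
) -> Optional[int]:
    """Latency until the last leaf received its message, or None if any leaf
    has not received one yet (the request is still in flight)."""
    leaves = leaf_endpoints(graph, root)
    if any(leaf not in receive_ns for leaf in leaves):
        return None
    return max(receive_ns[leaf] for leaf in leaves) - start_ns


# Hypothetical fan-out: the goal reaches a planner and a logger, and the
# planner's output then reaches a controller.
graph = {"goal": ["planner", "logger"], "planner": ["controller"]}
receive_ns = {"planner": 1_400, "logger": 1_200, "controller": 2_500}

latency = completion_latency_ns(graph, "goal", start_ns=1_000, receive_ns=receive_ns)
in_flight = completion_latency_ns(
    graph, "goal", start_ns=1_000, receive_ns={"planner": 1_400, "logger": 1_200}
)
```

Here `latency` is driven by the controller, the slowest leaf, while `in_flight` is `None` because the controller has not received anything yet.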
