Is there a working group for maintaining ROS 2-based robots in industry? 🤖

Hi everyone,

We’re curious — does a dedicated working group (or similar community) already exist for maintaining and operating ROS 2-based robots in industrial environments? If not, maybe it’s time to build one.

At Siemens, our ROS 2 efforts are focused on four key challenges:

  • :rocket: Shipping — How do you bring ROS 2-based systems into real industrial deployments?
  • :bug: Debugging — How do you quickly find bugs when your machine runs not just ROS 2, but also PLCs, HMIs, network switches, safety sensors, and more?
  • :counterclockwise_arrows_button: Updates — How do you keep your software reliably up to date?
  • :magnifying_glass_tilted_left: Fleet health — How do you detect critical bugs locally and across your entire fleet?

We’d love to connect with the community and learn what’s already out there! :globe_showing_europe_africa:

We’re actively looking to engage with others working in this space — whether you’re building solutions, facing the same challenges, or have already found answers we haven’t discovered yet.

Here are some data points we’ve gathered so far:

Exciting tools that just dropped :hammer_and_wrench:

The community has been busy! A few noteworthy new tools:

Big shoutout to @doisyg for sharing impressive insights on how they manage upgrades across a large fleet of robots in the field! :clap:
And I’m sure there is a vast range of further open-source tools out there that can help all of us.

What Siemens has shared so far (all talks in English)

We’ve been open about our own challenges and learnings:

:speech_balloon: Our concrete question to you:

Would you be interested in joining a regular working group to discuss these topics and align our open-source efforts?

Vote below — even a single click tells us a lot! :backhand_index_pointing_down:

  • ← click me, if you are interested.
0 voters

Let’s build in the open — together! :handshake:

We’re strong believers in open collaboration. Whether you’re a researcher, developer, or industry practitioner — let’s align our efforts and avoid reinventing the wheel.

A few things we’d especially love to hear about:

  • :megaphone: Open EU calls related to these topics — always happy to explore funding opportunities and collaborative projects
  • :graduation_cap: Bachelor & Master thesis requests from EU students — if you’re looking for a meaningful, real-world topic in this space, reach out! We’d love to support the next generation of robotics engineers

Cheers from Germany :clinking_beer_mugs:
Florian


Update as of 2026-03-03T23:00:00Z

Let’s try to ballpark when a potential virtual meeting could happen:
(Please also vote if the day does not fit, right now I am more interested in finding the right time of day)

  • 2026-03-10T07:30:00Z
  • 2026-03-10T10:00:00Z
  • 2026-03-10T13:00:00Z
  • 2026-03-10T16:00:00Z
0 voters
7 Likes

I’d definitely be interested in this. I’ve been looking at how to deal with this for my team at my company for a while now (OTTO Motors by Rockwell Automation).

3 Likes

Thanks for starting this thread, @flo , I think it would be great to have a forum for these topics.

In terms of monitoring fleet health, there is a Transitive capability for that, too. Unlike medkit, it works from anywhere in the world, no VPN required, and also shows you the health history, not just the current state, and lets you access all data programmatically.

Built on the open-source Transitive framework: What is Transitive? | Transitive Robotics

1 Like

Hi @flo, yeah, the debugging angle is exactly what pushed us to build ros2_medkit. Once you have PLCs, HMIs, and safety sensors in the mix alongside ROS 2, ros2 topic echo and scattered logs just stop scaling.

What we did was model the robot as a structured entity tree (Area / Component / Function / App) and expose everything over a SOVD-aligned REST API: live topic data, service/action calls, parameter management, fault history. SOVD felt like the right anchor because it’s the same model the rest of the vehicle/machine stack already speaks, so your diagnostics tooling doesn’t need to know it’s talking to ROS 2 under the hood.
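For readers unfamiliar with SOVD-style addressing, the resource layout might look roughly like this. This is an illustrative sketch only: the base URL, port, and entity IDs are made up, and the paths are assumptions based on the general ASAM SOVD resource model, not ros2_medkit’s documented API.

```python
# Hypothetical SOVD-style resource addressing; base URL and IDs are made up.
BASE = "http://robot.local:8080/sovd/v1"

def entity_url(entity_type, entity_id, resource=None):
    """Build the URL for an entity (areas/components/apps/functions) and
    optionally one of its sub-resources (e.g. data, faults, operations,
    configurations)."""
    url = f"{BASE}/{entity_type}/{entity_id}"
    return f"{url}/{resource}" if resource is not None else url

# e.g. live data of a component:
#   entity_url("components", "motor_controller", "data")
#   -> http://robot.local:8080/sovd/v1/components/motor_controller/data
# or the fault list of a function:
#   entity_url("functions", "navigation", "faults")
```

The point of this shape is that a diagnostic client only needs the entity tree and the standard sub-resource names; it never needs to know about topics or nodes behind the gateway.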

For fault management we took the AUTOSAR DEM approach: debounce, two-level filtering, SQLite persistence, SSE for real-time updates. There’s also an MCP server wrapping the gateway for AI tooling, with a plugin system via Python entry_points so teams can bolt on their own diagnostic logic without touching the core. We’ve been using that for fault correlation in CI.

Honest gaps: shipping and fleet-scale updates are on the roadmap, not in yet. Fleet health beyond “query a single robot’s REST API” is still something we’re working through.

Would definitely be up for a working group. This space is too fragmented to solve alone.

@chfritz, thanks for the mention. A few clarifications though — I think you’re comparing different layers here:

“works from anywhere, no VPN” is a deployment topology decision, not a capability gap. The ros2_medkit_gateway exposes a REST API with built-in JWT RBAC. Where you expose it is up to you.

“shows health history” is there. Fault history persisted in SQLite with full lifecycle tracking. For raw timeseries you bring your own backend, we intentionally don’t prescribe the monitoring stack.

“programmatic access” is literally the whole point. SOVD-compliant REST API plus MCP tools for LLMs.

Honestly Transitive looks more complementary than competing. Cloud-centric fleet management + OTA on one side, edge-native diagnostic gateway on the other. Different layers of the same problem.

1 Like

If a robot exposes a REST API, that means it acts as a server and needs to accept incoming connections. That’s an architectural flaw. It doesn’t scale to have a VPN (or tailscale) running on each robot and connecting to each one separately – especially since many will be intermittently offline.

I think that medkit is cool. But rather than building yet another robotics stack to support it, I think you should consider using Transitive for all the communication, deployment, storage, data sync, etc., and just focus on the capability itself (SOVD-inspired health monitoring). If you did, you’d get the benefits of both. Creating new capabilities for Transitive is easy and well documented: Creating Capabilities | Transitive Robotics

This is pretty much what ROS did, too. Rather than everyone building their own middleware (like we used to), it finally created a way for people to collaborate on middleware and let everyone focus on their functionality instead of spending 80% of their time on plumbing.

The real power of open-source comes from the network effect enabled by collaboration!

The server model isn’t an architectural flaw - it’s how SOVD defines the architecture. Same model as UDS over IP, DoIP, and OPC-UA.
Pull-based diagnostics has been the norm in automotive for 30 years. You query a system when you need to diagnose it. For robots behind NAT, a simple HTTP relay suffices - you don’t need a full fleet management platform for that. Fault history is persisted locally anyway, so an intermittently offline robot doesn’t lose diagnostic data. Connectivity concerns are solved at the infrastructure layer, not the diagnostic layer.

On building on top of Transitive - what you’re describing is replacing a vendor-neutral industry standard (SOVD comes from the automotive world - think of it as the diagnostic equivalent of what DDS is for ROS 2 messaging) with a dependency on a specific commercial platform. The ROS analogy cuts the other way - ROS succeeded because it’s fully open, vendor-neutral, and community-governed. SOVD compliance gives ros2_medkit interoperability with any SOVD-capable tool, tester, or cloud backend - including a Transitive bridge if someone wants to build one. That’s the network effect from standards, not from platform consolidation.

1 Like

@chfritz I think there’s a fundamental misunderstanding here about what diagnostics actually is.

Reading /diagnostics, collecting CPU temperature, and syncing it to ClickHouse is health monitoring - the “check engine light” level of the problem. What happens after the light turns on is where diagnostic engineering begins, and it’s an entirely different discipline.

A few examples of what SOVD-compliant diagnostics involves that health monitoring doesn’t touch:

  • DTC lifecycle state machine - A fault is not a boolean. It transitions through PREFAILED → CONFIRMED → PREPASSED → HEALED → CLEARED with counter-based debouncing. A 2ms motor encoder glitch must not trigger the same response as a sustained velocity mismatch. This is the AUTOSAR DEM model that the automotive industry has relied on for safety-critical systems for over two decades.
  • Freeze frames - When a fault is confirmed, the system snapshots all relevant sensor data at that exact moment: motor speed, battery voltage, IMU orientation, joint positions. A technician diagnosing a 3 AM failure needs to see the system state when it happened, not whatever happens to be current.
  • Multi-source deduplication & severity escalation - If both the motor encoder and wheel odometry report the same velocity mismatch, that’s one fault with two reporting sources and automatic severity escalation - not two independent alerts flooding an operator dashboard.
  • Structured entity tree with dependency analysis - Faults organized in an Area → Component → Function → App hierarchy, dynamically constructed from the live ROS 2 computation graph. “If motor_controller fails, which downstream functions are affected?” is a question health monitoring can’t even parse.
  • Rich query semantics - GET /faults?status[confirmedDTC]=1&severity=2 - filtering by DTC status bits, severity levels, time windows, diagnostic scopes.
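The counter-based debounce in the first bullet can be sketched in a few lines. This is a simplified, hypothetical illustration of an AUTOSAR DEM-style lifecycle — class names, thresholds, and the exact transition rules are made up for clarity, not taken from ros2_medkit:

```python
from enum import Enum

class DtcState(Enum):
    PASSED = "PASSED"
    PREFAILED = "PREFAILED"
    CONFIRMED = "CONFIRMED"
    PREPASSED = "PREPASSED"
    HEALED = "HEALED"

class DebouncedDtc:
    """Counter-based debounce: the counter goes up on FAIL reports and down
    on PASS reports; the fault is only CONFIRMED once the counter reaches
    confirm_at, and only HEALED once it reaches heal_at."""

    def __init__(self, confirm_at=3, heal_at=-3):
        self.confirm_at = confirm_at
        self.heal_at = heal_at
        self.counter = 0
        self.state = DtcState.PASSED

    def report(self, failed):
        # Saturating counter: a single glitch nudges it, sustained
        # failures push it to the confirmation threshold.
        step = 1 if failed else -1
        self.counter = max(self.heal_at, min(self.confirm_at, self.counter + step))
        if self.counter >= self.confirm_at:
            self.state = DtcState.CONFIRMED
        elif self.counter <= self.heal_at:
            if self.state in (DtcState.CONFIRMED, DtcState.PREPASSED):
                self.state = DtcState.HEALED
        elif self.counter > 0 and self.state not in (DtcState.CONFIRMED, DtcState.PREPASSED):
            self.state = DtcState.PREFAILED
        elif self.counter < 0 and self.state == DtcState.CONFIRMED:
            self.state = DtcState.PREPASSED
        return self.state
```

With this shape, a 2ms encoder glitch only moves the DTC to PREFAILED and then decays back, while a sustained velocity mismatch walks it all the way to CONFIRMED — which is exactly the distinction the bullet describes.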

SOVD exists because these problems were already solved in automotive - and robotics is now hitting the exact same wall. We didn’t invent this complexity, we’re porting 20+ years of proven diagnostic engineering to ROS 2.

On the “architectural flaw” argument: edge-native diagnostics is a deliberate choice. When a robot’s motor is overheating in a factory with no internet, the fault manager still needs to debounce, confirm, capture freeze frames, and persist faults - locally, with deterministic latency. You cannot run safety-critical fault management through cloud MQTT sync with eventual consistency. The REST API is an access layer, not the engine. Where you expose it is a deployment decision.

On “just build it as a Transitive capability”: the fault manager alone has a DTC state machine, counter-based debounce, two-level filtering, multi-source aggregation, SQLite persistence, and SSE streaming. The SOVD gateway exposes ~40 REST endpoints for faults, entity trees, live data subscriptions, service/action proxying, and parameter management. Suggesting this should be a plugin on top of a pub/sub sync layer is like asking a database engine to be a web framework middleware - it fundamentally misunderstands what the system does and why it owns its own state.

I’d genuinely encourage looking into SOVD and AUTOSAR DEM before drawing equivalences between health monitoring and diagnostics. They’re different layers of the stack solving different problems at different levels of complexity.

1 Like

@Michal_Faferek not sure what you are trying to convince me of. I already encouraged you to build a separate capability from the Health Monitoring one I linked to above, which I wouldn’t do if I felt that our existing capability already did everything medkit provides – it doesn’t.

Ouch! After how many seconds of looking at Transitive did you come to the conclusion that it is just a pub/sub sync layer? Everything you describe would fit well within a Transitive capability, and locally exposing a REST API on the robot is not a problem at all, nor is owning your own state. But like I said, you could also get the benefits of Transitive and avoid some duplication (e.g., SSE streaming, live data subscriptions, and service/action proxying).

I’m in violent agreement! Please understand that Transitive and the specific Transitive capability linked above are two different things, and you seem to be trying hard to compare medkit to both at the same time. I’m instead suggesting that medkit is similar to but different from our Health Monitoring, yet shares many of the same requirements of a robotics full-stack application – like many other functionalities, which is why we built Transitive in the first place: so that people could stop putting web servers on robots. They may be useful during development, but they just don’t scale to fleets.

@chfritz let’s rewind. You came into this thread and wrote “Unlike medkit, it works from anywhere, no VPN required, and also shows you the health history” - a direct comparison implying medkit lacks these features. It doesn’t. We responded with technical clarifications, and you’ve now dismissed them with “not sure what you are trying to convince me of.” You started the comparison - we’re just correcting it.

I took a closer look at Transitive’s Health Monitoring capability to make sure I’m not misjudging it.

Here’s what it does, per your own docs:

  • Reads the ROS /diagnostics topic
  • Collects CPU temperature, GPU temperature (NVIDIA only), and disk space
  • Syncs to ClickHouse in the cloud
  • Shows sparklines in the UI
  • JSON export

That’s it. That’s the capability you compared to ros2_medkit. There is no fault lifecycle management, no debounce logic, no freeze frame capture, no severity model, no entity tree, no structured query API, no local persistence, no deduplication, no standards compliance of any kind. It’s telemetry forwarding with a dashboard - a perfectly fine tool for what it is, but it has literally nothing in common with diagnostic engineering.

You’re still calling what we do “SOVD inspired health monitoring.” SOVD is not a flavor of health monitoring. It’s a diagnostic standard (ASAM SOVD, being standardized as ISO 17978) covering fault lifecycle state machines, freeze frame capture, structured entity modeling, and query semantics.

On the REST API being an “architectural flaw” - you’re repeating the same argument @bburda already addressed. The SOVD spec defines a pull-based diagnostic interface. Same architectural pattern as UDS/DoIP in automotive, used for 30 years. Connectivity (NAT traversal, relays, tunnels) is an infrastructure concern solved at the infrastructure layer. Coupling it into the diagnostic runtime is the actual architectural mistake.

I appreciate the enthusiasm for open-source collaboration, but it starts with understanding what the other project does. Everything I listed - DTC state machines, freeze frames, deduplication, entity trees, severity escalation - went unaddressed in your reply. If you’ve reviewed the SOVD spec and AUTOSAR DEM and still think this fits into a Transitive capability, I’d be genuinely curious to hear how.

How much more do you want? Why do you keep trying to convince me that I’m disagreeing with you? I’m not!

I was inviting you to collaborate to combine the benefits of the functionality of medkit with the “connectivity” (as you called it) benefits of Transitive — the framework for building full-stack capabilities, not the Health Monitoring capability.

That’s because, again, I’m not trying to claim that Health Monitoring is the same as medkit. All I said was that they are related.

As you pointed out yourself, what the Transitive framework provides is largely orthogonal to the core functionality of medkit, so there doesn’t seem to be anything fundamental that would prevent this. But the details are for you to figure out, in case you are interested in making medkit available as a Transitive capability and having it listed here, making it installable with one click. We are collaborating with a number of people on developing Transitive capabilities, because they realize all the deployment and communication benefits it provides and how much it helps with distribution. But we cannot dive into each one ourselves and decide what makes sense. It’s like apps: app authors need to decide whether it makes sense to make their app available on the iOS and Android stores or not.

Robotics companies don’t really want to install five different agents and five different web servers to get five different features. So collaboration along those lines has many benefits.

@chfritz, your last comment made the pitch clear: you’d like ros2_medkit to be “listed in the Transitive marketplace, making it installable with one click,” comparing it to publishing an app on iOS or Android.

I think this reflects a misunderstanding of what diagnostics is in the context of a robotic (or any complex) system. In automotive - where SOVD, DoIP, UDS, Fault Manager, etc. come from - diagnostics is not an application you install on top of an existing stack. It’s foundational infrastructure that the rest of the system depends on. The diagnostic runtime has to be there before OTA can decide whether it’s safe to update, before the operator can understand why a robot stopped, before fault correlation can prevent cascading failures. It’s designed into the system architecture from day one, not added later as a plugin.

Suggesting that this should be a one-click installable capability in a marketplace - next to a web terminal and a file sync tool - is a category error. It’s like suggesting that the ECU diagnostic layer in a car should be an app you download from an app store.

On the “open-source collaboration” framing - Transitive’s model is an open-source framework with a commercial capability marketplace charging $5-$20/robot/month. That’s a legitimate business model, but let’s call it what it is. Proposing that independent open-source projects should rebuild themselves as capabilities for your commercial marketplace is not open collaboration - it’s distribution through your platform.

ros2_medkit is built on SOVD, an open vendor-neutral standard. Anyone - including Transitive - can build a client, a bridge, or a dashboard on top of that API. That’s how standards-based collaboration works: you integrate at the interface, not by absorbing one project into another’s platform.

To be clear - I have no problem with you promoting Transitive in this thread. It’s a relevant project and this is the right forum for it. But when you make incorrect claims about ros2_medkit capabilities or architecture to frame your pitch, I’m going to correct them. That’s all this has been.

2 Likes

What a lively discussion and what great responses this has created so far!
Thanks to all of you who have engaged with this post.

Regarding the time of the meeting, I would like to add another poll that indicates possible time slots. We are from Franconia in Germany, and I would very much like not to fight over this activity with my girlfriend… :person_fencing:

So I’m starting with this time-slot poll in the main post, to get a feel for what would work best.
2026-03-10T07:30:00Z
2026-03-10T10:00:00Z
2026-03-10T13:00:00Z
2026-03-10T16:00:00Z

Regarding the rest of the discussion:

  • we need to establish some work-mode rules: the base “hygienic” maintenance kit should be OSS and functional; paid platforms/businesses on top are highly welcome but should not endanger the base mechanics
  • what is the scope of this, and where does it hand over to other working groups, etc.
  • working with standards (also outside robotics) is highly welcome and should be a strong goal (maybe don’t fixate too hard on SOVD and see what automation or other industries also have to offer, but it’s a very good start, I must say)
  • further stuff, as soon as this gets kicked off and we find people to work on it, like:
  • finding a place to host this activity, and maybe thinking from a UX perspective about putting everything into one meta-package for ease of use

Everything up for discussion, just writing down my personal intentions of today.
Let’s see where this goes and please share this post or my LinkedIn post to other interested parties.

Cheers,
Florian

5 Likes

I would like to throw a related topic into the ring: SCADA

Has anyone ever integrated into a SCADA system?
I did not find examples for this searching around, or, as always, searched for the wrong strings…
Or is the answer just to use OPC-UA?

@flo I would like to participate, but won’t be available on March 10th. Could you perhaps also propose a date a week later?

No, that’s not correct. We have open-source and free capabilities in the store, too, incl. the Health Monitoring one I mentioned, and if you don’t want to distribute your capability via our marketplace, no one will force you (just like Android doesn’t). But I hear you loud and clear, you don’t want to use this (free!) distribution channel to reach more users. I’m sorry I asked.

Then how can it sit on top of ROS? On the robot, Transitive capabilities are typically ROS nodes, and I think, correct me if I’m wrong, medkit is a ROS node, too, no?

@chfritz 4 out of 12 capabilities are free, the rest are $5-$20/robot/month - so the characterization of a commercial marketplace stands. But that’s beside the point.

On “medkit is a ROS node too, so how is it foundational?” - the Linux kernel is an ELF binary on disk, and so is your cat command. Doesn’t make them the same architectural category. What makes something foundational is what depends on it, not what process model it uses.

When you integrate ros2_medkit, your other nodes start actively reporting faults through a client library (FaultReporter) or the diagnostic bridge - severity levels, fault codes, pass/fail events. The fault manager runs those through an AUTOSAR DEM-style debounce state machine (PREFAILED → CONFIRMED → HEALED → CLEARED) with configurable thresholds. Correlation rules detect cascading failures: when an e-stop triggers, motor and drive faults arriving within 2 seconds are automatically muted as symptoms. Black-box recording captures freeze frames and rosbag ring buffers at fault confirmation - it has to be running before the fault happens.
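As a rough illustration of that kind of time-window correlation rule — the class name, fault names, and the 2-second window below are illustrative assumptions, not the project’s actual API:

```python
class CorrelationRule:
    """When a root-cause fault (e.g. an e-stop) fires, related faults that
    arrive shortly afterwards are muted as symptoms rather than raised as
    independent alerts flooding the operator dashboard."""

    def __init__(self, root, symptoms, window_s=2.0):
        self.root = root              # fault name treated as the root cause
        self.symptoms = symptoms      # fault names that may be mere symptoms
        self.window_s = window_s      # muting window after the root fires
        self._root_time = None        # timestamp of the last root occurrence

    def classify(self, fault, t):
        """Return 'root', 'symptom' (muted), or 'independent'."""
        if fault == self.root:
            self._root_time = t
            return "root"
        if (fault in self.symptoms and self._root_time is not None
                and 0.0 <= t - self._root_time <= self.window_s):
            return "symptom"
        return "independent"
```

For example, a rule like `CorrelationRule("e_stop", {"motor_fault", "drive_fault"})` would mute a motor fault arriving 1.5 s after the e-stop, but treat the same fault as independent if it arrives 3 s later.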

The entity tree isn’t a ROS graph mirror either. It’s a semantic model - Areas, Components, Apps, Functions - defined via a manifest that describes what the robot physically is. A “Navigation” function might span 4 nodes across 2 components. Faults, data, operations, and configuration are organized by this domain model, not by topic names. Down the road, this same SOVD entity model extends to MCU-level diagnostics (UDS/DoIP bridges) - the REST API is the same whether the entity behind it is a ROS node or a bare-metal controller.

The point: once your nodes report to the fault manager, once your manifest defines the entity hierarchy, once correlation rules describe your failure modes - diagnostics is wired into the system architecture. You can’t swap it out with a one-click install any more than you can swap out the DEM layer in an ECU.

A capability that reads /diagnostics and forwards metrics to ClickHouse is a leaf node - nothing depends on it, you add or remove it freely. That’s a perfectly valid tool but a different architectural category.

Anyway - I get that you came here to show Transitive, that’s totally fair and this is the right place for it as I already mentioned. But I think we’ve gone back and forth enough and I’d rather not keep hijacking @flo’s working group thread. Let’s focus on the March 10th meeting and what the WG scope should look like.

If you want to dig deeper into ros2_medkit or explore whether there’s a genuine integration point - our Discord is open: selfpatch . Happy to walk you through the architecture there.

2 Likes

This is great, and I am here to support you with whatever you need (getting the message out, finding speakers, logistics, etc.).

A quick point of order: the OSRA has specific nomenclature regarding ROS groups to help keep roles and responsibilities clear. Here is the TL;DR:

  • Working Group: A group chartered by the PMC (8.5.f) that has a specific charter and a definitive lifetime to complete a task (e.g., Client Library Working Group).
  • Special Interest Group: A group chartered by the TGC (7.1.a) that has a specific charter and a definitive lifetime to complete a task (e.g., our new Physical AI Special Interest Group).
  • Community Group: The preferred name for everything else.

From what I’ve read above, it sounds like your initiative might fall into the last category (Community Group), at least initially. Transitioning to one of the first two groups is entirely possible, though there is a lightweight process associated with doing so.

1 Like