Clean add/remove robots using full_control

I have a fleet with one robot operating with rmf_ros2 and rmf-web on Humble.
The fleet adapter uses the FullControl API (RobotCommandHandle).
The full chain from RMF to the robot is:
rmf_ros2 → fleet_adapter → fleet_server → fleet_client (old DDS approach for legacy reasons)

The issue is that when a robot is shut down (e.g. fleet_client is brought down), the api_server’s /fleets endpoint keeps the robot in the fleet. The web dashboard keeps showing the stale robot, and its last-updated time keeps ticking along with the current time (which is another bug).

[
  {
    "name": "omnit_ATR2",
    "robots": {
      "SIMULATION": {
        "name": "SIMULATION",
        "status": "idle",
        "task_id": "",
        "unix_millis_time": 1762423260202,
        "location": {
          "map": "L1",
          "x": 31.780725479125977,
          "y": -38.208396911621094,
          "yaw": -1.5786551237106323
        },
        "battery": 0.948580322265625,
        "issues": []
      }
    }
  }
]

I’ve fixed the fleet_server to remove the stale robot from /fleet_states when the fleet_client disconnects or times out.

When a robot is available, it appears in /fleet_states. See below:

  • name: SIMULATION
    model: ATR2
    
    [removed fields]
    sec: 14
    nanosec: 750000000
    x: 31.769018173217773
    y: -37.97697448730469
    yaw: -1.5786962509155273
    obey_approach_speed_limit: false
    approach_speed_limit: 0.0
    [removed fields]
    

When the robot is shut down, it is removed from /fleet_states:

name: omnit_ATR2
robots: []

With the code below, I’m able to remove the robot in the fleet_adapter, but I cannot add it back in again and get a segfault.

connections->fleet_state_sub = node->create_subscription<
  rmf_fleet_msgs::msg::FleetState>(
  rmf_fleet_adapter::FleetStateTopicName,
  rclcpp::SystemDefaultsQoS().keep_last(10),
  [c = std::weak_ptr<Connections>(connections), fleet_name](
    const rmf_fleet_msgs::msg::FleetState::SharedPtr msg)
  {
    if (msg->name != fleet_name) {
      return;
    }

    const auto connections = c.lock();
    if (!connections) {
      return;
    }

    // Track which robots are present in the latest FleetState message
    std::unordered_set<std::string> present;
    present.reserve(msg->robots.size());

    for (const auto & state : msg->robots) {
      present.insert(state.name);
      const auto insertion = connections->robots.insert({state.name, nullptr});
      const bool new_robot = insertion.second;

      if (new_robot) {
        // We have not seen this robot before, so let's add it to the fleet.
        connections->add_robot(fleet_name, state);
      }

      const auto & command = insertion.first->second;
      if (command) {
        // We are ready to command this robot, so let's update its state
        command->update_state(state);
      }
    }

    // Remove any robots that are no longer present in the FleetState.
    // This ensures stale robots are fully deregistered from RMF, including
    // their negotiators, using FleetUpdateHandle::remove_robot(name).
    std::vector<std::string> to_remove;
    {
      auto lock = connections->lock();
      to_remove.reserve(connections->robots.size());

      for (const auto & [name, _] : connections->robots) {
        if (present.find(name) == present.end()) {
          to_remove.push_back(name);
        }
      }
    }

    for (const auto & name : to_remove) {
      RCLCPP_INFO(
        connections->adapter->node()->get_logger(),
        "Robot [%s] absent from latest FleetState for fleet [%s]; removing from RMF",
        name.c_str(), fleet_name.c_str());

      // Best-effort cleanup: clear the participant's itinerary so it is
      // removed from the traffic schedule. Full unregistration is not
      // available in this RMF version.
      {
        auto lock = connections->lock();
        const auto it = connections->robots.find(name);

        if (it != connections->robots.end()) {
          if (const auto updater = it->second ? it->second->get_updater() : nullptr) {
            if (auto participant = updater->unstable().get_participant()) {
              // Clear itinerary to remove routes from the schedule.
              try {
                participant->clear();
              } catch (...) {
                // Ignore errors from clear and continue.
              }
            }
          }
        }
      }
    }
  });

Ran a back-trace and found this:

terminate called after throwing an instance of ‘std::runtime_error’
what(): [rmf_traffic_ros2::schedule::Negotiaton] Attempt to register a duplicate negotiator for participant [3]

Any recommendations for safely removing and re-adding the same robot? I imagine it’s a very common scenario in real-world use cases.

Has this been fixed in an upstream branch?

There isn’t currently a way to permanently remove a robot from a fleet, but there is a commissioning feature to toggle a robot between being “commissioned” and “decommissioned”. Once decommissioned, the robot will no longer receive tasks from the dispatcher, so then the physical robot can be safely shut down or undergo maintenance. Then when it’s ready to be used again, it can be recommissioned and return to operation.

If you specifically need a robot to be completely and permanently removed from the fleet adapter, I’d be open to a PR related to that. The only reason I can think of for needing that is if you frequently introduce new robots with new names to the fleet adapter for a brief time, then remove them and never use those names again. This is a relatively uncommon way to operate, so we haven’t given much consideration to that part of a robot’s lifecycle.

@grey - thanks for the feedback. I couldn’t give this thread much attention, but I’ve come to some conclusions after running detailed tests across all interfaces to confirm my hypothesis.

Everything here uses the latest RMF binaries for Humble.

To clarify, we are able to switch off a robot and bring it back into the RMF traffic system without any problems.

Here’s what happens:

  1. Robot “X” is brought up.
  2. It is discovered by free_fleet_server over robot_state.
  3. free_fleet_server publishes fleet_state to our custom fleet_adapter.
  4. The custom fleet_adapter passes fleet_state to rmf_fleet_adapter.
  5. rmf_fleet_adapter pushes fleet_state to the api_server.

Issues:

  1. If robot X dies or is powered off, the rmf_fleet_adapter keeps publishing stale data over websocket to the api_server.
  2. There are no issues when X joins back: the stale data is updated and everything is OK.
  3. The issue is that while X is dead or shut off, the api_server’s /fleets endpoint keeps reporting that there is still a robot in the fleet.

Consequences:
rmf-web - robot stays visible in rmf-web ( :3000/robots) UI (never goes away)
/fleets endpoint keeps reporting stale data

Custom applications that need the actual fleet state cannot implement robust features.
For example, a user sends a task to the fleet while there’s no robot in it, because /fleets still reports the robot as available.

This is another bug btw - no robot in fleet. task stays queued forever. Cannot cancel/cannot kill.

The decommission API is interesting and could satisfy my purpose, but as I see it this issue is not really about decommissioning.

Experiments:
I implemented methods to remove robots in the free_fleet_server after a timeout, e.g. robot_state not received in X minutes.
Forcing a robot out, as in my earlier report, obviously does not work.
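
For reference, the timeout idea above can be sketched roughly as follows. This is only an illustration of the technique, not the actual change made to free_fleet_server: the class name StaleRobotTracker and its methods are hypothetical, and the current time is passed in explicitly so the logic can be exercised without waiting on a real clock.

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of free_fleet or RMF.
// Remembers when each robot last reported and drops robots that go quiet.
class StaleRobotTracker
{
public:
  using Clock = std::chrono::steady_clock;

  explicit StaleRobotTracker(Clock::duration timeout)
  : _timeout(timeout) {}

  // Call whenever a robot_state message arrives for `name`.
  void heartbeat(const std::string& name, Clock::time_point now)
  {
    _last_seen[name] = now;
  }

  // Erase every robot whose last heartbeat is older than the timeout and
  // return the erased names, sorted alphabetically.
  std::vector<std::string> prune(Clock::time_point now)
  {
    std::vector<std::string> removed;
    for (auto it = _last_seen.begin(); it != _last_seen.end();)
    {
      if (now - it->second > _timeout)
      {
        removed.push_back(it->first);
        it = _last_seen.erase(it);
      }
      else
      {
        ++it;
      }
    }
    std::sort(removed.begin(), removed.end());
    return removed;
  }

  bool is_tracked(const std::string& name) const
  {
    return _last_seen.count(name) > 0;
  }

private:
  Clock::duration _timeout;
  std::unordered_map<std::string, Clock::time_point> _last_seen;
};
```

In this sketch the server would call heartbeat() from its robot_state callback and prune() from a periodic timer, then publish fleet_state with only the robots that are still tracked. A robot that rejoins simply starts sending heartbeats again and reappears.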

Left is websocket with stale data and right is the fleet_state topic

Clarification: I don’t need a robot to be permanently removed, although that case is still interesting because in field deployments a robot may be temporarily or permanently swapped with a replacement.

In my current case it’s the same robot that gets shut down every evening and rejoins the next day, but RMF thinks the robot is still there. This bottlenecks all sorts of user-level features, e.g. letting the user know whether the robot is available.

A possible solution is to modify the rmf_fleet_adapter’s FleetUpdateHandle.cpp

for (const auto& [context, mgr] : task_managers)
{
  const auto& name = context->name();
  nlohmann::json& json = robots[name];
  // ... builds JSON fleet state to API_SERVER
}

There is no remove option here.

Options are:

  1. Adding a remove_robot() method that erases robots from task_managers

  2. Implementing timeout logic to detect inactive robots

  3. Or listening to FleetState messages and removing robots that are no longer present

Are any of these feasible?
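
To make option 3 more concrete, here is a rough, self-contained sketch of the bookkeeping it would need. It is not RMF code: AbsenceDetector is a hypothetical name, and requiring several consecutive misses before reporting a robot as gone is an assumption added here so that a single late or dropped FleetState message does not evict a healthy robot.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of rmf_fleet_adapter.
// Counts how many consecutive FleetState messages each known robot has
// been missing from.
class AbsenceDetector
{
public:
  explicit AbsenceDetector(std::size_t max_missed)
  : _max_missed(max_missed) {}

  // Feed the robot names found in one FleetState message.
  // Returns (sorted) the robots that have now been absent from
  // `max_missed` consecutive messages; those robots are forgotten.
  std::vector<std::string> update(const std::vector<std::string>& reported)
  {
    const std::unordered_set<std::string> present(
      reported.begin(), reported.end());

    // Robots in this message are (re)registered with a clean count.
    for (const auto& name : present)
      _missed[name] = 0;

    std::vector<std::string> expired;
    for (auto it = _missed.begin(); it != _missed.end();)
    {
      const bool absent = present.count(it->first) == 0;
      if (absent && ++it->second >= _max_missed)
      {
        expired.push_back(it->first);
        it = _missed.erase(it);
      }
      else
      {
        ++it;
      }
    }
    std::sort(expired.begin(), expired.end());
    return expired;
  }

  // Number of robots currently being tracked.
  std::size_t known() const
  {
    return _missed.size();
  }

private:
  std::size_t _max_missed;
  std::unordered_map<std::string, std::size_t> _missed;
};
```

The names returned by update() are the candidates for whatever removal or "mark as offline" hook the fleet adapter ends up exposing. The sketch deliberately stops short of that step, since no such hook exists in the code quoted above.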

This is the main purpose of the decommissioning feature: When a robot is decommissioned, the fleet adapter will no longer assign tasks to it until it gets recommissioned.

no robot in fleet. task stays queued forever. Cannot cancel/cannot kill.

I might need you to elaborate on why the task cannot be cancelled. I can’t think of anything that would interfere with task cancellation.

Are any of these feasible?

As far as I can figure the commissioning/decommissioning feature should already be what you need, except maybe we should add another field to the Commission API to toggle how the robot appears within the fleet state. With that you’d get the following settings:

  • Toggle whether the robot can receive new tasks.
  • Have the robot’s existing tasks cancel or be redistributed to other robots in the fleet.
  • Set the robot to Offline in the fleet state. Toggle whether the robot’s location and other properties like battery or issues are included in the fleet state. With this you could have a toggle on your dashboard to hide Offline robots. I don’t want to remove these robots from the fleet state completely because there could be mutex groups or issues lingering, and it would always be good to be able to verify the commissioning settings of an offline robot.

Does that cover your needs?

One thing I’m not completely sure about is whether free_fleet itself should be doing anything to handle robots that are decommissioned. Maybe @Aaron_Chong would have some thoughts on that?

Hi there! Apologies for the late reply. I’ll try to answer and provide more context.

The legacy free_fleet implementation for Humble, the free_fleet_server specifically, has the unfortunate assumption that the robots remain online all the time. Decommissioning robots, or having them go offline, is only properly supported in the current generation, where it is closely tied to the EasyFullControl fleet adapter.

In short, the legacy implementation as it stands is unable to handle the situation you describe, where a robot is taken out of a fleet.

One thing I’m not completely sure about is whether free_fleet itself should be doing anything to handle robots that are decommissioned.

No. With the legacy implementation in mind, after decommissioning a robot and physically shutting it down, the fleet state will continue to be populated with the decommissioned robot’s last state, but the fleet adapter should already be ignoring it for task dispatches.

rmf-web - robot stays visible in rmf-web ( :3000/robots) UI (never goes away)
/fleets endpoint keeps reporting stale data

rmf-web, specifically the api-server, works with a DB directly, so the /fleets endpoint provides the latest information it has received from fleet adapters. That includes the last reported state of the decommissioned robot, although the robot’s state should be reflected as decommissioned.

The robot map is populated with the latest robot location, and hence it will still be present in the map. This is an intended feature, as it can help inform operators where a faulted robot or disconnected robot was last located.

One feature that could be contributed to rmf-web would be to filter out robots from view. We’d be happy to review it if you and your team would like to look into this feature!
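
The filtering itself is simple. As a language-neutral illustration (written in C++ only to match the rest of this thread, since rmf-web itself is not a C++ project), a custom application consuming the fleet state could apply something like the sketch below. RobotSummary and visible_robots are hypothetical names, and the "offline" status string is an assumption about how such robots would be reported.

```cpp
#include <string>
#include <vector>

// Hypothetical, trimmed-down view of one robot entry from the fleet state.
struct RobotSummary
{
  std::string name;
  std::string status;  // e.g. "idle"; "offline" is assumed for hidden robots
};

// Return the robots a UI should display. When `hide_offline` is false the
// input is returned unchanged; otherwise robots whose status is "offline"
// are left out.
std::vector<RobotSummary> visible_robots(
  const std::vector<RobotSummary>& robots,
  bool hide_offline)
{
  std::vector<RobotSummary> shown;
  for (const auto& robot : robots)
  {
    if (!hide_offline || robot.status != "offline")
      shown.push_back(robot);
  }
  return shown;
}
```

Keeping the filter on the display side leaves the underlying fleet state intact, so the last known location of an offline robot remains available to operators who need it.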

This is another bug btw - no robot in fleet. task stays queued forever. Cannot cancel/cannot kill

This is the same as with the /fleets endpoint. By design, the dashboard only ever displays the most recent task state. If a task is killed or the robot performing the task goes offline, the task never receives new updates, so it stays shown as underway and can’t be cancelled or killed. The dashboard will, however, show that a task is stale if its most recent state update was a long time ago.

Are any of these feasible?

I’d say decommissioning the robots is the most straightforward way. There is also a decommissioning endpoint, so you can have a cron script that decommissions robots via the endpoint.