RMF Traffic occasionally fails to manage multiple robots headed to the same destination (#162)

Posted by @jkeshav-bvignesh:

Hi Team,
In continuation of the discussion at https://github.com/open-rmf/rmf/discussions/147, I have been observing some strange behaviors in the RMF Traffic module. When multiple robots are instructed to go to the same target destination at the same time, RMF often manages to successfully negotiate the resulting conflicts and complete all the allocated tasks. Robots pause or slow down and move in and out of the target destination.

But occasionally, these negotiations don’t happen as expected. Below is one such scenario where 6 robots (5 delivery robots and 1 tiny robot) were tasked to go to the same waypoint using go_to_place from task_api_requests. RMF has successfully handled this exact scenario multiple times in simulation. This video showcases a series of failures that occurred during one particular run, but I am not sure whether these failures are related.

(The video has been sped up to fit the size restrictions. Please decrease the playback speed, if required)

The major issues that can be observed are:

  1. The Green/Yellow marker (expected location) and Purple marker (actual robot location) go out of sync quite early
  2. Two delivery bots collide with each other. Not sure if this is due to the marker mismatch
  3. The Green/Yellow markers go completely out of sync and disappear towards the end of the video. (I believe the full_control node dies at this point)
  4. One robot can be observed moving off the grid while navigating (at 00:06)

While this is quite rare in simulation, we often observe marker mismatches when using real robots. @dennis-thevara’s discussion https://github.com/open-rmf/rmf/discussions/159 is related to that. It could still be the case that these are two different issues altogether. But coordination between robots ends up failing in both cases. The simulation only uses the components that are part of the RMF Core in a custom environment.

Based on this,

  1. Is our understanding regarding the Markers correct or do they also signify something else?
  2. Is this a cascading failure scenario or are they unrelated?
  3. What could be causing this failure?

On a related note, what does a MirrorManager do?
I also often get these warnings from various nodes. Is it related to this?

1655453323.0849385 [full_control-19] [WARN] [1655453323.084637004] [cleanerBotA_fleet_adapter]: Failed to update using patch for DB version 1538; requesting new update
1655453323.0855885 [full_control-15] [WARN] [1655453323.084657292] [tinyRobot_fleet_adapter]: Failed to update using patch for DB version 1538; requesting new update
1655453323.0860090 [schedule_visualizer-5] [WARN] [1655453323.084798741] [rmf_visualization_schedule_data_node]: Failed to update using patch for DB version 1538; requesting new update
1655453323.0863543 [rmf_traffic_schedule_monitor-2] [WARN] [1655453323.084986126] [rmf_traffic_schedule_backup]: Failed to update using patch for DB version 1538; requesting new update

RMF was built from source using the latest packages pulled on May 25th using vcs import.

Here is the associated log file: RMFTrafficFailure.log

Chosen answer

Answer chosen by @jkeshav-bvignesh at 2022-06-24T06:48:45Z.
Answered by @mxgrey:

As of right now, RMF is not expected to be able to correctly or successfully negotiate traffic for multiple robots that want to reach the same destination at the same time. When multiple robots want to reach the same destination simultaneously, it is no longer a “traffic” issue so much as a logical inconsistency issue because the overall desired end state for the traffic planning problem is invalid. Sometimes the negotiation system does manage to sort things out, but that is mostly a matter of luck and should not be expected or relied on.

Of course there are bound to be cases where the immediate goals of multiple robots require them to reach the same destination, for example if they need to use the same lift or they need to pick up an item at the same pickup point. In our view, this is not a traffic problem so much as a resource allocation problem where the “resource” is the right to access the physical location.

We would like to develop a reservation system that ensures only one robot at a time will try to approach any location on the map as its goal. An issue ticket was opened for this here some time ago. While that ticket only talks about reserving parking spots, we would use the same reservation system to arbitrate the order in which robots get access to a location. We have a slide on the idea here, and an adjacent idea for a queuing system here.

In the general case, every time a robot has a destination that it is trying to reach…

  1. It will send a request to the reservation system asking for the right to go to the location.
  2. The reservation system will check if any other agents have that location reserved (or will have it reserved by the time the requester reaches it) and then let the requester know if it is available, or provide an estimate of when it should be available if it is not currently.
  3. If the location is not yet available, the robot would place an additional request for a nearby parking spot from the reservation system. This request will include a list of parking spots that the robot could use, with each item in the list ranked according to preferability (e.g. how close the parking spot is to the real destination).
  4. Once a parking spot is assigned, the robot will move to that parking spot to wait until its real destination is available. Its request to reserve its real destination remains in the memory of the reservation system, queued up alongside requests from other robots to use that same space.
  5. When the space is available for the robot, the reservation system will issue a signal to indicate that it is now reserved for the robot.
  6. The robot moves from its parking spot (or wherever it happens to be) towards its real destination. At the same time it releases the parking spot that it had reserved for the sake of waiting.
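The steps above can be sketched as a toy reservation service. This is a minimal illustration of the proposed flow; `ReservationSystem` and every name in it are hypothetical, not an existing RMF API:

```python
from collections import deque

class ReservationSystem:
    """Toy model of the proposed reservation flow (steps 1-6 above)."""

    def __init__(self):
        self.holder = {}   # location -> robot currently holding it (or None)
        self.queues = {}   # location -> queue of robots waiting for it

    def request(self, robot, location):
        """Steps 1-2: ask for a location; True means granted immediately."""
        if self.holder.get(location) is None:
            self.holder[location] = robot
            return True
        # Step 4: the request stays queued until the space frees up.
        self.queues.setdefault(location, deque()).append(robot)
        return False

    def request_parking(self, robot, ranked_spots):
        """Step 3: grab the most-preferred parking spot that is free."""
        for spot in ranked_spots:
            if self.holder.get(spot) is None:
                self.holder[spot] = robot
                return spot
        return None

    def release(self, robot, location):
        """Steps 5-6: free a space and signal the next robot in line."""
        assert self.holder.get(location) == robot
        self.holder[location] = None
        queue = self.queues.get(location)
        if queue:
            nxt = queue.popleft()
            self.holder[location] = nxt
            return nxt  # this robot would now be signaled to move in
        return None

# Robot A holds the destination; B queues up, waits at a parking
# spot, and is signaled (step 5) once A releases the destination.
rs = ReservationSystem()
rs.request("A", "dest")                      # granted
rs.request("B", "dest")                      # queued
spot = rs.request_parking("B", ["p1", "p2"]) # waits at "p1"
signaled = rs.release("A", "dest")           # "B" is signaled
```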

For specific cases where a system integrator can anticipate a bottleneck where many robots will want to access the same resource at the same time (e.g. a door, lift, workcell) the system integrator can define a queuing area for the robots to wait in for that specific resource. The above workflow would change at item (3) and the robot would reserve a spot in the queuing area instead of reserving a generic parking spot. If the queuing area is full then the robot would fall back on the general strategy of reserving any nearby parking spot.

While we’ve put a lot of thought into the design of these systems, they are not currently being funded, so unfortunately they are not being actively developed at the moment.

Posted by @mxgrey:

(Same content as the chosen answer above.)

Edited by @mxgrey at 2022-06-23T16:54:27Z


This is the chosen answer.

Posted by @jkeshav-bvignesh:

Thanks for clearing this up! Although you say that it’s a matter of luck, I mostly see successful negotiations, in simulation at least :grinning_face_with_smiling_eyes: . This made me think that the above scenario was a serious failure. I also see that the rate of successful negotiations in these cases tends to increase with the availability of alternate waypoints. So is it correct to say that the negotiation system has a limited capability to handle cases where the immediate goal is the same waypoint?

But I am still not clear about the strange behaviours of the markers here. What is the most probable explanation for the visualisation? What leads to the markers finally moving completely out of sync and disappearing? Was the collision of the two delivery robots related to this?

Posted by @mxgrey:

I also see that the rate of successful negotiations in these cases tends to increase with the availability of alternate waypoints.

That doesn’t surprise me, since it gives more opportunities for the traffic system to fudge a “solution”.

So is it correct to say that the negotiation system has a limited capability to handle the cases where the immediate goal is the same waypoint?

I would take this statement further and say that the negotiation system should never be given situations where multiple robots have the same immediate destination waypoint. The fact that it might find a solution is more of a lucky accident than a feature, even if it can happen frequently. Unfortunately we won’t have a way to prevent multiple robots from having the same destination inside of RMF until the reservation system is developed.

But I am still not clear about the strange behaviours of the markers here. What is the most probable explanation for the visualisation?

When the traffic schedule node is not receiving updates from the fleet adapter, the traffic schedule will naively assume that the robot is moving along according to the prediction that it provided earlier. When the schedule markers move out of sync like this it generally means that the fleet adapter is getting hung up somehow. I think the most likely explanation is that the fleet adapter is furiously trying to find a successful negotiation for an unsolvable scenario and choking up all of its threads for the sake of that effort.

The multi-threaded implementation of the fleet adapter generally allows it to solve traffic situations very quickly without disrupting the rest of its operations, but unfortunately we don’t currently have a way to isolate the thread used by the main event loop of the fleet adapter (which deals with state + traffic updates) from the worker threads used for negotiation. In cases where an impossible negotiation is going on it’s possible that the main event loop thread will get grabbed by the negotiator, blocking the main event loop from running for some period of time.

That should probably be solved by having a dedicated thread pool specifically for the workers and a single separate thread that is entirely dedicated to the main event loop. But in situations where robots aren’t trying to reach the same destination, the negotiations will usually finish quickly enough that we don’t see this issue anyway.
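The fix described here — one thread pinned to the event loop, plus a separate pool for negotiation workers — can be sketched in plain Python. This illustrates the threading pattern only; it is not the actual rmf_fleet_adapter implementation:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Worker pool reserved for (possibly long-running) negotiations.
negotiators = ThreadPoolExecutor(max_workers=4, thread_name_prefix="nego")

# The main event loop owns a dedicated thread, so negotiation work can
# never steal it: state/traffic updates keep flowing even while an
# unsolvable negotiation spins in the pool.
events = queue.Queue()
processed = []

def event_loop():
    while True:
        item = events.get()
        if item is None:          # sentinel: shut the loop down
            break
        processed.append(item)    # e.g. publish a state/traffic update

loop_thread = threading.Thread(target=event_loop, name="event-loop")
loop_thread.start()

def negotiate(scenario):
    # A negotiation may take arbitrarily long; it only ever blocks a
    # pool thread, and posts its result back to the event loop.
    result = f"plan-for-{scenario}"
    events.put(result)
    return result

future = negotiators.submit(negotiate, "conflict-42")
events.put("state-update")   # the loop stays responsive meanwhile

future.result()              # wait for the negotiation to finish
events.put(None)             # stop the demo loop
loop_thread.join()
negotiators.shutdown()
```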

Posted by @jkeshav-bvignesh:

That makes sense.

but unfortunately we don’t currently have a way to isolate the thread used by the main event loop of the fleet adapter (which deals with state + traffic updates) from the worker threads used for negotiation

So is this why the task statuses also don’t indicate a failure? When I was monitoring the task summaries over the websocket, the only states I saw were ‘Standby’, ‘Underway’ or ‘Completed’. I would have expected one of the error states to be returned at this point. RMF does return ‘failed’ in some other scenarios. What are some possible scenarios where different error states are returned?

I did notice that no API responses are published on /task_responses when RMF fails to plan for a task. But I am not sure if this is consistent behaviour.

Posted by @mxgrey:

Failure to negotiate the traffic won’t show up as a task failure, because the negotiation will just be retried a little while later. We could report failed negotiations in the robot state, but that hasn’t been necessary yet since sometimes negotiations fail temporarily for other reasons and will manage to succeed after a retry.

The causes of a failed task status are generally defined per task activity (e.g. go_to_place, pickup, dropoff, clean each have their own conditions for determining if they succeed or fail), but if a task fails during a go_to_place activity then that typically means that one or more lanes in the robot’s navigation graph has been closed, leaving the robot unable to reach its destination. If those lanes were closed during the task bidding process then the fleet adapter would have rejected the task, preventing it from being assigned. That could give a task failure status right after the task request is submitted.
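The lane-closure failure mode described here boils down to a reachability check on the navigation graph. A minimal sketch of that check (not the actual rmf_traffic planner):

```python
from collections import deque

def reachable(lanes, closed, start, goal):
    """BFS over directed lanes, skipping any lane marked closed."""
    adjacency = {}
    for lane in lanes:
        if lane in closed:
            continue
        a, b = lane
        adjacency.setdefault(a, []).append(b)
    seen, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

lanes = [("A", "B"), ("B", "C"), ("A", "C")]
# With every lane into C closed, a go_to_place targeting C would fail
# (or, during bidding, the fleet adapter would reject the task).
print(reachable(lanes, closed={("B", "C"), ("A", "C")},
                start="A", goal="C"))                      # False
print(reachable(lanes, closed={("A", "C")},
                start="A", goal="C"))                      # True
```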

Typically if a task failure happens there should be some task log message explaining why. If you see a situation where a task failure is reported but the task log offers no explanation, please do report that to us so we can make the cause of failure more clear.

Posted by @jkeshav-bvignesh:

Sure. Will do! Thanks for all the clarifications!

I did notice that no API responses are published on /task_responses when RMF fails to plan for a task. But I am not sure if this is consistent behaviour.

Just to clarify, this was regarding the initial task planning. A couple of times there were no ApiResponse / DispatchState messages being sent when RMF could not create a task plan for a new task request, although the logs had task <id> has no submissions during bidding. Dispatching failed, and the task will not be performed. or Unable to generate a valid request for direct task or a similar message. The dispatch_go_to_place demo script from rmf_demos_tasks also relies on a timeout in such situations.

Is this the expected flow? A message response would be better for users to make further decisions, right?

This is mostly the output from /dispatch_states; /task_responses is empty.

$ ros2 topic echo /dispatch_states
active: []
finished: []
---
active: []
finished: []
---
active: []
finished: []
---

Posted by @mxgrey:

A message response would be better for users to make further decisions, right?

Totally agree! This PR gives the task dispatcher the ability to update task states as failed when an auction fails: ws broadcast client in dispatcher node by youliangtan · Pull Request #212 · open-rmf/rmf_ros2 · GitHub

Once that’s merged you should see the desired behavior.

Posted by @jkeshav-bvignesh:

Perfect!

Posted by @mxgrey:

On a related note, what does a MirrorManager do?

The MirrorManager just helps a node to keep its traffic schedule information in sync with the upstream database that’s maintained by the rmf_traffic_schedule node.

I also often get these warnings from various nodes. Is it related to this?

Those warnings should generally be ignored. They tend to print out when there hasn’t been any traffic activity in a few seconds. The warnings were put in to help us notice cases where a connection was lost or the traffic schedule node died, but the warnings are more aggressive than they really need to be. We could remove them entirely and come up with smarter ways to evaluate the health of the connection with the traffic schedule node.
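For context on those warnings, the patch mechanism can be modeled like this: a mirror applies an incremental patch only when the patch is based on the mirror’s current version, and otherwise requests a fresh update. The names here are hypothetical; the real logic lives in rmf_traffic’s MirrorManager:

```python
class Mirror:
    """Toy model of patch-based schedule syncing."""

    def __init__(self):
        self.version = 0
        self.state = {}   # participant -> latest itinerary

    def apply_patch(self, base_version, new_version, changes):
        """Apply a patch only if it is based on our current version."""
        if base_version != self.version:
            # Equivalent of: "Failed to update using patch for DB
            # version N; requesting new update"
            return False
        self.state.update(changes)
        self.version = new_version
        return True

    def apply_full_update(self, version, full_state):
        """Fallback: replace the whole mirror with a fresh snapshot."""
        self.state = dict(full_state)
        self.version = version

m = Mirror()
m.apply_patch(0, 1, {"tinyRobot": "traj-1"})     # in sync: applied
ok = m.apply_patch(5, 6, {"tinyRobot": "traj-6"})  # out of sync: refused
if not ok:
    m.apply_full_update(6, {"tinyRobot": "traj-6"})  # request new update
```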

Posted by @jkeshav-bvignesh:

That’s good to know. The schedule database is also currently an in-memory implementation, right? I did see instructions for running a postgres docker here: GitHub - open-rmf/rmf-web
Is that in any way related to the schedule database, or is it only for RMF Web functionality?

Posted by @mxgrey:

The schedule database is more of a “living” database than a traditional database, which would be a long-term repository for data. So yes, you’re exactly right: it is entirely in memory. We don’t make any effort to save it to disk for two reasons:

  1. Changes to the database are so frequent that the write-to-disk operations would be exorbitantly expensive
  2. If the schedule node ever dies or resets, there is a failover mechanism that has all the fleet adapters send their latest predictions to the new schedule node. There is no value in restoring that information from disk because any traffic predictions that are still relevant will be resent by this mechanism. The real authorities on the traffic schedule data are the distributed fleet adapters; the traffic schedule node just collects and redistributes that information.
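Point (2) can be illustrated with a toy model in which the adapters hold the authoritative itineraries and a replacement schedule node rebuilds itself by asking them to resend. Purely illustrative names, not RMF classes:

```python
class FleetAdapter:
    """Each adapter owns the authoritative copy of its itinerary."""

    def __init__(self, name, itinerary):
        self.name = name
        self.itinerary = itinerary

class ScheduleNode:
    """In-memory only: rebuilt from the adapters, never from disk."""

    def __init__(self, adapters):
        self.db = {}
        # On startup (including failover after the previous node died),
        # ask every adapter to resend its latest predictions.
        for adapter in adapters:
            self.db[adapter.name] = adapter.itinerary

adapters = [FleetAdapter("tinyRobot", ["wp1", "wp2"]),
            FleetAdapter("deliveryBot", ["wp3"])]

node = ScheduleNode(adapters)   # original schedule node
del node                        # node "dies"; nothing was persisted
node = ScheduleNode(adapters)   # failover node recovers the full state
```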

The postgres database is related to logging task states and robot states.

Posted by @[Missing data]:

Is there any update about this issue?
I have two robots, and I use go_to_place to request both robots to go to the same destination. I observed that the conflict was eventually resolved after several retries, but the result of the test was that the two robots collided with each other.
I wonder whether the scenario I observed is the expected behavior of RMF.

Posted by @[Missing data]:

And I have another question:
When I dispatch a task to a specific robot, can I invoke RobotUpdateHandle.submit_direct_request directly?
I found no difference when I submit a task in the following two ways:

  1. invoking RobotUpdateHandle.submit_direct_request directly
  2. publishing an ApiRequest message to the topic task_api_requests

Could you please explain the difference between these two approaches?
Thank you very much!!!

Posted by @mxgrey:

RobotUpdateHandle.submit_direct_request is equivalent to using the direct request API. Direct requests are different from dispatch requests:

  • Dispatch request: let RMF choose the optimal robot to complete a task
  • Direct request: specify which robot should complete a task

Both direct and dispatch requests can be sent from the dashboard or the command line. Direct requests can additionally be produced using RobotUpdateHandle.submit_direct_request.
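As an illustration of the two request types, here is a sketch of the JSON payloads that would go into the json_msg field of an ApiRequest on task_api_requests. The type strings and field names follow the rmf_api_msgs task schemas as I understand them; treat them as assumptions and check those schemas before relying on this:

```python
import json

def dispatch_request(category, description):
    """Dispatch request: RMF chooses the optimal robot for the task."""
    return {
        "type": "dispatch_task_request",
        "request": {"category": category, "description": description},
    }

def direct_request(fleet, robot, category, description):
    """Direct request: the caller pins the task to a specific robot."""
    return {
        "type": "robot_task_request",
        "fleet": fleet,
        "robot": robot,
        "request": {"category": category, "description": description},
    }

# Either payload would be serialized into the json_msg field of an
# ApiRequest message and published on task_api_requests.
payload = direct_request("deliveryFleet", "deliveryBot1", "patrol",
                         {"places": ["pantry"], "rounds": 1})
print(json.dumps(payload, indent=2))
```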

Posted by @[Missing data]:

Two nodes subscribe to the topic task_api_requests: one is the fleet adapter node, the other is the rmf_dispatcher_node.
When a request arrives on task_api_requests, what does each node do?

Posted by @mxgrey:

The fleet adapter node listens so that it notices direct task requests. The rmf_dispatcher_node listens so that it knows when to dispatch a task request.

Both types of requests are sent over the task_api_requests topic, but each type needs to be handled by a different node.
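That split can be sketched as two subscribers on one topic, each acting only on the request type it owns. These are plain-Python stand-ins for the ROS 2 nodes, with the type strings assumed from the task API schemas:

```python
import json

class FleetAdapterNode:
    """Acts only on direct (robot_task_request) messages."""

    def __init__(self):
        self.accepted = []

    def on_request(self, msg):
        payload = json.loads(msg)
        if payload.get("type") == "robot_task_request":
            self.accepted.append(payload)

class DispatcherNode:
    """Acts only on dispatch_task_request messages (runs the auction)."""

    def __init__(self):
        self.auctions = []

    def on_request(self, msg):
        payload = json.loads(msg)
        if payload.get("type") == "dispatch_task_request":
            self.auctions.append(payload)

# Both "nodes" receive every message on task_api_requests...
adapter, dispatcher = FleetAdapterNode(), DispatcherNode()
subscribers = [adapter.on_request, dispatcher.on_request]

for msg in (json.dumps({"type": "robot_task_request", "robot": "r1"}),
            json.dumps({"type": "dispatch_task_request"})):
    for callback in subscribers:   # ...but each handles only its own type
        callback(msg)
```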


Edited by @mxgrey at 2023-01-19T03:41:20Z