Robot enters lift without session when its ongoing task was cancelled and assigned with another task (#549)

Posted by @cwrx777:

open-rmf release: iron 20231229

Issue summary: robot does enters lift without having lift session if while waiting for lift, robot current task is cancelled and assigned with another task.

Building yaml: hotel.building.yaml
Difference with original building yaml:

  1. remove lane to Lift2. This is so that only Lift1 is used.
  2. use unidirectional lanes to enter and exit Lift1

Robot starting position is set by using spawn_entity.launch.xml and robotplacer.launch.xml
Task is dispatched by using this modified dispatch_patrol.py

https://github.com/user-attachments/assets/abbb57fa-8633-46f7-a9fd-32a63f633ed5

Initial position:
tinyBot_1 is at Lift_LWP_L1 and tinyBot_2 is at Lift1_LWP_L2.
First, send tinyBot_2 to a location at L1.

ros2 run rmf_demos_tasks dispatch_patrol -F tinyRobot -R tinyBot_2 -p restaurant --use_sim_time -id task_1;

Then, a moment later, send tinyBot_1 to a location on L2. This is to make tinyBot_1 to keep requesting for the lift.

sleep 5;
ros2 run rmf_demos_tasks dispatch_patrol -F tinyRobot -R tinyBot_1 -p L2_room15 --use_sim_time -id task_2;

(refer to video at 00:25), when tinyBot_1 reach L1 and lift door start to open, cancel tinyBot_2’s task and send it to another location on L1.

ros2 run rmf_demos_tasks cancel_task -id task_2;
ros2 run rmf_demos_tasks dispatch_patrol -F tinyRobot -R tinyBot_1 -p restaurant --use_sim_time -id task_3;

Observation:
tinyBot_2 enters lift while tinyBot_1 is still inside.


Edited by @cwrx777 at 2024-10-25T16:17:37Z

Chosen answer

Answer chosen by @cwrx777 at 2024-11-18T00:26:49Z.
Answered by @mxgrey:

The bug has been found and fixed here: Fix lift disruption issue by mxgrey · Pull Request #393 · open-rmf/rmf_ros2 · GitHub

There are exactly two ways that the bug can occur:

  1. A task gets cancelled while the robot is waiting to use a lift, and a new task is assigned (such as the finishing task) which tells the robot to use the same lift right away.
  2. A negotiation or replan happens while the robot is waiting to use a lift, and the new plan tells the robot to use the same lift right away

I believe all the bugged cases that you’ve discovered match one of those categories, so the fix should cover everything.

That said, I would still recommend making sure that robots can always enter and exit lifts without a risk of conflicts occurring (e.g. by using mutex groups) since negotiating may still cause significant disruption during the somewhat brittle process of getting in and out of lifts.

Posted by @mxgrey:

Thanks for the detailed report. I’ve opened an issue ticket for this and we’ll try to reproduce and then debug it.

It looks like you’re using a release from December of last year, and several race conditions related to lifts and task cancellation have been put in since then, so I’ll start by doing an A/B test of the scenario with the current main branch vs the one you mentioned.

Posted by @mxgrey:

@cwrx777 I believe this behavior should already be fixed on the main branch by Add a timeout before automatically releasing lift by mxgrey · Pull Request #369 · open-rmf/rmf_ros2 · GitHub

I’ve opened Backport lift release fixes to Iron by mxgrey · Pull Request #390 · open-rmf/rmf_ros2 · GitHub to backport the behavior fixes to the iron branch. Could you try out that branch and let me know if it fixes the problem for you?

Posted by @cwrx777:

@mxgrey
The branch that I used for testing is actually this, which already include the mentioned fix.

Posted by @cwrx777:

I tried with the main branch for except for rmf_simulation, rmf_traffic_editor and rmf_demos as I’m still using Ubuntu22.04 and I encountered the same issue.

I also did another test as shown in the video below.

Initial positions:
tinyBot_1 is at Lift1_LWP_L1 and tinyBot_2 is at Lift1_LWP_L2.

First, send tinyBot_2 to L1 (00:10)

ros2 run rmf_demos_tasks dispatch_patrol -F tinyRobot -R tinyBot_2 -p restaurant --use_sim_time -id task_1;

Then when the lift is moving up, send tinyBot_1 to L2 (00:20)

ros2 run rmf_demos_tasks dispatch_patrol -F tinyRobot -R tinyBot_1 -p L2_room15 --use_sim_time -id task_2;

observation:
when the lift is moving to L1, tinyBot_1 starts moving into the lift. (00:32)

another observation is that the behaviour is affected by the distance between these two waypoints.


In the video below, the distance between the two waypoints on L1 and L2 are further.


Edited by @cwrx777 at 2024-10-30T16:12:48Z

Posted by @mxgrey:

I’ve managed to recreate the bad behavior. I believe it may be an edge case that only happens when the robot begins its task at a waypoint that is immediately in front of a lift.

To validate this, have you ever seen the behavior be bad when the robot starts from any waypoint besides one of the Lift1_LWP_[#] waypoints? So far in my testing I haven’t seen that happen.

If I’m right then I should be able to fix this with some start point filtering. Since tomorrow is a holiday in Singapore, I’ll try to have a fix ready for testing on Friday.

Posted by @cwrx777:

@mxgrey


I tried by placing tinyBot_1 at waypoint A and send it to L2 a moment after tinyBot_2 is tasked to go to L1. tinyBot_1 then moves to Lift1_LWP_L1 to wait for the lift.
after tinyBot_1’s current is was cancelled and the 2nd task is sent to the robot. tinyBot_1 retreats to point A to let tinyBot_2 to exit lift.

https://github.com/user-attachments/assets/a0eda65b-40bd-4490-96df-1db34395cab4

I think the issue is related how waypoint inside lift is used during deconflict. in the video with increased distance between the EXIT and LWP waypoints, as there is enough space, deconflict was not triggered.

Posted by @mxgrey:

Even without negotiation, it is not supposed to be possible for the robot to try to enter the lift without having the lift session locked in. The lift session is a totally separate mechanism from the traffic negotiation.

In my experiment I saw that the robot tried to drive into the lift without having the lift session locked, and this was the key problem. You can check this by running ros2 topic echo lift_states and looking at session_id. If the robot is trying to enter the lift when the session_id is not its name or when the lift is not at the correct floor with the doors open then the robot is misbehaving.

This problem was able to happen whether or not another robot was already in the lift, but it only happens when the robot is starting from the waypoint that immediately precedes the lift. Notice how in your latest video the robot is starting two waypoints away from the lift and then subsequently does not try to enter the lift until it has the lift session locked in.

Posted by @cwrx777:

Hi @mxgrey,
Do you manage to come up with the fix?

Posted by @mxgrey:

I’ve made enough progress to figure out that the problem isn’t what I thought it was.

It turns out that the plans being generated are correct and include an event to begin a lift session. But for some reason that event is being completely skipped while the plan is being executed. I suspect there’s a subtle bug when ExecutePlan generates the event sequence from the plan, but I haven’t fully confirmed this yet. Since I haven’t seen this kind of problem happen before I don’t have a good estimate for when I’ll have a solution.

In the meantime, to make sure this doesn’t happen I recommend taking two measures:

  • Use a single mutex group across all lanes and vertices inside a lift (on all floors) as well as outside of the lift (on all floors) up to the point where you want the robot to wait for the lift as well as up to the point where a robot exiting the lift will no longer be in the way of a robot that wants to enter.
  • Make sure you arrange these waypoints and lanes in a way that the robots will not have conflicts with each other or get deadlocked by the mutex.

It’s extra work but I strongly recommend doing this for any deployments to ensure that lift traffic is infallible. Otherwise you can experience very significant disruptions whenever conflicts happen near lifts. Even when I fix the bug that’s currently happening, it will be beneficial for you to have a rigorous mutex group with careful traffic flow around each lift.

Posted by @cwrx777:

@mxgrey
Thank you for taking the time to investigate this issue.

Posted by @mxgrey:

The bug has been found and fixed here: Fix lift disruption issue by mxgrey · Pull Request #393 · open-rmf/rmf_ros2 · GitHub

There are exactly two ways that the bug can occur:

  1. A task gets cancelled while the robot is waiting to use a lift, and a new task is assigned (such as the finishing task) which tells the robot to use the same lift right away.
  2. A negotiation or replan happens while the robot is waiting to use a lift, and the new plan tells the robot to use the same lift right away

I believe all the bugged cases that you’ve discovered match one of those categories, so the fix should cover everything.

That said, I would still recommend making sure that robots can always enter and exit lifts without a risk of conflicts occurring (e.g. by using mutex groups) since negotiating may still cause significant disruption during the somewhat brittle process of getting in and out of lifts.


This is the chosen answer.