Fleet adapter segmentation fault after task assignment (#205)

Posted by @cwrx777:

Hi,

I am encountering segmentation fault as seen in the fleet adapter log below.
This happened after a task was dispatched using:

ros2 run rmf_demos_tasks dispatch_go_to_place -F DeliveryRobot  -R DeliveryRobot_1 -p location_A --use_sim_time
Json msg payload: 
{
  "type": "robot_task_request",
  "robot": "DeliveryRobot_1",
  "fleet": "DeliveryRobot",
  "request": {
    "category": "compose",
    "description": {
      "category": "go_to_place",
      "phases": [
        {
          "activity": {
            "category": "go_to_place",
            "description": {
              "waypoint": "location_A"
            }
          }
        }
      ]
    },
    "unix_millis_earliest_start_time": 0
  }
}
Got response:
{'state': {'assigned_to': {'group': 'DeliveryRobot', 'name': 'DeliveryRobot_1'}, 'booking': {'id': 'direct_2c62d134-cf75-4735-a60a-5c7c454549ca', 'unix_millis_earliest_start_time': 0}, 'detail': {'category': 'go_to_place', 'phases': [{'activity': {'category': 'go_to_place', 'description': {'waypoint': 'location_A'}}}]}, 'dispatch': {'status': 'queued'}, 'status': 'queued'}, 'success': True}

Could anyone provide any hints on the cause and how to debug the fleet adapter (and its C++ libraries) using Visual Studio Code?
The fleet adapter is based on the python template.

fleet adapter log:

[fleet_adapter_mircw-1] [INFO] [1661833127.926742550] [DeliveryRobot_cmd_hndl]: Robot [DeliveryRobot_1] has reached its target waypoint
[fleet_adapter_mircw-1] [WARN] [1661833255.017844045] [DeliveryRobot_fleet_adapter]: Unable to estimate final state for direct task request [direct_2c62d134-cf75-4735-a60a-5c7c454549ca]. This may be due to insufficient resources to perform the task. The task will be still be added to the queue.
[fleet_adapter_mircw-1] [INFO] [1661833255.023639645] [DeliveryRobot_fleet_adapter]: Direct request [direct_2c62d134-cf75-4735-a60a-5c7c454549ca] successfully queued for robot [DeliveryRobot_1]
[fleet_adapter_mircw-1] [INFO] [1661833255.361641290] [DeliveryRobot_cmd_hndl]: Requesting robot to stop...
[fleet_adapter_mircw-1] [INFO] [1661833255.363074190] [DeliveryRobot_cmd_hndl]: [DeliveryRobot_1]-API stop
[ERROR] [fleet_adapter_mircw-1]: process has died [pid 1370612, exit code -11, cmd '/home/user0001/ws_rmf/rmf/install/fleet_adapter_mircw/lib/fleet_adapter_mircw/fleet_adapter_mircw -c /home/user0001/ws_rmf/rmf/install/user0001_demos/share/user0001_demos/config/maze/fleet_adapter_mircw_config.yaml -n /home/user0001/ws_rmf/rmf/install/user0001_demos_maps/share/user0001_demos_maps/maps/maze/nav_graphs/0.yaml -s ws://localhost:7878 --use_sim_time --ros-args'].

in syslog:

Aug 30 12:20:58 foscaVM kernel: [223501.501138] fleet_adapter_m[1370891]: segfault at 0 ip 00007f909bec2854 sp 00007f908effb640 error 4 in librmf_task_sequence.so[7f909be87000+41000]

Posted by @mxgrey:

When debugging a fleet adapter segmentation fault that involves Python integration, I usually use gdb, like this:

gdb --args python3 /home/user0001/ws_rmf/rmf/install/fleet_adapter_mircw/lib/fleet_adapter_mircw/fleet_adapter_mircw -c /home/user0001/ws_rmf/rmf/install/user0001_demos/share/user0001_demos/config/maze/fleet_adapter_mircw_config.yaml -n /home/user0001/ws_rmf/rmf/install/user0001_demos_maps/share/user0001_demos_maps/maps/maze/nav_graphs/0.yaml -s ws://localhost:7878 --use_sim_time

You’ll need to make sure to compile rmf_fleet_adapter in debug mode. Sometimes I find it helpful to compile the entire RMF workspace in debug mode in case the issue is in a lower level library.

Unfortunately running gdb directly in the terminal the way I’m suggesting means you’ll need to navigate the stack trace using gdb terminal commands which are not very easy to use. I don’t know if Visual Studio Code offers any gdb plugins to make that easier, but I’d be very interested if you can find something.

As for what could cause the crash, I find these crashes are almost always caused by trying to dereference an invalid reference or trying to use an out-of-bounds index of a container. Perhaps there’s somewhere that you’ve passed a None into the RMF API bindings when it’s expecting a valid object reference?

Posted by @cwrx777:

Hi @mxgrey ,
I managed to find where the fault happened.

  1. make_model in here returns null. and this caused the fault.
  2. estimate_duration in here returns null.
  3. result.disconnected() here returns null.
  4. and that is because of this

Posted by @mxgrey:

Thanks, that stack trace does shed some light on the situation.

What I would infer from that trace is the planner believes the destination of the task is completely unreachable for the robot. I expect that means at least one of the following is true:

  1. You have closed down too many lanes, so there aren’t enough open lanes remaining to connect the robot’s current location to its destination
  2. You have updated the position of the robot in such a way that it is stuck on some disconnected “island” within the nav graph

Are you using the lane closing feature? If not, are there any islands within your nav graph that the robot could be getting stuck in? One way an “island” can happen is if you have a one-way lane that leads to a part of the graph which has no lane to return on.

Of course regardless of the cause we should not be segfaulting. This should be propagated as some kind of error instead. I’ll make a PR that prevents this crash from happening.

Posted by @mxgrey:

As an update, this PR should prevent the crash from ever happening again.

The new behavior is the fleet adapter will log a task error which can be viewed from the dashboard, and will periodically attempt to replan for the robot. It will be up to the human operators to decide how to handle the task error. Some steps they could take include:

  • Cancel the task if the robot can no longer do it
  • Open up whatever closed lanes are preventing it from reaching the goal
  • Guide the robot to somewhere that it can successfully plan from