Posted by @jeff994:
Not sure if anybody has noticed that RMF_API_Server has a limitation on the number of CPU cores it can use:
- The REST API endpoints are implemented with Python's asyncio library, which runs on a single CPU core by default.
- The ROS node inside the rmf_api_server runs as a separate process, so it can use another core.
We spotted this when we heavily invoked the REST API endpoints and websockets (_internal endpoints). We observed that only 2 CPU cores were used. The primary one can reach very close to 100%, after which websocket connection warnings appear in the API server logs and the RMF web dashboard becomes unresponsive as well.
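For illustration, here is a minimal sketch (illustrative only, not rmf_api_server code) of why a CPU-bound call such as parsing a very large JSON message pins one core and stalls everything else running on the same asyncio event loop:

```python
import asyncio
import json
import time

# Minimal sketch: json.loads() on a huge task-state-like payload runs
# synchronously inside the single-threaded asyncio event loop, so one core
# saturates and every other coroutine (websocket keep-alives, other REST
# handlers) is stalled while it runs.

async def keepalive() -> None:
    # Stand-in for a websocket heartbeat; the gaps between prints grow while
    # the big payloads are being parsed on the same event loop.
    while True:
        print(f"keepalive at {time.monotonic():.2f}")
        await asyncio.sleep(0.2)

async def main() -> None:
    # Roughly mimics a task state whose phases/events have grown very large.
    payload = json.dumps({"phases": [{"events": list(range(1000))}] * 5000})
    heartbeat_task = asyncio.create_task(keepalive())
    for _ in range(5):
        await asyncio.sleep(0.2)   # give the keepalive a chance to run
        json.loads(payload)        # synchronous parse: blocks the whole loop
    heartbeat_task.cancel()

asyncio.run(main())
```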
Posted by @aaronchongth:
Hey @jeff994! This observation was mentioned during our Open-RMF community meeting. Can I confirm that this situation occurs when the task state update messages (over websockets) are very large? (This happens when an RMF task has been running for a very long time, sometimes due to the robot being stuck)
During the discussion, it was noted that when task state update messages become very large, CPU usage increases due to the JSON parsing on the API server, so increasing the number of CPU cores used will not alleviate the issue. It was also noted that during normal operating scenarios, where tasks are ongoing and not stuck, the CPU usage of the API server is rather low, with plenty of headroom.
We discussed a few possible solutions:
- implement a feature that allows the fleet adapter to send only the changes to the task state update message, instead of an entire snapshot (see the sketch after this list). This keeps the messages small, but risks dropping or missing changes, which may or may not be acceptable depending on the deployment requirements
- find a way to curb the number of updates over time, to prevent large messages when tasks are stalled or stuck
- investigate a way to allow the fleet adapter to write updates directly into a backend, which the API server queries
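As a rough illustration of the first option, a diff-style update could look like the sketch below; `diff_state` and the message shape are hypothetical, not an existing Open-RMF API:

```python
from typing import Any, Dict

# Hypothetical sketch: instead of publishing the entire task state snapshot on
# every update, publish only the fields that changed since the last update
# (a JSON-merge-patch-style diff). Removed keys are not handled here, which is
# one way changes could be dropped or missed.

def diff_state(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the keys whose values changed between two task states."""
    delta: Dict[str, Any] = {}
    for key, value in current.items():
        if key not in previous:
            delta[key] = value
        elif isinstance(value, dict) and isinstance(previous[key], dict):
            nested = diff_state(previous[key], value)
            if nested:
                delta[key] = nested
        elif previous[key] != value:
            delta[key] = value
    return delta

# Example: only the active phase's status changes between two updates, so the
# message shrinks from the full snapshot to a few bytes.
prev = {"booking": {"id": "task_1"}, "phases": {"1": {"status": "underway"}}}
curr = {"booking": {"id": "task_1"}, "phases": {"1": {"status": "completed"}}}
print(diff_state(prev, curr))  # {'phases': {'1': {'status': 'completed'}}}
```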
Posted by @jeff994:
@aaronchongth Thanks for your reply. We're not sure it is only related to the websocket message processing. We tried moving the websocket message-processing function to a multi-process solution so that it can use other cores. However, we still see rmf_api_server fully utilizing one core when the websockets stop responding while receiving JSON messages. We suspect (but have not confirmed) that there are some other REST API endpoints that use a lot of CPU.
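A hypothetical sketch of this kind of offloading (not our actual code; the names here are made up, not rmf_api_server internals) is shown below. Even with the parsing pushed to worker processes, the parsed result still has to be transferred back to the main process, and any CPU-bound work left in the endpoint handlers still runs on the single event-loop core:

```python
import asyncio
import json
from concurrent.futures import ProcessPoolExecutor

async def parse_off_loop(pool: ProcessPoolExecutor, raw_message: str) -> dict:
    # json.loads runs in a worker process, so the event loop stays responsive
    # and other cores get used for the parsing itself.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, json.loads, raw_message)

async def main() -> None:
    raw = json.dumps({"phases": [{"events": list(range(100))}] * 1000})
    with ProcessPoolExecutor(max_workers=4) as pool:
        state = await parse_off_loop(pool, raw)
        print("parsed phases:", len(state["phases"]))

if __name__ == "__main__":
    asyncio.run(main())
```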
Edited by @jeff994 at 2025-02-12T06:05:05Z
Posted by @jeff994:
We found that one of the main contributors to the message length is the task state events. During our testing, we found that the events can reach more than 2.4k for a single phase.
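For reference, a small helper like the one below (field names follow the task state JSON we observed and are assumptions; the synthetic example also assumes the 2.4k figure refers to events per phase) can be used to see how large each message is and how many events each phase carries:

```python
import json

def summarize_task_state(raw_message: str) -> None:
    # Report the raw message size and the number of events per phase.
    state = json.loads(raw_message)
    print(f"message size: {len(raw_message) / 1024:.1f} KiB")
    for phase_id, phase in state.get("phases", {}).items():
        events = phase.get("events") or {}
        print(f"phase {phase_id}: {len(events)} events")

# Synthetic example: one phase that has accumulated ~2400 events.
raw = json.dumps({
    "booking": {"id": "task_1"},
    "phases": {"1": {"events": {str(i): {"status": "standby"} for i in range(2400)}}},
})
summarize_task_state(raw)
```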
Edited by @jeff994 at 2025-02-19T01:20:37Z
Posted by @aaronchongth:
Thanks for confirming. Yes, we have seen this before, but only in cases where the robot is stuck for an extended period of time and the issue is not resolved.
We discussed a few possible solutions:
- implement a feature that allows the fleet adapter to send only the changes to the task state update message, instead of an entire snapshot. This keeps the messages small, but risks dropping or missing changes, which may or may not be acceptable depending on the deployment requirements
- find a way to curb the number of updates over time, to prevent large messages when tasks are stalled or stuck (see the sketch after this list)
- investigate a way to allow the fleet adapter to write updates directly into a backend, which the API server queries
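As a rough sketch of the second option, task state publications could be coalesced so a stalled task publishes at most once per interval; the class, names, and interval below are hypothetical, not an existing fleet adapter API:

```python
import time
from typing import Any, Callable, Dict, Optional

class ThrottledPublisher:
    """Coalesce rapid task state updates into at most one message per interval."""

    def __init__(self, publish: Callable[[Dict[str, Any]], None], min_interval: float = 2.0):
        self._publish = publish
        self._min_interval = min_interval
        self._last_sent = 0.0
        self._pending: Optional[Dict[str, Any]] = None

    def update(self, task_state: Dict[str, Any]) -> None:
        # Remember the latest state; publish only if the interval has elapsed.
        self._pending = task_state
        now = time.monotonic()
        if now - self._last_sent >= self._min_interval:
            self._publish(self._pending)
            self._pending = None
            self._last_sent = now

    def flush(self) -> None:
        # Force out the most recent unsent state, e.g. when a task completes.
        if self._pending is not None:
            self._publish(self._pending)
            self._pending = None
            self._last_sent = time.monotonic()

# Usage: rapid updates from a stuck task collapse into one message per interval.
publisher = ThrottledPublisher(lambda state: print("publish", state["status"]))
for i in range(10):
    publisher.update({"status": f"underway ({i})"})
    time.sleep(0.3)
publisher.flush()
```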
As mentioned, there are several ways forward. This was raised in the community meeting and we are still evaluating solutions and fixes at the moment.
For now, the most straightforward fix might be to figure out why the robot is getting stuck and generating so many messages, and to resolve that by perhaps optimizing the RMF navigation graphs or ensuring blockages don't happen.