Posted by @jeff994:
Not sure if anybody has noticed that RMF_API_Server has a limitation on the number of CPU cores it can use:
- The REST API endpoints are implemented with Python's asyncio library, which runs on a single CPU core by default.
- The ROS node inside the rmf_api_server runs as a separate process, so it can use another core.
We spotted this when we heavily invoked the REST API endpoints and websockets (_internal endpoints). We observed that only 2 CPU cores were used. The primary one can reach very close to 100%, after which websocket connection warnings appear in the API server logs and the RMF web dashboard becomes unresponsive as well.
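For illustration, here is a minimal sketch (illustrative only, not rmf_api_server code) of why a CPU-bound call such as parsing a very large JSON message pins one core and stalls everything else running on the same asyncio event loop:

```python
import asyncio
import json
import time

# Minimal sketch: json.loads() on a huge task-state-like payload runs
# synchronously inside the single-threaded asyncio event loop, so one core
# saturates and every other coroutine (websocket keep-alives, other REST
# handlers) is stalled while it runs.

async def keepalive() -> None:
    # Stand-in for a websocket heartbeat; the gaps between prints grow while
    # the big payloads are being parsed on the same event loop.
    while True:
        print(f"keepalive at {time.monotonic():.2f}")
        await asyncio.sleep(0.2)

async def main() -> None:
    # Roughly mimics a task state whose phases/events have grown very large.
    payload = json.dumps({"phases": [{"events": list(range(1000))}] * 5000})
    heartbeat_task = asyncio.create_task(keepalive())
    for _ in range(5):
        await asyncio.sleep(0.2)   # give the keepalive a chance to run
        json.loads(payload)        # synchronous parse: blocks the whole loop
    heartbeat_task.cancel()

asyncio.run(main())
```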
Posted by @aaronchongth:
Hey @jeff994! This observation was mentioned during our Open-RMF community meeting. Can I confirm that this situation occurs when the task state update messages (over websockets) are very large? (This happens when an RMF task has been running for a very long time, sometimes due to the robot being stuck)
During the discussion, it was noted that when task state update messages become very large, CPU usage increases due to the JSON parsing on the API server, so increasing the number of CPU cores used will not alleviate the issue. It was also noted that during normal operating scenarios, where tasks are ongoing and not stuck, the CPU usage of the API server is rather low, with plenty of headroom.
We discussed a few possible solutions:
- implement a feature that allows the fleet adapter to send only the changes to the task state update message, instead of an entire snapshot (see the sketch after this list). This keeps the messages small, but risks dropping or missing changes, which may or may not be acceptable depending on the deployment requirements
- find a way to curb the number of updates over time, to prevent large messages when tasks are stalled or stuck
- investigate a way to allow the fleet adapter to write updates directly into a backend, which the API server queries
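As a rough illustration of the first option, a diff-style update could look like the sketch below; `diff_state` and the message shape are hypothetical, not an existing Open-RMF API:

```python
from typing import Any, Dict

# Hypothetical sketch: instead of publishing the entire task state snapshot on
# every update, publish only the fields that changed since the last update
# (a JSON-merge-patch-style diff). Removed keys are not handled here, which is
# one way changes could be dropped or missed.

def diff_state(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the keys whose values changed between two task states."""
    delta: Dict[str, Any] = {}
    for key, value in current.items():
        if key not in previous:
            delta[key] = value
        elif isinstance(value, dict) and isinstance(previous[key], dict):
            nested = diff_state(previous[key], value)
            if nested:
                delta[key] = nested
        elif previous[key] != value:
            delta[key] = value
    return delta

# Example: only the active phase's status changes between two updates, so the
# message shrinks from the full snapshot to a few bytes.
prev = {"booking": {"id": "task_1"}, "phases": {"1": {"status": "underway"}}}
curr = {"booking": {"id": "task_1"}, "phases": {"1": {"status": "completed"}}}
print(diff_state(prev, curr))  # {'phases': {'1': {'status': 'completed'}}}
```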
Posted by @jeff994:
@aaronchongth Thanks for your reply. We're not sure it is only related to the websocket message processing. We tried moving the websocket message-processing function to a multi-process solution so that it can use other cores. However, we still see rmf_api_server fully utilizing one core when the websockets stop responding while receiving JSON messages. We suspect (but have not confirmed) that there are some other REST API endpoints that use a lot of CPU.
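A hypothetical sketch of this kind of offloading (not our actual code; the names here are made up, not rmf_api_server internals) is shown below. Even with the parsing pushed to worker processes, the parsed result still has to be transferred back to the main process, and any CPU-bound work left in the endpoint handlers still runs on the single event-loop core:

```python
import asyncio
import json
from concurrent.futures import ProcessPoolExecutor

async def parse_off_loop(pool: ProcessPoolExecutor, raw_message: str) -> dict:
    # json.loads runs in a worker process, so the event loop stays responsive
    # and other cores get used for the parsing itself.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, json.loads, raw_message)

async def main() -> None:
    raw = json.dumps({"phases": [{"events": list(range(100))}] * 1000})
    with ProcessPoolExecutor(max_workers=4) as pool:
        state = await parse_off_loop(pool, raw)
        print("parsed phases:", len(state["phases"]))

if __name__ == "__main__":
    asyncio.run(main())
```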
Edited by @jeff994 at 2025-02-12T06:05:05Z
Posted by @jeff994:
We found that one of the main contributors to the message length is the task state events. During our testing, we found that the events can reach more than 2.4k for a single phase.
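For reference, a small helper like the one below (field names follow the task state JSON we observed and are assumptions; the synthetic example also assumes the 2.4k figure refers to events per phase) can be used to see how large each message is and how many events each phase carries:

```python
import json

def summarize_task_state(raw_message: str) -> None:
    # Report the raw message size and the number of events per phase.
    state = json.loads(raw_message)
    print(f"message size: {len(raw_message) / 1024:.1f} KiB")
    for phase_id, phase in state.get("phases", {}).items():
        events = phase.get("events") or {}
        print(f"phase {phase_id}: {len(events)} events")

# Synthetic example: one phase that has accumulated ~2400 events.
raw = json.dumps({
    "booking": {"id": "task_1"},
    "phases": {"1": {"events": {str(i): {"status": "standby"} for i in range(2400)}}},
})
summarize_task_state(raw)
```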
Edited by @jeff994 at 2025-02-19T01:20:37Z
Posted by @aaronchongth:
Thanks for confirming. Yes, we have seen this before, but only in cases where the robot is stuck for an extended period of time and the issue is not resolved.
We discussed a few possible solutions:
- implement a feature that allows the fleet adapter to send only the changes to the task state update message, instead of an entire snapshot. This keeps the messages small, but risks dropping or missing changes, which may or may not be acceptable depending on the deployment requirements
- find a way to curb the number of updates over time, to prevent large messages when tasks are stalled or stuck (see the sketch after this list)
- investigate a way to allow the fleet adapter to write updates directly into a backend, which the API server queries
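As a rough sketch of the second option, task state publications could be coalesced so a stalled task publishes at most once per interval; the class, names, and interval below are hypothetical, not an existing fleet adapter API:

```python
import time
from typing import Any, Callable, Dict, Optional

class ThrottledPublisher:
    """Coalesce rapid task state updates into at most one message per interval."""

    def __init__(self, publish: Callable[[Dict[str, Any]], None], min_interval: float = 2.0):
        self._publish = publish
        self._min_interval = min_interval
        self._last_sent = 0.0
        self._pending: Optional[Dict[str, Any]] = None

    def update(self, task_state: Dict[str, Any]) -> None:
        # Remember the latest state; publish only if the interval has elapsed.
        self._pending = task_state
        now = time.monotonic()
        if now - self._last_sent >= self._min_interval:
            self._publish(self._pending)
            self._pending = None
            self._last_sent = now

    def flush(self) -> None:
        # Force out the most recent unsent state, e.g. when a task completes.
        if self._pending is not None:
            self._publish(self._pending)
            self._pending = None
            self._last_sent = time.monotonic()

# Usage: rapid updates from a stuck task collapse into one message per interval.
publisher = ThrottledPublisher(lambda state: print("publish", state["status"]))
for i in range(10):
    publisher.update({"status": f"underway ({i})"})
    time.sleep(0.3)
publisher.flush()
```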
As mentioned, there are several ways forward. This was raised in the community meeting and we are still evaluating solutions and fixes at the moment.
For now, the most straightforward fix might be to figure out why the robot is getting stuck and generating so many messages, and to resolve that by perhaps optimizing the RMF navigation graphs or ensuring blockages don't happen.