Can one RMF application instance work properly on AWS across multiple availability zones? (#586)

Posted by @luyiluis:

I have a question regarding RMF application deployment. If we deploy one RMF application instance on AWS (Elastic Container Service) across multiple availability zones, will it still work properly with robots sending data to RMF in the cloud?

If I’m not wrong, in this case AWS will deploy the same RMF application as multiple RMF instances across the different availability zones to support high availability of the application, but the current RMF does not seem to support one robot interfacing with multiple RMF core instances, even if they share the same ROS domain ID and DDS network?

Please kindly advise on the recommended high availability solution for RMF. Thanks.

Posted by @akash-roboticist:

Hi @luyiluis, high availability (HA) is itself a characteristic of a system, achieved through one or more strategies, e.g. load balancing (active-active), failover (active-passive), replication, etc.

Stateful applications like Open-RMF, by their very nature, limit the strategies that can be used for HA out of the box. A failover (active-passive) strategy can still be used (e.g. spinning up a node on a backup server when the primary fails), with some acceptable loss of data, e.g. traffic data, fleet adapter interim states, etc.
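The active-passive idea above can be sketched as a toy heartbeat monitor. This is only an illustration under stated assumptions: `FailoverMonitor`, `timeout_s`, and the "primary"/"standby" labels are hypothetical names, not part of Open-RMF or any AWS API, and a real deployment would rely on the orchestrator's own health checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverMonitor:
    """Toy active-passive monitor (hypothetical, not an Open-RMF component)."""

    timeout_s: float                          # assumed heartbeat timeout
    active: str = "primary"                   # replica currently serving
    last_heartbeat_s: Optional[float] = None  # time of last primary heartbeat

    def heartbeat(self, now_s: float) -> None:
        """Record a heartbeat received from the primary."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> str:
        """Promote the standby once the primary has been silent too long."""
        silent_too_long = (
            self.last_heartbeat_s is not None
            and now_s - self.last_heartbeat_s > self.timeout_s
        )
        if self.active == "primary" and silent_too_long:
            # The standby starts without the primary's in-memory state,
            # which is the "acceptable loss of data" mentioned above.
            self.active = "standby"
        return self.active
```

For example, with `timeout_s=2.0` and a single heartbeat at time 0, `check(1.0)` keeps the primary active, while `check(3.5)` promotes the standby.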

There are, however, quite a few components that are not limited by such state constraints, e.g. the API server, database, etc. These can be made highly available using any of the many available HA strategies (e.g. FastAPI load balancer setup guide, PostgreSQL HA guide) or by using managed / hosted services, like those from AWS, that may be HA out of the box.

However, if your question is about horizontal scaling, i.e. robots can connect to any of many instances and robots connected to different instances must still work together, that is currently not possible.

Posted by @mxgrey:

> the current RMF does not seem to support one robot interfacing with multiple RMF core instances, even if they share the same ROS domain ID and DDS network?

Can you elaborate on what you would like to accomplish with having a robot interface with multiple “RMF core instances”? Please be as specific as you can. For example, what do you consider a single “RMF core instance” to be? Does “one instance” mean one traffic schedule node? Does it mean one traffic schedule node plus some set of fleet adapters? An Open-RMF system consists of many different nodes (executables) working together, so I don’t think there’s an obvious single entity that can be referred to as an “RMF core instance”.

One of the core motivations for people to use Open-RMF is the mobile robot negotiation system. That negotiation system requires all the robots’ plans to be channeled into one traffic schedule node which can then identify conflicts between their plans. If you have multiple traffic schedule nodes and robots can willy-nilly choose which one they talk to, then the traffic negotiation system will fail to recognize some conflicts, leading to traffic jams in the physical system.
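To make the failure mode concrete, here is a toy model of it. It is a sketch only: the `Plan` type, `find_conflicts`, and the robot and waypoint names are made up for illustration and do not reflect how the Open-RMF traffic schedule actually represents or compares plans.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical plan format: a list of (time_step, waypoint) reservations.
Plan = List[Tuple[int, str]]


def find_conflicts(plans: Dict[str, Plan]) -> Set[Tuple[str, str]]:
    """Return pairs of robots that reserve the same waypoint at the same time step."""
    conflicts: Set[Tuple[str, str]] = set()
    for (a, plan_a), (b, plan_b) in combinations(sorted(plans.items()), 2):
        if set(plan_a) & set(plan_b):
            conflicts.add((a, b))
    return conflicts


plans = {
    "robot_a": [(0, "lobby"), (1, "corridor"), (2, "lift")],
    "robot_b": [(0, "dock"), (1, "corridor"), (2, "lobby")],
    "robot_c": [(0, "pantry"), (1, "pantry"), (2, "pantry")],
}

# One schedule sees every plan, so it can detect the corridor conflict.
single_schedule = find_conflicts(plans)

# Two schedules each see only a subset of the plans, so the same
# conflict between robot_a and robot_b is never compared by either one.
schedule_1 = find_conflicts({"robot_a": plans["robot_a"], "robot_c": plans["robot_c"]})
schedule_2 = find_conflicts({"robot_b": plans["robot_b"]})
split_schedules = schedule_1 | schedule_2
```

In this toy setup `single_schedule` contains the pair `("robot_a", "robot_b")`, while `split_schedules` is empty even though both robots still plan to be in the corridor at the same time step.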

To have load balancing for the traffic system, it would need to be possible for multiple traffic schedule nodes to stay in sync with each other. As far as I can figure, that would be no more efficient than having all the fleet adapters talk to the same traffic schedule node. It’s also been my impression that communication between the fleet adapters and the traffic schedule has not been a bottleneck for any deployments.
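A back-of-the-envelope count shows why synchronization may not pay off. The sketch below assumes a naive full-replication scheme in which the schedule node that receives a plan update forwards it to every other schedule node; that scheme and the function name `messages_per_update` are assumptions for illustration, not something Open-RMF implements.

```python
def messages_per_update(schedule_nodes: int) -> int:
    """Messages needed for one plan update to reach every schedule node.

    Assumes naive full replication: one message from the fleet adapter to
    its schedule node, plus one forward to each of the other nodes.
    """
    if schedule_nodes < 1:
        raise ValueError("at least one schedule node is required")
    return 1 + (schedule_nodes - 1)
```

Under that assumption a single schedule node costs one message per update, and every additional node adds one more, so each node still ends up processing every update.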

That being said, we’re open to designing these Open-RMF nodes in a way that supports load balancing, but we would need guidance on the following points:

  1. What are the characteristics of a deployment that needs load balancing? E.g. Does the server get overloaded with network traffic when a certain number of robots are present? Do you have multiple buildings in your deployment, each running their own traffic schedules, but you want to interface with them over the same dashboard?
  2. What is the topology of your deployment? E.g. Do you have multiple cloud instances communicating over a network? Multiple Docker containers within one cloud instance? How are nodes organized across these different instances or containers?
  3. What would be an efficient algorithm for synchronizing across the different balancing nodes? This is likely the hardest question to answer.

Edited by @mxgrey at 2025-02-04T03:54:21Z