Highly available open-rmf and adapters (#78)

Posted by @cwrx777:

Hi,

Can open-rmf core, supporting nodes, and the adapters be made highly available or load balanced? Or is this already implemented to some extent?

Chosen answer

Answer chosen by @cwrx777 at 2021-07-22T05:14:13Z.
Answered by @mxgrey:

We certainly care a lot about ensuring high availability, since loss of services will have a huge negative impact on business operations. We recently merged a feature to support seamless fail over for the traffic schedule node, which is a critical single-point-of-failure component for the traffic management system. There’s still a little more work to be done to cover all possible failure modes for the traffic schedule node, but that effort is already under way. As long as that component can fail over gracefully, there shouldn’t be anything else that can bring down the traffic management system.
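
For readers who want a concrete picture of the fail over pattern being described, here is a minimal, hypothetical sketch of a hot-standby monitor that mirrors the primary’s state and promotes itself when heartbeats stop arriving. This is not the actual rmf_traffic_ros2 mechanism; all names and timings below are illustrative.

```python
import threading
import time

HEARTBEAT_PERIOD = 1.0   # seconds between primary heartbeats
FAILOVER_TIMEOUT = 3.0   # missed-heartbeat window before the monitor takes over


class ScheduleMonitor:
    """Hot standby: mirrors the primary's state and promotes itself on failure."""

    def __init__(self):
        self._last_heartbeat = time.monotonic()
        self._mirrored_schedule = {}   # latest schedule state received from the primary
        self._is_primary = False

    def on_heartbeat(self, schedule_snapshot):
        """Called whenever the primary publishes a heartbeat plus state update."""
        self._last_heartbeat = time.monotonic()
        self._mirrored_schedule = schedule_snapshot

    def watch(self):
        """Runs in the monitor process; promotes itself if heartbeats stop."""
        while not self._is_primary:
            time.sleep(HEARTBEAT_PERIOD)
            if time.monotonic() - self._last_heartbeat > FAILOVER_TIMEOUT:
                self._promote()

    def _promote(self):
        # Take over as the schedule node using the mirrored state, so clients
        # see a (near-)seamless hand-off instead of a cold restart.
        self._is_primary = True
        print("Primary lost; monitor is now serving the schedule:",
              self._mirrored_schedule)


if __name__ == "__main__":
    monitor = ScheduleMonitor()
    threading.Thread(target=monitor.watch, daemon=True).start()
    # Simulate a primary that heartbeats a few times and then dies.
    for version in range(3):
        monitor.on_heartbeat({"version": version})
        time.sleep(HEARTBEAT_PERIOD)
    time.sleep(FAILOVER_TIMEOUT + 2 * HEARTBEAT_PERIOD)  # let the failover trigger
```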

We also have plans to work on graceful fail over for fleet adapters so that task management can operate without any risk of interruption, but that effort has not started yet. For any other sub-systems in RMF, we’ll need to look at them on a case-by-case basis to determine a good fail over strategy. There are many different kinds of systems at play, so I don’t expect to get a one-size-fits-all solution.

Regarding load balancing, we make a point to keep the communication and processing extremely efficient, so I expect it to scale well enough that explicit load balancing of specific services won’t be needed, at least for the traffic management system. For one thing, the calculation of motion plans is already distributed across many processes by design, and the negotiation to resolve traffic conflicts is done peer-to-peer rather than bottle-necked by a single process.

There is one potential bottle-neck I can think of in the system, which is the traffic schedule node’s conflict detection. The traffic schedule node takes in the traffic plans of all the robots and compares them against each other to identify upcoming traffic conflicts. However, this isn’t as bad as it might sound, because the conflict detection takes place in its own thread, and after each cycle of checking is finished, it grabs a snapshot of the latest state of the whole schedule to do another round of checking. Even if it gets flooded with changes in between conflict-detection cycles, it simply jumps ahead to the latest schedule version, so even if it falls behind to some degree, it will always catch back up. It may be plausible to load balance this responsibility by creating multiple traffic schedule nodes and designating each node to check conflicts for a subset of the robots, but I don’t think that will be necessary in the near future, and it can be added later without any modification to the RMF APIs or specifications. So I don’t plan on pursuing that until benchmarks and user requirements start to indicate that it will be worthwhile.
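
To make the catch-up behavior concrete, here is a small hypothetical sketch of the snapshot-per-cycle pattern described above: the checker thread always starts each cycle from the newest snapshot, so any intermediate versions it missed are simply skipped. It is an illustration only, not the real rmf_traffic_ros2 code, and the conflict test is a placeholder.

```python
import itertools
import threading
import time


class ScheduleMirror:
    """Holds the latest version of the full schedule, updated by incoming changes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._itineraries = {}   # robot name -> planned trajectory (placeholder)

    def apply_change(self, robot, itinerary):
        with self._lock:
            self._version += 1
            self._itineraries[robot] = itinerary

    def snapshot(self):
        """Return (version, copy of the whole schedule) atomically."""
        with self._lock:
            return self._version, dict(self._itineraries)


def conflict_detection_loop(mirror, stop):
    """Runs in its own thread; each cycle starts from the *latest* snapshot."""
    last_checked = -1
    while not stop.is_set():
        version, itineraries = mirror.snapshot()
        if version == last_checked:
            time.sleep(0.05)          # nothing new to check
            continue
        # Pairwise check of every itinerary against every other (placeholder test).
        for (a, ia), (b, ib) in itertools.combinations(itineraries.items(), 2):
            if ia == ib:              # stand-in for a real trajectory-overlap test
                print(f"conflict between {a} and {b} at schedule version {version}")
        # Even if many versions arrived while we were checking, the next
        # iteration jumps straight to whatever the newest version is.
        last_checked = version


if __name__ == "__main__":
    mirror, stop = ScheduleMirror(), threading.Event()
    worker = threading.Thread(target=conflict_detection_loop, args=(mirror, stop))
    worker.start()
    # Flood the mirror with updates; the checker skips straight to the latest.
    for i in range(100):
        mirror.apply_change(f"robot_{i % 5}", f"path_{i % 3}")
    time.sleep(0.5)
    stop.set()
    worker.join()
```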

Posted by @gbiggs:

We are currently undergoing a push to add redundancy and robustness-to-failure to OpenRMF. Some of the nodes used by the core are already redundant in the main branch. I don’t think we have any plans at the current time to enable load balancing, however.

Posted by @cwrx777:

Hi @gbiggs and @mxgrey,

Thanks for the answers.

For adapter nodes that expose a REST API, such as the api-server, is it possible to run multiple instances of them for load-balancing purposes?

Posted by @mxgrey:

The web team (cc @koonpeng) would have a better sense than I do of how to apply load balancing for the web components. But off hand, I can’t think of any reason there would be a problem.

Posted by @koonpeng:

Currently, nodes like the api-server cannot be run multi-instance. However, with the recommended Kubernetes deployment, new pods can be automatically recreated if the app or the server goes down (assuming you have a multi-node Kubernetes cluster).

There are startup routines to synchronize data with rmf core so that data loss is kept to a minimum if a restart ever occurs.
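
As a rough, purely hypothetical illustration of what such a startup routine could look like (the real rmf-web api-server differs in detail, and every name below is made up for the example): re-query the current fleet and task state before the instance starts accepting requests, so a freshly restarted pod rebuilds its cache quickly.

```python
import asyncio


class ApiServerCache:
    """In-memory state that a freshly (re)started api-server instance rebuilds."""

    def __init__(self):
        self.fleet_states = {}
        self.task_summaries = {}
        self.ready = asyncio.Event()   # gate requests until the sync finishes

    async def sync_with_rmf_core(self, rmf):
        """Hypothetical startup routine: pull current state before serving."""
        # `rmf` stands in for whatever client talks to the core (ROS 2 topics,
        # services, etc.); these method names are illustrative only.
        self.fleet_states = await rmf.query_fleet_states()
        self.task_summaries = await rmf.query_active_tasks()
        self.ready.set()               # readiness probe can now report healthy


class FakeRmfClient:
    """Stand-in for the real RMF interface, for demonstration purposes."""

    async def query_fleet_states(self):
        return {"fleet_a": {"robots": ["r1", "r2"]}}

    async def query_active_tasks(self):
        return {"task_001": {"state": "underway"}}


async def main():
    cache = ApiServerCache()
    await cache.sync_with_rmf_core(FakeRmfClient())
    await cache.ready.wait()           # e.g. what a readiness handler would await
    print("synced:", cache.fleet_states, cache.task_summaries)


if __name__ == "__main__":
    asyncio.run(main())
```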

The reason we can’t support multi-instance deployment comes down to a couple of limitations:

  1. In order to support some features like health monitoring, the api-server depends on time-series and stateful data from rmf. Multiple instances of the app would not receive the same set of data at the same time, leading to conflicting results.
  2. It is bottlenecked by rmf; as long as rmf does not support multiple instances, scaling up the api-server will not yield any performance improvement.

If point 1 is solved, then we could have multiple instances of the api-server, but that is only useful if there is a need to support a large number of concurrent read-only users, as any commands like submitting tasks will still be bottlenecked by rmf’s capacity.

Posted by @cwrx777:

In this pull request, when the original (primary) schedule node comes back online, does the monitor node become the backup node again and resume its monitoring role?

Posted by @mxgrey:

The out-of-the-box monitor node will simply become a schedule node and stop being a monitor node. Depending on exactly what kind of fail over behavior you want, you’ll need a higher-level strategy than just using one simple monitor node.

One strategy might be to have a watchdog that watches for when a fail over event happens and then forks off a new monitor node, so that the system is ready if another fail over occurs. The limitation of this strategy is that it doesn’t help if the computer running the schedule node + monitor node + watchdog gets disconnected from the network.
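
A hypothetical sketch of that watchdog idea (illustrative only, not an existing RMF tool): a small supervisor polls for the fail over event and launches a replacement monitor process; in practice the replacement would presumably be spawned on a different machine to avoid the single-computer limitation just mentioned.

```python
import subprocess
import sys
import time

# Placeholder command that stands in for launching a fresh monitor node; in a
# real deployment this would be whatever starts your monitor (e.g. a ros2 run
# invocation). Purely illustrative.
SPAWN_MONITOR_CMD = [sys.executable, "-c", "print('replacement monitor node started')"]

CHECK_PERIOD = 2.0   # seconds between checks for a fail over event


def failover_has_occurred():
    """Stand-in for real detection logic, e.g. noticing that the former monitor
    is now advertising itself as the active schedule node."""
    return False   # replace with an actual check in a real watchdog


def watchdog():
    """Fork off a fresh monitor whenever a fail over consumes the previous one."""
    while True:
        if failover_has_occurred():
            # The old monitor has promoted itself to schedule node, so there is
            # currently no standby; launch a replacement.
            subprocess.Popen(SPAWN_MONITOR_CMD)
            print("fail over detected; spawned a replacement monitor node")
        time.sleep(CHECK_PERIOD)


if __name__ == "__main__":
    watchdog()
```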

Since the loss of a schedule node is very significant and should be quite rare, our current intention is that there would just be one simple monitor node running, and when a fail over occurs that monitor node would become the new schedule node, and the operators would be notified that something Very Bad happened. Then the operators would be armed with a playbook to decide where they should spin up the next monitor node depending on exactly what type of failure happened. But this strategy is subject to change as we learn more about what kind of architecture users will want for their deployments.