Once a Juju client connects to a Juju HA controller, it remains connected to that controller until either the client or the controller is restarted. Because of current bad client behaviour (LP: #1793245), clients target a single controller, which continues to gain clients while the other controllers do not. Coupled with the apparent lack of any hard limit on the number of client connections a controller will accept, this leads to Juju controllers being OOM killed (LP: #1799360). That in turn leads to stability issues, because the bad client behaviour is combined with controller-to-self and controller-to-controller communication often failing (LP: #1799363).
Juju HA controllers should actively work to distribute client connections - some of the options to do this include:
- Randomly failing client connections (e.g. reject one in three connections on the basis that the clients will try another controller and/or retry).
- Communicate the number of client connections between controllers and disconnect clients when a controller has X (say 500-1000) more clients than the others in a stable state. This requires clients to be better behaved (LP: #1793245 needs to be fixed first), so that they are likely to reconnect to another controller.
- Provide a "redirect to controller X" in the API - this is similar to the above, but allows clients to be specifically directed to another controller that is known to be healthy and less loaded.
- Front the Juju HA controllers with some form of load balancer that actively distributes incoming client connections to the jujud API servers.
It is worth noting that part of the connection distribution problem can be avoided simply by having clients randomise the controller IP list. This does not, however, address situations caused by bringing one controller down and back up again (it will still have ~0 connections until clients are disconnected from the other controllers over some time period).
https://github.com/juju/juju/pull/9360 targets 2.4 with a couple of changes that should help connections:
changing from fast retries to exponential backoff, and a rand.Shuffle of the controller addresses for Agents.