flapping presence on MAAS in HA when controller shut down
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Low | Unassigned |
Bug Description
Bootstrap a controller on MAAS and enable HA. Stop controller machine 2.
juju status -m controller takes a long time to show machine 2 as down, and the reported state oscillates between started and down for a while, depending on which of the remaining controller machines the client asks for the status.
Running juju_presence_
Eventually (after about 10 minutes?) controller 1 notices that controller 2 is gone and the presence stops flapping.
[controller 0]
ubuntu@nuc2:~$ juju_presence_
Querying @jujud-machine-0 introspection socket: /presence/
[5983ba1d-
AGENT SERVER CONN ID STATUS
machine-0 machine-0 4 alive
machine-0 machine-0 6 alive
machine-0 machine-1 17 alive
machine-0 machine-2 8 missing
machine-1 machine-0 2 alive
machine-1 machine-1 8 alive
machine-1 machine-1 10 alive
machine-1 machine-2 6 missing
machine-2 machine-2 2 missing
machine-2 machine-2 4 missing
[5a08305e-
AGENT SERVER CONN ID STATUS
machine-0 (controller) machine-0 5 alive
machine-0 machine-1 2 alive
machine-1 (controller) machine-1 9 alive
machine-2 (controller) machine-2 3 missing
unit-ubuntu-lite-4 machine-0 399 alive
unit-ubuntu-lite-5 machine-0 402 alive
unit-ubuntu-lite-6 machine-0 403 alive
ubuntu@nuc2:~$ juju_pubsub_report
Querying @jujud-machine-0 introspection socket: /pubsub
PubSub Report:
Source: machine-0
Target: machine-1
Status: connected
Addresses: [10.0.0.170:17070]
Queue length: 0
Sent count: 148270
Target: machine-2
Status: disconnected
Addresses: [10.0.0.32:17070]
Queue length: 0
Sent count: 2183
[controller 1]
ubuntu@nuc7:~$ juju_presence_
Querying @jujud-machine-1 introspection socket: /presence/
[5983ba1d-
AGENT SERVER CONN ID STATUS
machine-0 machine-0 4 alive
machine-0 machine-0 6 alive
machine-0 machine-1 17 alive
machine-0 machine-2 8 alive
machine-1 machine-0 2 alive
machine-1 machine-1 8 alive
machine-1 machine-1 10 alive
machine-1 machine-2 6 alive
machine-2 machine-2 2 alive
machine-2 machine-2 4 alive
[5a08305e-
AGENT SERVER CONN ID STATUS
machine-0 (controller) machine-0 5 alive
machine-0 machine-1 2 alive
machine-1 (controller) machine-1 9 alive
machine-2 (controller) machine-2 3 alive
unit-ubuntu-lite-4 machine-0 399 alive
unit-ubuntu-lite-5 machine-0 402 alive
unit-ubuntu-lite-6 machine-0 403 alive
ubuntu@nuc7:~$ juju_pubsub_report
Querying @jujud-machine-1 introspection socket: /pubsub
PubSub Report:
Source: machine-1
Target: machine-0
Status: connected
Addresses: [10.0.0.156:17070]
Queue length: 0
Sent count: 14468
Target: machine-2
Status: connected
Addresses: [10.0.0.32:17070]
Queue length: 0
Sent count: 760
Changed in juju:
importance: Undecided → High
I think we should have the presence worker publish "I'm alive" messages over UDP on the controller port, falling back to the apiserver port if the controller port isn't there.
A UDP packet every second isn't a big issue, and we can treat the agent as down if we think it is alive and we don't get a ping for 5 seconds.