Intermittent HA controller down

Bug #1973323 reported by Juan M. Tirado
This bug affects 3 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Heather Lanigan
Milestone: none

Bug Description

During the triage of bug https://bugs.launchpad.net/juju/+bug/1973164 I found that, after rebooting one of the controllers, the status command shows one of the controllers in an endless started->down loop.

To reproduce, use the same steps mentioned in the bug above and kill a secondary:

juju bootstrap localhost lxd-lcl-controller
juju add-machine -m controller -n 2
juju enable-ha --to 1,2

Find the HA primary:

  controller-machines:
    "0":
      instance-id: juju-4fe7ec-0
      ha-status: ha-enabled
      ha-primary: true
PRIMARY=0
juju ssh -m controller $PRIMARY -- sudo reboot
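
If it helps to reproduce this unattended, the primary can be picked out and rebooted in one go. This is only a sketch added for convenience, not part of the original report; it relies on the ha-primary field shown above and a POSIX awk:

PRIMARY=$(juju show-controller --format=yaml | awk '
  /^ +"[0-9]+":/     { gsub(/[":]/, "", $1); m = $1 }  # remember the last machine id seen
  /ha-primary: true/ { print m; exit }                 # report it once the primary flag appears
')
echo "HA primary is machine ${PRIMARY}"
juju ssh -m controller "${PRIMARY}" -- sudo reboot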

In my case I rebooted controller 0. Then, juju status reports:

Model Controller Cloud/Region Version SLA Timestamp
controller lxd-lcl-controller localhost/localhost 2.9.29 unsupported 14:53:51+02:00

Machine State DNS Inst id Series AZ Message
0 started 10.73.25.245 juju-c924af-0 focal Running
1 started 10.73.25.60 juju-c924af-1 focal Running
2 down 10.73.25.130 juju-c924af-2 focal Running

and then...

Model Controller Cloud/Region Version SLA Timestamp
controller lxd-lcl-controller localhost/localhost 2.9.29 unsupported 14:54:40+02:00

Machine State DNS Inst id Series AZ Message
0 started 10.73.25.245 juju-c924af-0 focal Running
1 started 10.73.25.60 juju-c924af-1 focal Running
2 started 10.73.25.130 juju-c924af-2 focal Running

This might be a concurrency issue, since the problem only appears occasionally.
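
Since the flap is intermittent, one way to catch it is to keep timestamped snapshots of the controller model's machine section until it happens. A rough sketch, added here for convenience (it assumes GNU date for -Is):

while true; do
  {
    date -Is
    juju status -m controller | grep -A4 '^Machine'
    echo
  } >> controller-status.log
  sleep 5
done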

Tags: sts
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Please upload controller logs for investigative purposes. My understanding is that this is not always reproducible.
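
For reference, a sketch of one way to collect what is being asked for (the machine numbers follow the reproduction above, and /var/log/juju/machine-N.log is the usual agent log path; adjust if your layout differs):

mkdir -p ha-bug-logs
# Full model log from the controller's perspective.
juju debug-log -m controller --replay --no-tail > ha-bug-logs/controller-debug.log
# Per-machine agent logs from each controller machine.
for m in 0 1 2; do
  juju scp -m controller "${m}:/var/log/juju/machine-${m}.log" ha-bug-logs/
done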

Revision history for this message
Juan M. Tirado (tiradojm) wrote :

I attach the logs for the three machines.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

I have hit the same result in a different way: after a controller upgrade.

The 2 secondary agents are going up and down.

Not sure why yet.

Nothing stands out in the logs provided in #2.

Changed in juju:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

A side effect is that agents of machines in other models go down and come back up too. Uniter and leadership workers for units on those machines are restarting a lot, started-count: 1560.
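
Those start counts come from the dependency engine report. A sketch of how to pull it for one affected machine (this assumes the juju_engine_report introspection helper that Juju normally installs on its machines is available in a login shell; machine 1 is just a placeholder):

# Dump the engine report and skim the per-worker start counts and states.
juju ssh -m controller 1 -- bash -lc juju_engine_report > engine-report-1.yaml
grep -E 'start(ed)?-count|state:' engine-report-1.yaml | head -40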

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
John A Meinel (jameinel)
Changed in juju:
milestone: 2.9.31 → none
Revision history for this message
Joseph Phillips (manadart) wrote :

I believe I experienced this with the develop HEAD.

I had a controller on AWS where one node kept reporting "down" every few seconds, but there were no errors in any of the logs, and status history showed the controller unit as "idle" since deployment.

Revision history for this message
Arif Ali (arif-ali) wrote :

Hi, we've had a similar issue at a few sites now with 2.9.29. Looking at the logs, we first get multiple WARNINGs about a failed TLS handshake, like the one below, and this appears on all 3 controllers:

WARNING juju.mongo open.go:166 TLS handshake failed: EOF

After this, we get the following error several times across all the hosts

ERROR juju.rpc server.go:600 error writing response: *tls.permanentError write tcp <snip ip>:17070-><snip ip>:49096: write: connection reset by peer

This is when it starts to misbehave, with the other models starting to report various units as down, and then eventually the 2 secondary controllers start fluctuating.

Both sets of logs I have contain a similar set of entries.
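
For anyone checking whether they are seeing the same pattern, a sketch of the kind of search involved (paths are the usual ones on a 2.9 controller machine and may differ):

# Count the TLS handshake and connection-reset entries on each controller machine.
for m in 0 1 2; do
  echo "== machine ${m} =="
  juju ssh -m controller "${m}" -- \
    "sudo grep -c 'TLS handshake failed' /var/log/juju/machine-${m}.log; \
     sudo grep -c 'connection reset by peer' /var/log/juju/machine-${m}.log"
done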

Restarting the jujud-machine service on those controllers where mongodb is SECONDARY works around the problem until the next time.
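
A sketch of that workaround, assuming machine 0 is currently the mongodb primary (as in the reproduction above) and the usual jujud-machine-N service names:

# Restart the machine agents on the two secondary controllers.
for m in 1 2; do
  juju ssh -m controller "${m}" -- sudo systemctl restart "jujud-machine-${m}"
done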

tags: added: sts