Hi,
The following happened on a 2.3.8 HA controller with 3 nodes.
At the start of all this, machine 1 is the mongodb primary. It all begins with two jujud processes getting OOM-killed on machine 2 (root cause unknown):
Aug 5 10:06:23 machine-2 kernel: [13712355.881318] Out of memory: Kill process 17322 (jujud) score 552 or sacrifice child
Aug 5 10:06:23 machine-2 kernel: [13712355.888758] Out of memory: Kill process 21610 (jujud) score 552 or sacrifice child
systemd then restarts jujud-machine-2. Right after this, a new mongodb primary election takes place, presumably because the primary on machine 1 is overloaded:
Aug 5 10:06:47 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Connecting to <machine-1>:37017
Aug 5 10:06:48 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Successfully connected to <machine-1>:37017, took 75ms (2 connections now open to <machine-1>:37017)
Aug 5 10:07:38 machine-2 mongod.37017[11143]: [ReplicationExecutor] syncing from: <machine-1>:37017
Aug 5 10:07:48 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host <machine-1>:37017 because the pool meets constraints; 1 connections to that host remain open
Aug 5 10:08:13 machine-2 mongod.37017[11143]: [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election because member is not caught up enough to the most up-to-date member to call for priority takeover - must be within 2 seconds (mask 0x400)
Aug 5 10:08:14 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Ending connection to host <machine-1>:37017 due to bad connection status; 0 connections to that host remain open
Aug 5 10:08:14 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Operation timed out
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [rsBackgroundSync] Socket recv() timeout <machine-1>:37017
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [rsBackgroundSync] sync producer problem: 9001 socket exception [RECV_TIMEOUT] server [<machine-1>:37017]
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [ReplicationExecutor] could not find member to sync from
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] conducting a dry run election to see if we could be elected
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] dry election run succeeded, running for election
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] election succeeded, assuming primary role in term 652
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] transition to PRIMARY
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:24 machine-2 mongod.37017[11143]: [rsSync] transition to primary complete; database writes are now permitted
Aug 5 10:08:26 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Couldn't get a connection within the time limit
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to <machine-1>:37017 due to failed operation on a connection
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:36 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - ExceededTimeLimit: Operation timed out
Aug 5 10:08:38 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Couldn't get a connection within the time limit
Aug 5 10:08:38 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Failed to heartbeat to :27017 - HostUnreachable: short read
Aug 5 10:08:38 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Dropping all pooled connections to <machine-1>:37017 due to failed operation on a connection
Aug 5 10:08:50 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Couldn't get a connection within the time limit
Aug 5 10:08:52 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - ExceededTimeLimit: Operation timed out
Aug 5 10:08:52 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:53 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Successfully connected to <machine-1>:37017, took 705ms (1 connections now open to <machine-1>:37017)
Aug 5 10:08:53 machine-2 mongod.37017[11143]: [ReplicationExecutor] Member <machine-1>:37017 is now in state SECONDARY
Aug 5 10:08:59 machine-2 mongod.37017[11143]: [ReplicationExecutor] dropping unhealthy pooled connection to <machine-1>:37017
Aug 5 10:08:59 machine-2 mongod.37017[11143]: [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
Aug 5 10:08:59 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:09:03 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Successfully connected to <machine-1>:37017, took 4507ms (1 connections now open to <machine-1>:37017)
Aug 5 10:09:03 machine-2 mongod.37017[11143]: [ReplicationExecutor] Member <machine-1>:37017 is now in state ROLLBACK
Aug 5 10:09:10 machine-2 mongod.37017[11143]: [ReplicationExecutor] Member <machine-1>:37017 is now in state SECONDARY
On machine 1, nothing clearly explains why mongodb is overloaded; the only relevant logs are:
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] Member <machine-2>:37017 is now in state PRIMARY
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:26 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:26 machine-1 mongod.37017[23313]: message repeated 3 times: [ [ReplicationExecutor] stepping down from primary, because a new term has begun: 652]
Aug 5 10:08:28 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:28 machine-1 mongod.37017[23313]: message repeated 3 times: [ [ReplicationExecutor] stepping down from primary, because a new term has begun: 652]
Aug 5 10:08:28 machine-1 mongod.37017[23313]: [ReplicationExecutor] Member <machine-0>:37017 is now in state SECONDARY
Aug 5 10:08:30 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:30 machine-1 mongod.37017[23313]: message repeated 3 times: [ [ReplicationExecutor] stepping down from primary, because a new term has begun: 652]
Aug 5 10:08:31 machine-1 mongod.37017[23313]: [replExecDBWorker-1] transition to SECONDARY
That said, some findAndModify commands were taking ~10s to complete before this happened.
Anyway, it's around that time that the controllers on machines 0 and 2 stopped logging anything but the following:
2018-08-05 10:08:55 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:01 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:07 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:14 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:20 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:26 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:32 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
Also, the machine agents for machines 0 and 2 are in status "down":
Machine  State    DNS      Inst id  Series  AZ    Message
0        down     1.1.1.1  <uuid>   xenial  nova  ACTIVE
1        started  1.1.1.2  <uuid>   xenial  nova  ACTIVE
2        down     1.1.1.3  <uuid>   xenial  nova  ACTIVE
The goroutine dumps are available for machine 0: https://pastebin.canonical.com/p/WjBKPbwBc2/ and for machine 2: https://pastebin.canonical.com/p/V6jnQxVvZP/
Please let us know quickly if you need anything else from us. We need to restart the controllers on machines 0 and 2 to revive them and to spread the load across all three controllers.
Thanks!
The one thing google-fu turns up is that
Failed to connect to <machine-1>:37017 - HostUnreachable: short read
might occur if one machine is configured for SSL and the other is not. However, AFAIK we always run mongod with --ssl.
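One rough way to sanity-check that hypothesis, assuming --ssl means mongod expects plain TLS on the socket, is to attempt a TLS handshake against each controller's mongo port and see whether it completes. This is only a diagnostic sketch (the addresses are placeholders), not anything Juju ships:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholder addresses for the three controllers' mongod instances.
	hosts := []string{"1.1.1.1:37017", "1.1.1.2:37017", "1.1.1.3:37017"}

	for _, addr := range hosts {
		// We only care whether the server speaks TLS at all, so skip
		// certificate verification for this check.
		conn, err := tls.DialWithDialer(
			&net.Dialer{Timeout: 5 * time.Second},
			"tcp", addr,
			&tls.Config{InsecureSkipVerify: true},
		)
		if err != nil {
			// A mongod running without TLS typically fails here with
			// something like "first record does not look like a TLS handshake".
			fmt.Printf("%s: TLS handshake failed: %v\n", addr, err)
			continue
		}
		fmt.Printf("%s: TLS handshake OK\n", addr)
		conn.Close()
	}
}
```

If all three handshakes complete, the SSL-mismatch theory for the "short read" is unlikely.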
As for:
2018-08-05 10:08:55 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
I was able to track down that it occurs when we start trying to connect and something causes us to abort (that error is returned during tomb.Dying(), which means something called try.Kill()).
However, it isn't clear why we would have called try.Kill(). The connect function, dialWebsocketMulti, will call Kill whenever it returns (it does defer try.Kill()), but it should only really exit with either an error or a success. I wonder whether the 'try was stopped' means we are suppressing the actual error.
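To make that suppression hypothesis concrete, here is a minimal, self-contained Go sketch of a toy Try-like helper (hypothetical names, not the real juju/utils/parallel.Try): if Kill() wins the race against the dial attempt, the caller only ever sees the generic "try was stopped" error and the real dial error is lost.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errStopped = errors.New("try was stopped")

// try is a toy stand-in for a parallel-try helper: attempts run
// concurrently, the first result wins, and kill aborts the whole thing.
type try struct {
	once   sync.Once
	result chan error // buffered; holds the winning attempt's error
	stop   chan struct{}
}

func newTry() *try {
	return &try{result: make(chan error, 1), stop: make(chan struct{})}
}

func (t *try) start(attempt func() error) {
	go func() {
		err := attempt()
		select {
		case t.result <- err: // first attempt to finish wins
		default: // something else already won; drop our result
		}
	}()
}

func (t *try) kill() { t.once.Do(func() { close(t.stop) }) }

// wait returns the winning attempt's error, or errStopped if kill()
// happened before any attempt reported back -- this is where the real
// dial error can be masked.
func (t *try) wait() error {
	select {
	case err := <-t.result:
		return err
	case <-t.stop:
		return errStopped
	}
}

func dial(addr string) error {
	time.Sleep(50 * time.Millisecond) // pretend to dial
	return fmt.Errorf("cannot connect to %s: connection refused", addr)
}

func main() {
	t := newTry()
	t.start(func() error { return dial("1.1.1.2:17070") })

	// Something (a worker shutting down, a defer on an early return)
	// kills the try before the attempt reports its error...
	t.kill()

	// ...so the caller only ever sees the generic message.
	fmt.Println(t.wait()) // prints: try was stopped
}
```

If that is roughly what happens in dialWebsocketMulti, combining the stop error with the last attempt's error (or logging the attempt errors before Kill) would surface the underlying cause instead of the generic "try was stopped".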