VIP set as corosync node address, cluster state desynced, VIP down

Bug #1904515 reported by Trent Lloyd
Affects: OpenStack HA Cluster Charm
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I deployed OpenStack-on-OpenStack (so no network spaces) with hacluster for the MySQL percona-cluster servers.

Originally the cluster was fully in sync at both the pacemaker and mysql level, and mysql/1 was the VIP owner. Then the node hosting the VIP had a ~10s network/machine outage due to an OpenStack live migration.

mysql/0 juju-36680f-train-7
mysql/1 juju-36680f-train-19
mysql/2 juju-36680f-train-20

After that I found machines in the following problematic state:
- mysql/1 "crm status" shows all other nodes offline except itself. Both VIP and cl_mysql_monitor stopped on all nodes. Last updated time is current, but last change reported as "Tue Nov 17 05:11:25 2020 by hacluster via crmd on juju-36680f-train-19". DC reported as itself.

- mysql/0 and mysql/2 "crm status" shows all 3 nodes online, with the VIP still started on mysql/1. Last updated time is current, but last change reported as ~40 minutes newer "Tue Nov 17 05:52:05 2020 by hacluster via crmd on juju-36680f-train-7". Both nodes report DC as mysql/0 (juju-36680f-train-7)

- mysql/0 and mysql/2 continuously log the following messages:
Nov 17 06:50:39 juju-36680f-train-7 corosync[13083]: notice [TOTEM ] A new membership (10.5.0.130:5940) was formed. Members
Nov 17 06:50:39 juju-36680f-train-7 corosync[13083]: [TOTEM ] A new membership (10.5.0.130:5940) was formed. Members

- mysql/1 journalctl for corosync shows:
Nov 17 05:50:33 juju-36680f-train-19 corosync[8491]: notice [TOTEM ] A new membership (10.5.2.197:60) was formed. Members left: 1000
Nov 17 05:50:33 juju-36680f-train-19 corosync[8491]: notice [TOTEM ] Failed to receive the leave message. failed: 1000
Nov 17 05:50:33 juju-36680f-train-19 corosync[8491]: [TOTEM ] A new membership (10.5.2.197:60) was formed. Members left: 1000
Nov 17 05:50:33 juju-36680f-train-19 corosync[8491]: [TOTEM ] Failed to receive the leave message. failed: 1000
Nov 17 06:47:52 juju-36680f-train-19 corosync[8491]: statefump
Nov 17 06:47:52 juju-36680f-train-19 corosync[8491]: Writetofile
Nov 17 06:47:59 juju-36680f-train-19 corosync[8491]: statefump
Nov 17 06:47:59 juju-36680f-train-19 corosync[8491]: Writetofile

- mysql/0 and mysql/2 both have 10.5.100.0 (the VIP) as the ring0_addr in corosync.conf (see the corosync.conf sketch after this list)
- mysql/1 has its real IP 10.5.2.197 in corosync.conf
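
For illustration, assuming the VIP ended up as mysql/1's node entry (mysql/1 held the VIP), the rendered nodelist on mysql/0 and mysql/2 would look roughly like the sketch below. The nodeids and mysql/2's address are made up here; the point is only that mysql/1's ring0_addr is the VIP rather than its unit address:

    nodelist {
        node {
            ring0_addr: 10.5.0.130   # mysql/0 unit address
            nodeid: 1000
        }
        node {
            ring0_addr: 10.5.100.0   # mysql/1 entry: the VIP, not the unit address
            nodeid: 1001
        }
        node {
            ring0_addr: 10.5.2.20    # mysql/2 unit address (illustrative)
            nodeid: 1002
        }
    }

On mysql/1 itself the same entry carries its real address 10.5.2.197, so the three corosync.conf files disagree about where mysql/1 is.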

In the past we had a bug where the corosync messaging layer was hung and pacemaker had no heartbeat mechanism of its own to detect that, hence the cluster state desync. Perhaps the same has happened here? Need to find the bug link.
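
One way to confirm that kind of desync is to compare corosync's own membership view with pacemaker's on each unit; something along these lines (standard corosync 2.x / pacemaker tools, the exact cmap key can vary by version):

    # corosync view: quorum state and current member list
    sudo corosync-quorumtool -s
    sudo corosync-cmapctl | grep runtime.totem.pg.mrp.srp.members

    # ring status as corosync sees it
    sudo corosync-cfgtool -s

    # pacemaker view: node online/offline state and the DC
    sudo crm_mon -1

In the state described above, mysql/1 and the other two units would give conflicting answers.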

2 main bugs to consider:
- The VIP should not get used as the Juju address on the relation. There are a number of bugs related to this, which I suspect are largely down to not using a network space binding (see the bundle sketch below this list). This bug may not happen on a MAAS deployment, but we should check.
- corosync should not get frozen/desynced in a way that leaves the other nodes thinking everything is still OK. This may be a bug that needs a fix backported, and it could happen in other scenarios.
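
For the first point, on a provider that supports spaces, binding the endpoints to a space should keep the relation ingress address off the VIP. A minimal bundle sketch, assuming a space named "internal" and these application names (both are assumptions, and this is not applicable to the OpenStack-on-OpenStack deployment here since that provider has no spaces):

    applications:
      mysql:
        charm: cs:percona-cluster
        bindings:
          "": internal
      hacluster:
        charm: cs:hacluster
        bindings:
          "": internal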

Will potentially split into a second bug once either of these are confirmed.

Tags: seg sts
Trent Lloyd (lathiat)
tags: added: seg
Revision history for this message
Brett Milford (brettmilford) wrote :

You allude to this at the end, but is this caused by https://bugs.launchpad.net/juju/+bug/1863916 ?

Revision history for this message
Trent Lloyd (lathiat) wrote : Re: [Bug 1904515] Re: VIP set as corosync node address, cluster state desynced, VIP down

Yeah that sounds like the cause of the incorrect node address!

Still, there also seems to be a bug where Corosync/Pacemaker don't realise they are broken.

tags: added: sts
Changed in charm-hacluster:
status: New → Confirmed