Charm stuck in waiting after rejoining the cluster

Bug #1983158 reported by Vern Hart
Affects: MySQL InnoDB Cluster Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Running focal/ussuri.

I was testing availability zone failure, and the node hosting one of the mysql-innodb-cluster units went down.

After the nodes came back up, all services recovered fairly quickly except mysql-innodb-cluster/2.

It says:
  Cluster is inaccessible from this instance. Please check logs for details.

Checking the logs I see:
  RuntimeError: Dba.get_cluster: Group replication does not seem to be active in instance '10.103.223.3:3306'
10.103.223.3 is the IP of mysql-innodb-cluster/2.
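
For reference, the places these errors surface (log paths are assumptions based on a default focal deployment):
  juju debug-log --include mysql-innodb-cluster/2
  less /var/log/mysql/error.log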

I tried the juju action (from a healthy unit) to rejoin the cluster:
  juju run-action mysql-innodb-cluster/0 --wait rejoin-instance address=10.103.223.3
But that failed saying:
  The group_replication_group_name cannot be changed when Group Replication is running
The mysql logs say there may be corruption in the relay log. They also say the member was set to read-only and then left the group (well before I ran the rejoin-instance action).
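
A quick way to confirm the member state from the unit itself is to query performance_schema (a sketch, using the root password mentioned below):
  mysql -u root -p -e "SELECT member_host, member_state FROM performance_schema.replication_group_members;"
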
I suspect running the reboot-cluster-from-complete-outage action on one of the good units would probably work, but that seems like a bigger hammer than necessary.
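
For reference, that bigger hammer would be something like:
  juju run-action mysql-innodb-cluster/0 --wait reboot-cluster-from-complete-outage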

I'll try removing and re-adding:
  juju run-action mysql-innodb-cluster/0 --wait remove-instance address=10.103.223.3 force=true
  juju run-action mysql-innodb-cluster/0 --wait add-instance address=10.103.223.3
force=true was necessary because the node is marked ERROR.
These actions succeeded, but now the juju status for that unit says:
  Instance not yet configured for clustering

After connecting to mysql on the bad unit (using the password from leader-get mysql.passwd; see the connection sketch below) I executed:
  stop group_replication;
  reset replica;
Afterwards, running the add-instance action worked, and the cluster-status action showed all three nodes joined, with the new one RO, as expected.
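
For reference, the connection sketch mentioned above looked roughly like this (juju 2.x syntax, matching the actions used here):
  juju run -u mysql-innodb-cluster/leader -- leader-get mysql.passwd
  juju ssh mysql-innodb-cluster/2
  mysql -u root -p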

However, juju status still shows it's waiting with:
  Instance not yet configured for clustering

I've tried manually running hooks and restarting mysql and the juju agent on the suspect node, and the status still shows waiting.
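
For the record, "manually running hooks" means something along these lines (the update-status hook is just an example, and the invocation is an assumption for this reactive charm):
  juju run -u mysql-innodb-cluster/2 -- hooks/update-status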

Checking the logs, neither mysql nor juju shows any errors, and the unit appears to be functioning properly, so this seems to be a charm bug rather than the actual state of things.

Tags: sts
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Adding an instance back after it has been forcibly removed causes the charm state to go out of sync. The charm currently does not monitor the cluster state properly, so the following command is required to set the charm state back to what it should be after adding the instance back:

juju run -u mysql-innodb-cluster/leader -- leader-set cluster-instance-configured-192-168-0-32=True

Replace 192-168-0-32 with the IP of the instance you added back, with the dots replaced by hyphens.
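
You can inspect the flag's current value first with the matching leader-get (key naming assumed to follow the leader-set above); for the IP in this bug that would be:
  juju run -u mysql-innodb-cluster/leader -- leader-get cluster-instance-configured-10-103-223-3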

tags: added: sts
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (master)
Alex Kavanagh (ajkavanagh) wrote :

What needs to be resolved is how to fix the charm's state when it indicates 'not yet configured' after out-of-band mysql actions (i.e. not charm actions) were used to resolve clustering issues. What needs to happen is for the charm to 'discover' that the underlying mysql cluster is healthy and that the unit is configured, and to reflect that, rather than relying on previously cached information (flags, state on relations, etc.). The charm should always try to resolve its state from the environment rather than holding on to cached information.
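
For illustration, the ground truth the charm could consult is already available from MySQL Shell on any unit; a minimal sketch, assuming the root credentials from the mysql.passwd leader setting:
  mysqlsh --js root@localhost -e 'print(dba.getCluster().status())'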

Changed in charm-mysql-innodb-cluster:
importance: Undecided → Medium
status: New → Triaged
Alex Kavanagh (ajkavanagh) wrote :

Note that this is almost certainly related to: https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/2013078, and they are likely to be fixed at the same time.

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Alex. As I commented in the other bug you linked above, I strongly disagree that they are related, and it is very unlikely they will be fixed at the same time by addressing a single root cause.

Regarding detecting the cluster status and setting the flags appropriately: we discussed that in the past and agreed not to do it at this time. That is why, in my patch above, I merely error out so that intervention steps can be pointed out to the user, as a first step consisting of a UX improvement for the resolution of the problem.

But I agree that the ideal, though more complex, solution would be to detect the cluster status and set the flags appropriately.

Alex Kavanagh (ajkavanagh) wrote :

Quoting from the above bug:

---

I'll try removing and re-adding:
  juju run-action mysql-innodb-cluster/0 --wait remove-instance address=10.103.223.3 force=true
  juju run-action mysql-innodb-cluster/0 --wait add-instance address=10.103.223.3
force=true was necessary because the node is marked ERROR.
These actions succeeded, but now the juju status for that unit says:
  Instance not yet configured for clustering

After connecting to mysql in the bad unit (using the pw from leader-get mysql.passwd) I executed:
  stop group_replication;
  reset replica;
Afterwards, running the add-instance action worked, and the cluster-status action showed all three nodes joined, with the new one RO, as expected.

However, juju status still shows it's waiting with:
  Instance not yet configured for clustering

---

i.e. it's related in that an instance was removed and re-added. I'm just signposting so that people who read these bugs can see that other bugs also exist around adding and removing instances.
