In test run https://solutions.qa.canonical.com/v2/testruns/46610432-1ec4-4e06-b43b-3b73e000d31a, we are upgrading yoga-focal to yoga-jammy. The upgrade of the mysql-innodb-cluster charms is successful, but the cluster afterwards stays in the blocked state with the message: "Cluster is inaccessible from this instance. Please check logs for details."
In the logs we see:
--------------
2023-03-14 13:14:11 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
File "<string>", line 2, in <module>
mysqlsh.Error: Shell Error (51314): Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)
--------------
and
--------------
2023-03-14T09:44:18.161421Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.71:33061 when joining a group. My local port is: 33061.'
2023-03-14T09:44:18.161556Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.152:33061 when joining a group. My local port is: 33061.'
2023-03-14T09:44:18.161623Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.71:33061 when joining a group. My local port is: 33061.'
2023-03-14T09:44:18.161632Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 33061'
2023-03-14T09:44:18.239838Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33061'
--------------
More details can be found in the crashdump: https://oil-jenkins.canonical.com/artifacts/46610432-1ec4-4e06-b43b-3b73e000d31a/generated/generated/openstack/juju-crashdump-openstack-2023-03-14-13.10.13.tar.gz
mysql8 clustering is a bit sensitive to networking and delays in responses from other nodes. My guess is that everything just took too long and the clustering code (in mysql8, not the charm) just gave up. In that situation it can be quite tricky to recover: essentially you have to pick a node, force it to be the lead of the cluster, and then force the other two back into the cluster.
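For reference, that recovery is normally driven through mysql-shell on the chosen node (the charm also ships actions that wrap parts of this, if I recall correctly). A very rough sketch in mysqlsh's Python mode follows; the cluster admin user, password and addresses are placeholders (the two addresses are just the ones visible in the log above), so adjust for the actual deployment:
--------------
# Rough sketch only - paste into "mysqlsh --py" on the node chosen as the new lead.
# 'clusteradmin'/'changeme' are placeholders; use the credentials the charm created.

shell.connect('clusteradmin@10.246.168.71:3306', 'changeme')

# Re-bootstrap the cluster from this instance's metadata; the connected
# instance becomes the (initially single-member) primary again.
cluster = dba.reboot_cluster_from_complete_outage()

# Force the remaining units back in, one at a time (repeat for each unit
# that did not rejoin automatically).
cluster.rejoin_instance('clusteradmin@10.246.168.152:3306')

# Confirm all members are back ONLINE before unpausing anything else.
print(cluster.status())
--------------
Depending on the mysql-shell version, reboot_cluster_from_complete_outage() may rejoin reachable members itself, in which case the explicit rejoin_instance() calls are only needed for stragglers.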
The logs are full of "Error on opening a connection to ..." repeated over and over, indicating that the other node(s) are simply "not there".
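If this recurs, a quick way to confirm that is to check from each unit whether the group-replication port on its peers is actually reachable at the time. Something like the following (plain Python; the addresses are just the ones from the log above) would do:
--------------
import socket

# Group Replication's communication port, as seen in the log lines above.
GR_PORT = 33061
PEERS = ['10.246.168.71', '10.246.168.152']

for peer in PEERS:
    try:
        # Short timeout: if a peer is this slow to answer, GR will also
        # have given up on it.
        with socket.create_connection((peer, GR_PORT), timeout=3):
            print(f'{peer}:{GR_PORT} reachable')
    except OSError as exc:
        print(f'{peer}:{GR_PORT} NOT reachable: {exc}')
--------------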
Was this being tested on a resource-constrained system?
This is essentially very similar to https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1917332, for example, except that in that case the lead node was still running.
If all the units were upgraded at the same time (series-upgrade) with no settling between them, then this bug is very similar to https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1907202.
Otherwise it may be something else - mysql8 is tricky!
If you could please provide some more context on how the units are being series-upgraded, that would be great. Thanks.