Thanks for the investigation. I'm afraid the test run Jeffrey pointed out is a bit of a red herring: I intervened manually while it was running, which polluted its crashdumps.
I took a look at https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a, which has the following state:
=====
mysql-innodb-cluster/0 active idle 3/lxd/1 10.246.166.115 Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
logrotated/10 active idle 10.246.166.115 Unit is ready.
mysql-innodb-cluster/1* active idle 4/lxd/1 10.246.167.161 Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
logrotated/11 active idle 10.246.167.161 Unit is ready.
mysql-innodb-cluster/2 blocked idle 5/lxd/2 10.246.166.105 Cluster is inaccessible from this instance. Please check logs for details.
logrotated/12 active idle 10.246.166.105 Unit is ready.
=====
Units 0 and 1 clustered successfully, but unit 2 did not join for some reason, even though the crashdump shows the juju relation between units 2 and 1 was clearly established.
The following message comes up a lot in the logs:
=====
2023-06-24 15:33:04 ERROR unit.mysql-innodb-cluster/2.juju-log server.go:325 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: This function is not available through a session to an instance belonging to an unmanaged replication group
=====
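For anyone triaging similar runs, here is a throwaway sketch (not part of the charm or our tooling) of how the blocked unit and its address can be pulled out of a juju status excerpt like the one above, assuming the whitespace-separated column layout shown:

```python
# Illustrative only: parse a juju status excerpt and flag units that are
# not in the "active" workload state. Column layout assumed:
# unit  workload-state  agent-state  machine  address  [message...]
STATUS = """\
mysql-innodb-cluster/0 active idle 3/lxd/1 10.246.166.115
mysql-innodb-cluster/1* active idle 4/lxd/1 10.246.167.161
mysql-innodb-cluster/2 blocked idle 5/lxd/2 10.246.166.105
"""

def blocked_units(status_text):
    """Return (unit, address) pairs for units not in the 'active' state."""
    result = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[1] != "active":
            result.append((fields[0], fields[4]))
    return result

print(blocked_units(STATUS))
# → [('mysql-innodb-cluster/2', '10.246.166.105')]
```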
I don't see anything suspicious otherwise. There are some messages about a leader change; is it possible this issue is caused by unfortunate timing of a leader change?
The crashdumps can be downloaded here: https://oil-jenkins.canonical.com/artifacts/b49db99c-959f-417e-beca-cf4a2521709a/generated/generated/kubernetes-maas/juju-crashdump-kubernetes-maas-2023-06-24-19.11.12.tar.gz