Comment 7 for bug 1926449

Alex Kavanagh (ajkavanagh) wrote:

I spent a while looking at the logs, and as far as I can tell, this failure is due to a complete outage of the mysql cluster (the outage is part of the test). The action to restore the cluster is to run "reboot-cluster-from-complete-outage" on the unit that holds the GTID superset of transactions.

According to the guide [1], this action can be run on any node; either the cluster reboots, or the action tells you which node to pick instead. From the logs (from the crashdump):

ag reboot-cluster-from-complete-outage
0/lxd/8/juju-show-status-log/mysql-innodb-cluster_0
19:18 Jul 2023 15:44:12Z juju-unit executing running action reboot-cluster-from-complete-outage

0/lxd/8/var/log/juju/unit-mysql-innodb-cluster-0.log
76723:2023-07-18 15:44:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action reboot-cluster-from-complete-outage
76724:2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:380 running action "reboot-cluster-from-complete-outage" on 1
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server {unix @/var/lib/juju/agents/unit-mysql-innodb-cluster-0/agent.socket <nil>}
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Loading state
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Leader handling coordinator requests
2023-07-18 15:44:14 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Failed rebooting from complete outage: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Restoring the default cluster from complete outage...

Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The active session instance (10.246.170.29:3306) isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. Please use the most up to date instance: '10.246.168.165:3306'.

i.e. the node the test ran the action on wasn't the most up to date one, so the action failed and replied with the IP address of the correct unit. However, the logs don't show the action then being run on any other unit.

I suspect that the test needs to be updated to detect this condition and then re-run the action on the unit the error message names.
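For illustration, the retry could look something like the sketch below. This is a minimal sketch, not the actual test code: it assumes the juju 2.9 CLI syntax ("juju run-action --wait"), that the failure text always matches the pattern in the traceback above, and that unit addresses appear under the "public-address" key in "juju status" JSON output.

# Hypothetical sketch: run the reboot action, and if the action names a
# more up-to-date instance, re-run it on the unit that owns that address.
# Assumes juju 2.9 CLI syntax and juju status JSON layout; both may
# differ on other juju versions.
import json
import re
import subprocess

APP = "mysql-innodb-cluster"
ACTION = "reboot-cluster-from-complete-outage"

def run_action(unit):
    # Run the action synchronously and return its combined output.
    proc = subprocess.run(
        ["juju", "run-action", "--wait", unit, ACTION],
        capture_output=True, text=True)
    return proc.stdout + proc.stderr

def unit_for_ip(ip):
    # Map an instance IP back to a unit name via 'juju status'.
    status = json.loads(subprocess.run(
        ["juju", "status", APP, "--format=json"],
        capture_output=True, text=True).stdout)
    for name, unit in status["applications"][APP]["units"].items():
        if unit.get("public-address") == ip:  # key name is an assumption
            return name
    raise LookupError("no %s unit has address %s" % (APP, ip))

def reboot_cluster_from_complete_outage():
    out = run_action(APP + "/0")
    # The error quotes the most up to date instance, e.g.
    # "Please use the most up to date instance: '10.246.168.165:3306'."
    match = re.search(r"up to date instance: '([\d.]+):\d+'", out)
    if match:
        out = run_action(unit_for_ip(match.group(1)))
    return out

The key point is simply that the test should parse the action output for the suggested instance and re-run the action there, rather than treating the first failure as terminal.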

I'm setting this to invalid; however, if further information indicates that the above assessment isn't correct then please re-open the bug.

Note: it could be argued that the charms should then chat amongst themselves and run the action on the unit that has the GTID superset; that would be a feature request, as the charms don't currently provide that feature. If that feature is desired then please open a NEW bug with the feature request.

[1] https://docs.openstack.org/charm-guide/latest/admin/managing-power-events.html#mysql-innodb-cluster