Deployment blocked with "Cluster is inaccessible from this instance"

Bug #1926449 reported by Marian Gasparovic
30
This bug affects 7 people
Affects Status Importance Assigned to Milestone
MySQL InnoDB Cluster Charm
New
Undecided
Unassigned

Bug Description

All three units are blocked with a message "Cluster is inaccessible from this instance. Please check logs for details."

Logs show
RuntimeError: Unable to find a primary member in the cluster
and later
RuntimeError: Dba.get_cluster: Group replication does not seem to be active in instance '192.168.33.54:3306'

Ussuri-focal setup, mysql-innodb-cluster v74

This is the first time we encountered it in our tests

Crashdump https://oil-jenkins.canonical.com/artifacts/7414391a-68b5-448d-935d-67f778da5d5e/generated/generated/openstack/juju-crashdump-openstack-2021-04-28-13.42.07.tar.gz

Other artifacts
https://oil-jenkins.canonical.com/artifacts/7414391a-68b5-448d-935d-67f778da5d5e/index.html

Revision history for this message
David Ames (thedac) wrote :
Download full text (14.7 KiB)

Initial log parsing

The cluster comes up fine before vault sends out certificates. Once the cluster receives certificates the cluster goes into a rolling restart as expected.

However, the first node to restart reports problems immediately with "Failed to stop the group replication applier thread" during shutdown

Here is a sorted (interleaved grep from the mysql error logs:
Formatted version: https://pastebin.ubuntu.com/p/xf572df7vD/

mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:44.880786Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-0ubuntu0.20.04.1).:
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:45.073488Z 11 [ERROR] [MY-011535] [Repl] Plugin group_replication reported: 'Failed to stop the group replication applier thread.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:20:48.510785Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.54
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:20:48.515514Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:20:48.526308Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.54
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:20:48.527036Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:50.688199Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:50.695954Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:56.752996Z 0 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:58.223924Z 10 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.::
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:21:00.623164Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:21:00.626408Z 2 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.168.33.52:3306.'
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:21:00.630934Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.1...

Revision history for this message
David Ames (thedac) wrote :

Confirmed the leader restarting last is a non-issue by manually testing.

The root cause appears to be the failure during shutdown: "Failed to stop the group replication applier thread."

Unfortunately, the mysql documentation is sparse beyond acknowledging this is an error [0] that exists.

[0] https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html#error_er_grp_rpl_applier_thd_stop_error

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Setting to incomplete as it's been over 2 years since the bug was reported. It's possible that the bug has been fixed in more recent versions of the charm. This bug will time out in about 2 months.

Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
Revision history for this message
Moises Emilio Benzan Mora (moisesbenzan) wrote :

Seen once more this week in the following two runs:
- https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a
- https://solutions.qa.canonical.com/testruns/5f64870e-6428-4701-b97c-5f13044b7bcb

Artifacts for those can be found at the bottom of the page on each of those runs.

Pd: Over the last year we have seen this bug numerous times, see https://solutions.qa.canonical.com/bugs/1926449, for all occurrences and artifacts.

Changed in charm-mysql-innodb-cluster:
status: Incomplete → New
Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Please provide the juju log files, and mysql/error.log file, from an instance of the the mysql-innodb-cluster where this has happened, and not just a link to the sqa test run. Although the links are useful, it's much more useful to have the logs actually in the bug for future reference. Thanks.

Revision history for this message
Jeffrey Chang (modern911) wrote :
Download full text (4.3 KiB)

This is from latest occurrence, https://oil-jenkins.canonical.com/artifacts/785b0abe-24e7-4075-a046-70bc44ecfa39/generated/generated/openstack/juju-crashdump-openstack-2023-07-18-16.25.40.tar.gz

From 9ffd5995-11e6-443d-b2c5-c6e1235e7de9/0/lxd/8/var/log/juju/unit-mysql-innodb-cluster-0.log

We saw many occurrences of this generic error.
2023-07-18 15:35:22 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
mysqlsh.Error: Shell Error (51314): Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)

This is kind new to me.
2023-07-18 15:44:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action reboot-cluster-from-complete-outage
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:380 running action "reboot-cluster-from-complete-outage" on 1
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server {unix @/var/lib/juju/agents/unit-mysql-innodb-cluster-0/agent.socket <nil>}
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Loading state
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Leader handling coordinator requests
2023-07-18 15:44:14 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Failed rebooting from complete outage: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Restoring the default cluster from complete outage...

Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The active session instance (10.246.170.29:3306) isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. Please use the most up to date instance: '10.246.168.165:3306'.

From 9ffd5995-11e6-443d-b2c5-c6e1235e7de9/0/lxd/8/var/log/mysql/error.log
2023-07-18T13:24:05.812325Z 14 [ERROR] [MY-011685] [Repl] Plugin group_replication reported: 'The group_replication_group_name option is mandatory'
2023-07-18T13:24:05.812350Z 14 [ERROR] [MY-011660] [Repl] Plugin group_replication reported: 'Unable to start Group Replication on boot'

2023-07-18T15:21:42.644552Z 539422 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 539422'
2023-07-18T15:21:42.644562Z 274917 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 274917'
2023-07-18T15:21:42.644569Z 539422 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2023-07-18T15:21:42.644582Z 274917 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2023-07-18T15:21:42.647084Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'

202...

Read more...

Changed in charm-mysql-innodb-cluster:
status: Incomplete → New
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I spent a while looking at the logs, and as far as I can tell, this is due to a complete outage of the mysql cluster (which is part of the test). The action to restore the cluster is to run "reboot-cluster-from-complete-outage" on the unit that has the GTID superset of transactions.

According to the guide [1] this action is run on any node, and then either the cluster reboots OR it tells you which node to pick. From the logs (from the crashdump):

ag reboot-cluster-from-complete-outage
0/lxd/8/juju-show-status-log/mysql-innodb-cluster_0
19:18 Jul 2023 15:44:12Z juju-unit executing running action reboot-cluster-from-complete-outage

0/lxd/8/var/log/juju/unit-mysql-innodb-cluster-0.log
76723:2023-07-18 15:44:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action reboot-cluster-from-complete-outage
76724:2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:380 running action "reboot-cluster-from-complete-outage" on 1
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server {unix @/var/lib/juju/agents/unit-mysql-innodb-cluster-0/agent.socket <nil>}
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Loading state
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Leader handling coordinator requests
2023-07-18 15:44:14 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Failed rebooting from complete outage: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Restoring the default cluster from complete outage...

Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The active session instance (10.246.170.29:3306) isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. Please use the most up to d
ate instance: '10.246.168.165:3306'.

e.g. the node the test ran the action on wasn't the latest and thus replied with the correct unit IP address. However, the logs then don't show any other action on the other units.

I suspect that the test needs to be updated to detect this condition and then select the correct unit.

I'm setting this to invalid; however, if further information indicates that the above assessment isn't correct then please re-open the bug.

Note: it could be argued that the charms should then chat amongst themselves and run the action on the corresponding unit that has the GTID; that would be a feature request as currently the charms don't provide that feature. If that feature is desired then please open a NEW bug with the feature request.

[1] https://docs.openstack.org/charm-guide/latest/admin/managing-power-events.html#mysql-innodb-cluster

Changed in charm-mysql-innodb-cluster:
status: New → Invalid
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

Thanks for the investigation. I'm afraid that the test run Jeffrey pointed out is a bit of a red herring, it had some manual intervention from me while it was running which has polluted the crashdumps.

I took a look at https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a, which has the following state:
=====
mysql-innodb-cluster/0 active idle 3/lxd/1 10.246.166.115 Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
  logrotated/10 active idle 10.246.166.115 Unit is ready.
mysql-innodb-cluster/1* active idle 4/lxd/1 10.246.167.161 Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
  logrotated/11 active idle 10.246.167.161 Unit is ready.
mysql-innodb-cluster/2 blocked idle 5/lxd/2 10.246.166.105 Cluster is inaccessible from this instance. Please check logs for details.
  logrotated/12 active idle 10.246.166.105 Unit is ready.
=====

The crashdumps can be downloaded here: https://oil-jenkins.canonical.com/artifacts/b49db99c-959f-417e-beca-cf4a2521709a/generated/generated/kubernetes-maas/juju-crashdump-kubernetes-maas-2023-06-24-19.11.12.tar.gz

Units 0 and 1 are clustered, but 2 did not join for some reason. In the crashdump, the juju relation between units 2 and 1 was clearly enabled.

The following message comes up a lot in the logs:
=====
2023-06-24 15:33:04 ERROR unit.mysql-innodb-cluster/2.juju-log server.go:325 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: This function is not available through a session to an instance belonging to an unmanaged replication group
=====

I don't see anything suspicious otherwise. There are some messages about a leader change, is it possible that this issue is due to an unfortunate timing of a leader change?

Changed in charm-mysql-innodb-cluster:
status: Invalid → New
Revision history for this message
Jeffrey Chang (modern911) wrote :

Please ignore comment #6, there's some human intervention to push through the test.

Checked other runs below and feel it's better to track in LP#1998356.
https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a
https://solutions.qa.canonical.com/testruns/5f64870e-6428-4701-b97c-5f13044b7bcb

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

> The following message comes up a lot in the logs:
=====
2023-06-24 15:33:04 ERROR unit.mysql-innodb-cluster/2.juju-log server.go:325 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: This function is not available through a session to an instance belonging to an unmanaged replication group
=====

Yes, that's an unhandled issue on the 8.0/stable branch; there's a fix in progress on the master branch that will get backported.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

> Checked other runs below and feel it's better to track in LP#1998356.
> https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a
> https://solutions.qa.canonical.com/testruns/5f64870e-6428-4701-b97c-5f13044b7bcb

There don't appear to be any mysql logs/unit logs in the dumps; or I'm looking in the wrong place. Please could you link at least to the mysql/error.log and mysql unit logs for the failed runs.

Revision history for this message
Dagmawi Biru (dagbiru) wrote :

- MySQL 8.0.33, yoga/stable

I'm also seeing this issue in my environment, being triggered soon after the initial setup of vault (unsealing + root ca generation):
---
Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: Unable to find a primary member in the cluster
---

and

---
unit-mysql-innodb-cluster-2: 20:31:05 ERROR unit.mysql-innodb-cluster/2.juju-log Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: Group replication does not seem to be active in instance '172.16.66.165:3306'
---

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.