During scale-out of cluster (zaza-openstack-tests) the leader fails to join in the new instance when related to prometheus
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL InnoDB Cluster Charm | Fix Committed | High | Alex Kavanagh |
Bug Description
This appears to be an intermittent bug, and may be related to the other scale-out bugs. I think it may be due to the prometheus code, but I'm not entirely sure yet.
Juju status:
(Juju status output lost in extraction; it listed the mysql-innodb-cluster units. Ignore unit /4 - it is added during test_802 AFTER test_801 has failed.)
--
When adding a unit to a cluster that was reduced to 2 units, test_801_add_unit fails {1}. This is due to the following error message in the mysql-innodb-cluster unit log:
2023-04-03 16:06:03 ERROR unit.mysql-
WARNING: A GTID set check of the MySQL instance at '172.16.0.213:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.
172.16.0.213:3306 has the following errant GTIDs that do not exist in the cluster:
04f79476-
WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 172.16.0.213:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Clone based recovery selected through the recoveryMethod option
--
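For reference, the errant set the shell is complaining about can be computed by hand: GTID_SUBTRACT of the joining unit's gtid_executed against a cluster member's. A minimal sketch, assuming the mysql-connector-python driver; credentials and the member address are placeholders (172.16.0.213 is the failing joiner from the log above):

```python
# Sketch: compute the "errant" GTIDs by hand. Hosts and credentials
# are placeholders; the member IP is hypothetical.
import mysql.connector

def gtid_executed(host):
    conn = mysql.connector.connect(host=host, user="clusteruser",
                                   password="***")
    try:
        cur = conn.cursor()
        cur.execute("SELECT @@GLOBAL.gtid_executed")
        return cur.fetchone()[0]
    finally:
        conn.close()

joiner_set = gtid_executed("172.16.0.213")   # unit that fails to join
member_set = gtid_executed("172.16.0.100")   # any healthy cluster member

conn = mysql.connector.connect(host="172.16.0.100", user="clusteruser",
                               password="***")
cur = conn.cursor()
# GTID_SUBTRACT returns transactions present in the first set but not
# the second; anything non-empty here is what mysqlsh calls "errant".
cur.execute("SELECT GTID_SUBTRACT(%s, %s)", (joiner_set, member_set))
print(cur.fetchone()[0])
conn.close()
```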
i.e. it looks like some local activity has taken place on the new unit (172.16.0.213). I *think* it may be the prometheus hook, but I'm not entirely sure (from the mysql-innodb-cluster unit log):
tracer: ++ queue handler reactive/
tracer: ++ queue handler reactive/
tracer: ++ queue handler reactive/
2023-04-03 16:04:48-16:04:49 [eleven unit log lines (INFO/WARNING/DEBUG from unit.mysql-innodb-cluster) truncated in extraction]
tracer: set flag local.prom-
tracer: -- dequeue handler reactive/
2023-04-03 16:04:49 INFO unit.mysql-
--
I think that if this is done with group replication 'on', then these statements are written to the binlog, and the unit won't cluster at that point as it is effectively split-brained.
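If that's the mechanism, one common way to keep purely local bookkeeping writes out of the GTID stream is to turn off binary logging for just the session that performs them. A sketch with placeholder names and credentials, not necessarily what the charm does:

```python
# Sketch: suppress binary logging for one session so a purely local
# write (e.g. creating a monitoring user) does not generate a GTID
# that the cluster later rejects as errant. Names and credentials are
# placeholders, not the charm's actual code.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="***")
cur = conn.cursor()
cur.execute("SET SESSION sql_log_bin = 0")   # affects this session only
cur.execute("CREATE USER IF NOT EXISTS 'prom_exporter'@'localhost' "
            "IDENTIFIED BY 'secret'")
cur.execute("GRANT PROCESS, REPLICATION CLIENT ON *.* "
            "TO 'prom_exporter'@'localhost'")
cur.execute("SET SESSION sql_log_bin = 1")
conn.close()
```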
From the error.log for /3:
2023-04-03 [four error.log lines truncated in extraction; they included the recovery setting "master_log_pos= 4"]
--
I think the telltale is "master_log_pos= 4", which implies that something has been written to the master log, which will then have some entries and thus the unit won't be clustered in (at least, I think that's what is happening).
So the culprit may be the "create_…" handler (name truncated above).
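If the culprit is indeed a reactive handler firing before the unit has joined, one option is to gate it on a "clustered" flag so the write replicates instead of landing in the local binlog first. A hedged sketch in charms.reactive style, with illustrative flag and handler names (not the charm's actual ones):

```python
# Sketch of a reactive guard: defer any handler that writes to the
# local database until the unit has actually joined the cluster, so
# the write replicates rather than becoming an errant local GTID.
# Flag and handler names are illustrative.
from charms.reactive import set_flag, when, when_not


@when("local.cluster.joined")              # hypothetical "clustered" flag
@when_not("local.prometheus.user-created")
def create_prometheus_exporter_user():
    # ... create the exporter user here (it now replicates) ...
    set_flag("local.prometheus.user-created")
```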
--
{1} job output from zaza:
2023-04-03 15:57:40.721762 | focal-medium | 2023-04-03 15:57:40 [INFO] test_801_add_unit (zaza.openstack
2023-04-03 15:57:40.721791 | focal-medium | 2023-04-03 15:57:40 [INFO] Add mysql-innodb-cluster unit
2023-04-03 15:57:40.721805 | focal-medium | 2023-04-03 15:57:40 [INFO] ...
2023-04-03 15:57:40.721816 | focal-medium | 2023-04-03 15:57:40 [INFO] Wait till model is idle ...
2023-04-03 15:57:41.225888 | focal-medium | 2023-04-03 15:57:41 [INFO] Adding unit after removed unit ...
2023-04-03 15:57:41.637882 | focal-medium | 2023-04-03 15:57:41 [INFO] Wait until 3 units ...
2023-04-03 15:57:41.761492 | focal-medium | 2023-04-03 15:57:41 [INFO] Wait for application states ...
2023-04-03 15:57:41.762321 | focal-medium | 2023-04-03 15:57:41 [INFO] Waiting for application states to reach targeted states.
2023-04-03 15:57:41.763875 | focal-medium | 2023-04-03 15:57:41 [INFO] Waiting for an application to be present
2023-04-03 15:57:41.764417 | focal-medium | 2023-04-03 15:57:41 [INFO] Now checking workload status and status messages
2023-04-03 15:57:42.270052 | focal-medium | 2023-04-03 15:57:42 [INFO] Application prometheus2 is ready.
2023-04-03 15:57:42.276635 | focal-medium | 2023-04-03 15:57:42 [INFO] Application keystone is ready.
2023-04-03 15:57:42.283250 | focal-medium | 2023-04-03 15:57:42 [INFO] Application keystone-mysql-router is ready.
2023-04-03 15:57:42.294215 | focal-medium | 2023-04-03 15:57:42 [INFO] Application vault is ready.
2023-04-03 15:57:42.297183 | focal-medium | 2023-04-03 15:57:42 [INFO] Application vault-mysql-router is ready.
2023-04-03 16:07:41.863754 | focal-medium | 2023-04-03 16:07:41 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:17:42.029986 | focal-medium | 2023-04-03 16:17:42 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:27:42.037384 | focal-medium | 2023-04-03 16:27:42 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:37:42.335575 | focal-medium | 2023-04-03 16:37:42 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:42:42.777747 | focal-medium | 2023-04-03 16:42:42 [INFO] TIMEOUT: Workloads didn't reach acceptable status:
2023-04-03 16:42:42.778121 | focal-medium | 2023-04-03 16:42:42 [INFO] Timed out waiting for 'mysql-innodb-cluster'
2023-04-03 16:42:42.778160 | focal-medium | 2023-04-03 16:42:42 [INFO] Timed out waiting for 'mysql-innodb-cluster'
2023-04-03 16:42:42.779682 | focal-medium | 2023-04-03 16:42:42 [INFO] ERROR
summary:
- During scale-out of cluster (zaza-openstack-tests) the leader fails to join in the new instance
+ During scale-out of cluster (zaza-openstack-tests) the leader fails to join in the new instance when related to prometheus
Changed in charm-mysql-innodb-cluster:
assignee: nobody → Alex Kavanagh (ajkavanagh)
importance: Undecided → High
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/879541