OpenStack Percona Cluster Charm

Bug #1514472
Comment 2 for bug 1514472

Mario Splivalo (mariosplivalo) wrote on 2017-10-05:

Hi, James.

I'm sorry to resurrect this one, but the issue remains on xenial too.

So, I deployed a two-node percona-cluster. After the deployment settled down, I removed one of the units. That, indeed, left the remaining unit in a non-operational state:

mysql> show status like 'wsrep_local_state_comment';
+---------------------------+-------------+
| Variable_name             | Value       |
+---------------------------+-------------+
| wsrep_local_state_comment | Initialized |
+---------------------------+-------------+
1 row in set (0.00 sec)

mysql> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 1     |
+--------------------+-------+
1 row in set (0.00 sec)

mysql> select 1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
mysql>

The proper 'wsrep_local_state_comment' should be 'Synced'. As can be seen, the remaining percona unit is non-operational, as you cannot query the user data.
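The check above can be automated. The following is only a minimal sketch, assuming the status output has been captured as text from the mysql client in the table format shown above; the function names `parse_status_table` and `is_operational` are made up for illustration and are not part of the charm.

```python
def parse_status_table(text):
    """Turn the two-column ASCII table printed by the mysql client into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Only data rows start with '|'; border rows start with '+'.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and anything that is not a name/value pair.
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue
        values[cells[0]] = cells[1]
    return values


def is_operational(status):
    """A node is treated as usable only when its local state is 'Synced'."""
    return status.get("wsrep_local_state_comment") == "Synced"
```

Fed the first table above, `is_operational` returns False, because the state is 'Initialized' rather than 'Synced'.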

With percona-5.6 it is easier to move the node back to the operational state:

mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
Query OK, 0 rows affected (0.00 sec)

mysql> select 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> show status like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.00 sec)

mysql>

Again, this happened because when juju removed the unit it didn't issue a 'controlled shutdown' on the unit that was leaving the relation - it merely removed the unit (shutting down the machine). Because the departing unit was not cleanly shut down, it could not notify the remaining unit of its state, so the remaining unit has no idea what happened - from the remaining unit's perspective a network partition could have happened. To protect data integrity, the remaining unit switched into 'will not serve any data' mode.
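To illustrate why a clean shutdown makes the difference, here is a simplified model of the majority rule. It is a sketch only: it assumes all nodes carry equal weight and ignores the weighted-quorum options Galera provides. The function names and the returned labels are invented for this example.

```python
def has_quorum(previous_members, reachable_members):
    """Simplified rule: strictly more than half of the last known
    membership must still be reachable."""
    return 2 * reachable_members > previous_members


def remaining_side_state(cluster_size, graceful):
    """State of the surviving side after one node goes away.

    graceful=True  -> the leaving node announced its departure, so the
                      membership shrinks before quorum is evaluated.
    graceful=False -> the node just vanished, which looks the same as a
                      network partition to the survivors.
    """
    reachable = cluster_size - 1
    previous = reachable if graceful else cluster_size
    return "serving" if has_quorum(previous, reachable) else "not serving"
```

With two nodes, an unannounced loss leaves one reachable node out of two known members, which is not a majority, so the survivor stops serving. After a clean shutdown the survivor is one out of one and carries on.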

However, the workaround for this is quite simple - before removing the unit, the operator should ssh into the unit that is to be removed and simply stop the mysqld service there. Once mysqld has shut down cleanly, juju can be used to remove the unit.
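The order of operations in the workaround could be scripted roughly as below. This is a sketch under assumptions: the unit name `percona-cluster/1` is hypothetical, and the service name `mysql` is a guess that may differ on your deployment (check what the mysqld service is actually called on the unit). The helper only builds the two command lines; it does not run them.

```python
def removal_commands(unit, service="mysql"):
    """Return the commands for a clean removal, in the order they must run.

    'service' is an assumed name for the mysqld service on the unit.
    """
    # 1. Stop mysqld on the departing unit so it leaves the cluster cleanly.
    stop = ["juju", "ssh", unit, "sudo service {} stop".format(service)]
    # 2. Only then ask juju to remove the unit.
    remove = ["juju", "remove-unit", unit]
    return [stop, remove]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, so that the removal is skipped if stopping the service fails.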