During full power outage, bootstrap-pxc action succeeded, but bootstrapped node was non-primary
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Percona Cluster Charm | New | Undecided | Unassigned |
Bug Description
During an intentional datacenter power failure test on a bionic-ussuri cloud running the 20.08 charms, the percona-cluster bootstrap-pxc action succeeded in starting mysql on a unit, but the cluster did not recover as expected.
When my servers came online, this was the juju status:
mysql/0 blocked MySQL is down. Sequence Number: 23159085. Safe To Bootstrap: 0
mysql/1* blocked MySQL is down. Sequence Number: 23159085. Safe To Bootstrap: 0
mysql/2 blocked MySQL is down. Sequence Number: 23158268. Safe To Bootstrap: 0
As you can see, mysql/2 went down slightly before mysql/0 and mysql/1, as its power was not redundant across two PDUs the way it was for the other two units. This may have complicated matters, as I'm not able to reproduce the issue on a cluster where all units power off at the same sequence number.
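For reference, the sequence-number comparison above can be sketched in a few lines of Python. This is only an illustration: the regular expression is modelled on the three status lines in this report, and `pick_bootstrap_candidates` is a hypothetical helper, not part of the charm. It relies on the usual Galera guidance that a unit with the highest sequence number is the one to bootstrap from.

```python
import re

# Pattern modelled on the status messages quoted in this report; the exact
# wording may differ between charm versions (assumption).
STATUS_RE = re.compile(
    r"^(?P<unit>\S+?)\*?\s+\S+\s+MySQL is down\. "
    r"Sequence Number: (?P<seqno>-?\d+)\. Safe To Bootstrap: (?P<safe>\d)"
)


def pick_bootstrap_candidates(status_lines):
    """Return (highest_seqno, sorted units at that seqno).

    Hypothetical helper: units with the highest sequence number hold the
    most recent committed state and are the usual bootstrap candidates.
    """
    seqnos = {}
    for line in status_lines:
        match = STATUS_RE.match(line.strip())
        if match:
            seqnos[match.group("unit")] = int(match.group("seqno"))
    if not seqnos:
        raise ValueError("no 'MySQL is down' status lines found")
    highest = max(seqnos.values())
    return highest, sorted(u for u, s in seqnos.items() if s == highest)


status = [
    "mysql/0 blocked MySQL is down. Sequence Number: 23159085. Safe To Bootstrap: 0",
    "mysql/1* blocked MySQL is down. Sequence Number: 23159085. Safe To Bootstrap: 0",
    "mysql/2 blocked MySQL is down. Sequence Number: 23158268. Safe To Bootstrap: 0",
]
print(pick_bootstrap_candidates(status))
```

With the status from this outage, mysql/0 and mysql/1 tie on the highest sequence number and mysql/2 is behind, which matches the choice of mysql/1 as the bootstrap unit.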
When I ran bootstrap-pxc, it hung during the 'systemctl stop mysql' call (line 578 in actions.py) for over 35 minutes. I kill -9'd the mysqld process to help the bootstrap action along its way.
After that, the action reported success; however, the status of all the mysql units stayed at blocked 'waiting to bootstrap cluster'.
When I attached to the unit mysql/1 (which was the one I ran the bootstrap-pxc action against), I could log in to mysql, but when I queried its wsrep status, it was "non-primary". When I checked mysql/0's logs, they stated that there were two units in the cluster (mysql/0 and mysql/1) but no primary, and mysql/0 was stuck waiting for SST/IST without a primary unit in the cluster.
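A check like the one described above could be scripted roughly as follows. This is a sketch under assumptions: it expects the tab-separated output of `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"`, the function names are hypothetical, and the sample text is made up to mirror what this report describes (two members, no primary) rather than copied from the affected unit.

```python
def parse_wsrep_status(raw):
    """Parse 'name<TAB>value' rows (mysql -N -B output) into a dict."""
    status = {}
    for row in raw.splitlines():
        name, sep, value = row.partition("\t")
        if sep:  # skip rows that are not name/value pairs
            status[name.strip()] = value.strip()
    return status


def is_primary(status):
    """True only when the node reports it is in the primary component."""
    return status.get("wsrep_cluster_status", "").lower() == "primary"


# Illustrative sample, not captured output from the affected unit.
sample = "wsrep_cluster_status\tnon-Primary\nwsrep_cluster_size\t2\n"
parsed = parse_wsrep_status(sample)
print(parsed["wsrep_cluster_size"], is_primary(parsed))
```

An action that only checks whether mysqld started would report success here, while a check on `wsrep_cluster_status` would flag that the bootstrapped node never became primary.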
I had to manually stop the mysql/1 unit's mysql server with kill -9, and then start the service manually again with 'systemctl start <email address hidden>' to resolve the issue. Once this was done, all three units joined the cluster and sync/primary was established.
I feel that the bootstrap-pxc action may need a timeout and a force option that would perform this kill -9 <mysqlpid> operation, so one doesn't have to log in to the unit.
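To make the suggestion concrete, here is a minimal sketch of a stop-with-timeout-then-force step. It is not the charm's code: `stop_with_timeout` and its parameters are hypothetical, and the 300-second value in the usage comment is an arbitrary assumption. `systemctl kill --signal=SIGKILL` is used in place of a manual kill -9 on the PID.

```python
import subprocess


def stop_with_timeout(stop_cmd, kill_cmd, timeout, run=subprocess.run):
    """Run stop_cmd; if it exceeds `timeout` seconds, run kill_cmd instead.

    Returns "stopped" when the graceful stop finished in time and "killed"
    when the forced fallback was used. `run` is injectable so the logic
    can be exercised without touching a real service.
    """
    try:
        run(stop_cmd, check=True, timeout=timeout)
        return "stopped"
    except subprocess.TimeoutExpired:
        # The graceful stop hung; fall back to the forced kill.
        run(kill_cmd, check=True)
        return "killed"


# Possible usage on a unit (assumed values, not run here):
# stop_with_timeout(
#     ["systemctl", "stop", "mysql"],
#     ["systemctl", "kill", "--signal=SIGKILL", "mysql"],
#     timeout=300,
# )
```

A force option on the action could then decide whether the fallback is allowed at all, since SIGKILL skips any clean shutdown work mysqld would otherwise do.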
More details have been provided offline to the openstack charmers regarding this event. My assumption is that the stop command hangs because the nodes are all waiting for an SST while in split-brain over the sequence number.
The systemd service unit accepts a TimeoutStopSec option, which could be set to some reasonable length of time so that a hung service stop times out instead of blocking the bootstrap-pxc action.
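As a rough illustration of the TimeoutStopSec idea, the charm could render a systemd drop-in along these lines. The helper name and the 600-second value are assumptions; the drop-in would normally live in the unit's `.service.d` directory under /etc/systemd/system and take effect after `systemctl daemon-reload`.

```python
def render_stop_timeout_dropin(timeout_seconds):
    """Return the text of a systemd drop-in that caps service stop time.

    Hypothetical helper; the caller decides which unit's drop-in
    directory the text is written to.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return "[Service]\nTimeoutStopSec={}\n".format(timeout_seconds)


print(render_stop_timeout_dropin(600))
```

With this in place, systemd itself would forcibly terminate the service with SIGKILL once the stop timeout expires, so the action would not need its own kill logic.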