While the mariadb_recovery playbook has never been the shining pinnacle of Galera recovery that it could be, it is in a pretty bad state right now.
In the past the playbook's behavior was "supply a recovery host, or it will recover the first host in the mariadb group (mariadb[0])". Now it tries to be a bit smarter and read grastate.dat properly, but it does not account for non-graceful shutdowns.
A patch added prior to Newton reads grastate.dat and tries to parse it for the highest seqno. The data-loss scenario is when you have shut down a node gracefully, some time passes, and then your cluster crashes. This isn't uncommon and shouldn't be dismissed. The crashed nodes are left with a seqno of -1 in grastate.dat, so the stale node appears to have the highest seqno. When that happens the playbook will choose the old, gracefully shut down node and stomp the data on the rest of the nodes. This is all done without user interaction and is exceedingly dangerous, since no backup is taken either.
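To make the failure mode concrete, here is a minimal sketch in Python (not the playbook's actual code). The node names, UUID and seqno values are made up for illustration, and `parse_grastate` / `seqno_of` are hypothetical helpers. It assumes the usual `key: value` layout of grastate.dat.

```python
def parse_grastate(text):
    """Return the fields of a grastate.dat file as a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# GALERA saved state" header
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def seqno_of(text):
    """Extract the seqno field as an integer (-1 after an unclean shutdown)."""
    return int(parse_grastate(text)["seqno"])


# Hypothetical contents: node1 was shut down gracefully some time ago,
# node2 and node3 kept taking writes and then crashed.
GRACEFUL_OLD = """# GALERA saved state
version: 2.1
uuid:    6a3f0c1e-0000-0000-0000-000000000001
seqno:   1500
"""
CRASHED = """# GALERA saved state
version: 2.1
uuid:    6a3f0c1e-0000-0000-0000-000000000001
seqno:   -1
"""

states = {"node1": GRACEFUL_OLD, "node2": CRASHED, "node3": CRASHED}

# Naive "highest seqno wins" selection picks the stale node.
naive_choice = max(states, key=lambda name: seqno_of(states[name]))
print(naive_choice)
```

With these inputs the naive selection returns node1, the node with the oldest data, because -1 always loses to any real seqno.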
The proper recovery method, which can be automated and works a good portion of the time, is as follows:
* Check if all mariadb nodes are stopped
  * If not stopped then do not recover
* Check if any mariadb nodes have gvwstate.dat
  * If gvwstate.dat is found, start *only* the nodes with gvwstate.dat, without special options
  * This is not a guaranteed recovery, but it is a safe action (no data loss can occur)
* Check if any mariadb nodes have grastate.dat
  * If grastate.dat exists and has a seqno of -1 on any node, it is not safe to autorecover, abort
  * If no -1 exists on any node, bootstrap the node with the highest seqno
This will cover a good chunk of the failure scenarios, including graceful shutdowns and full cluster outages, without any risk of data loss. If it cannot recover automatically then the user should be *forced* to supply a bootstrap node on the command line based on whatever criteria they want (sometimes that means guessing, but guessing should be done by the user, not Kolla-Ansible).
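The decision procedure above could be sketched roughly like this in Python. This is only an illustration of the logic, not a proposed patch: `NodeState`, `plan_recovery` and the action names are hypothetical, and gathering the per-node facts (is mariadb running, does gvwstate.dat exist, what seqno is in grastate.dat) is assumed to happen elsewhere. Aborting when no node has a grastate.dat is my own assumption, since the steps above do not cover that case.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeState:
    name: str
    running: bool          # is mariadb currently running on this node?
    has_gvwstate: bool     # does gvwstate.dat exist?
    seqno: Optional[int]   # seqno from grastate.dat, None if the file is missing


def plan_recovery(nodes: List[NodeState]) -> Tuple[str, List[str]]:
    """Return (action, node names) following the steps listed above."""
    # Step 1: only recover when every node is stopped.
    if any(n.running for n in nodes):
        return ("abort", [])

    # Step 2: nodes with gvwstate.dat are started normally, no special options.
    with_gvwstate = [n.name for n in nodes if n.has_gvwstate]
    if with_gvwstate:
        return ("start_normally", with_gvwstate)

    # Step 3: fall back to grastate.dat.
    with_grastate = [n for n in nodes if n.seqno is not None]
    if not with_grastate:
        return ("abort", [])  # assumption: nothing to go on, let the user decide
    if any(n.seqno == -1 for n in with_grastate):
        return ("abort", [])  # unclean shutdown somewhere, not safe to guess
    best = max(with_grastate, key=lambda n: n.seqno)
    return ("bootstrap", [best.name])


# All three nodes were shut down gracefully: bootstrap the most advanced one.
clean = [
    NodeState("node1", False, False, 10),
    NodeState("node2", False, False, 12),
    NodeState("node3", False, False, 11),
]
print(plan_recovery(clean))

# One old graceful shutdown plus two crashed nodes: refuse to pick.
crashed = [
    NodeState("node1", False, False, 1500),
    NodeState("node2", False, False, -1),
    NodeState("node3", False, False, -1),
]
print(plan_recovery(crashed))
```

An "abort" result is where the playbook would stop and require the user to name the bootstrap node explicitly.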
I have also run into this.