HA: galera cannot recover from a network split on a 2-node
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Undecided | Damien Ciabrini |
Bug Description
Galera and pacemaker each have their own notion of quorum. When a
network split occurs in a two-node overcloud, both nodes become
inquorate from both galera's and pacemaker's point of view.
The pacemaker resource agent always demotes a node when it loses
galera quorum; however it cannot promote it back because it waits for
the other node to advertise its DB sequence number in the CIB, and
that information is unavailable during the network split.
Pacemaker can recover from its quorum loss if one of the nodes
manages to fence its peer. From that moment onward, the
pacemaker cluster is unblocked and the HA services can be restarted
and run on a single node temporarily.
However, the galera resource agent is currently unable to decide on
its own to restart the resource, even after pacemaker has fenced the
other node and determined that the local node is the surviving node
in the cluster.
So the DB service stays down and cannot recover until the network
disruption is resolved.
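
The blocked promotion described above can be sketched in a few lines of Python. This is only an illustrative model, not the resource agent's code: the function name, the node names, and the dictionary standing in for the CIB are all hypothetical, and the rule that the node with the highest sequence number is the one promoted is an assumption of this sketch.

```python
from typing import Dict, List, Optional


def can_promote(local: str,
                cluster_nodes: List[str],
                cib_seqnos: Dict[str, Optional[int]]) -> bool:
    """Model of the promotion check (hypothetical names throughout).

    cib_seqnos stands in for the sequence numbers that nodes advertise
    in the CIB; None means a node has not advertised one.
    """
    # The agent waits until every node has advertised its sequence number.
    # During a network split the peer's value never arrives, so this
    # check keeps failing and the node stays demoted.
    if any(cib_seqnos.get(node) is None for node in cluster_nodes):
        return False

    # Assumption of this sketch: the most up-to-date node is promoted.
    # Ties are not resolved here.
    highest = max(cib_seqnos[node] for node in cluster_nodes)
    return cib_seqnos[local] == highest


nodes = ["node-a", "node-b"]

# Network split: node-b's sequence number is not visible to node-a.
print(can_promote("node-a", nodes, {"node-a": 100, "node-b": None}))

# Network restored: both values are known, node-a is ahead.
print(can_promote("node-a", nodes, {"node-a": 100, "node-b": 90}))
```

In this model the first call can never succeed while the split lasts, which matches the behaviour reported here: fencing the peer unblocks pacemaker but does not supply the missing sequence number.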
Changed in tripleo:
status: Triaged → In Progress
Reviewed: https://review.opendev.org/758153
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5836bcc15b3e28160b13f25e1022a79002a71dd2
Submitter: Zuul
Branch: master
commit 5836bcc15b3e28160b13f25e1022a79002a71dd2
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200
galera: expose 2-node mode for the galera resource
When deploying a 2-node HA overcloud, the galera resource
agent can be configured to enable a "2-node mode" heuristic,
that allows it to restart a galera node in the event of a
network split.
Make this resource agent's option available in puppet via
the new parameter "two_node_mode".
Closes-Bug: #1903051
Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
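
The commit message does not spell out the "2-node mode" heuristic itself. As a rough sketch only, one plausible reading is that a node may restart galera on its own when the option is enabled, the cluster has exactly two nodes, and pacemaker has confirmed that the peer was fenced. Every name below except "two_node_mode" is hypothetical, and the exact conditions are an assumption, not the resource agent's actual logic.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """What the surviving node knows locally (hypothetical fields)."""
    two_node_mode: bool    # mirrors the "two_node_mode" option
    cluster_size: int      # number of nodes configured in the cluster
    peer_fenced: bool      # pacemaker confirmed the peer was fenced
    galera_running: bool   # whether the local galera node is still up


def should_restart_alone(view: NodeView) -> bool:
    """Assumed heuristic: restart only when it is safe to act alone."""
    if not view.two_node_mode:
        return False   # option disabled: keep waiting for the peer
    if view.cluster_size != 2:
        return False   # the heuristic only targets 2-node clusters
    if view.galera_running:
        return False   # nothing to restart
    # Once the peer is fenced it cannot be running a competing galera
    # node, so the survivor can restart without its sequence number.
    return view.peer_fenced


survivor = NodeView(two_node_mode=True, cluster_size=2,
                    peer_fenced=True, galera_running=False)
print(should_restart_alone(survivor))
```

Tying the decision to fencing rather than to a timeout is the point of the sketch: fencing is the event that, per the bug description, already lets pacemaker treat one node as the survivor.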