tripleo

HA: galera cannot recover from a network split on a 2-node

Bug #1903051 reported by Damien Ciabrini on 2020-11-05

This bug affects 1 person

Affects    Status         Importance   Assigned to       Milestone
tripleo    Fix Released   Undecided    Damien Ciabrini   tripleo wallaby-1

Bug Description

Galera and pacemaker each have their own notion of quorum. When a
network split occurs in a two-node overcloud, both nodes become
inquorate from both the galera and the pacemaker point of view.

The pacemaker resource agent always demotes a node when it loses
galera quorum; however it cannot promote it back because it waits for
the other node to advertise its DB sequence number in the CIB, and
that information is unavailable during the network split.

Pacemaker can recover from its quorum loss if one of the nodes
manages to fence its peer. From that moment onward, the
pacemaker cluster is unblocked and the HA services can be restarted
and run on a single node temporarily.

However, the galera resource agent currently cannot decide on its
own to restart the resource, even after pacemaker has fenced the
other node and determined that the local node is the sole survivor
in the cluster.

So the DB service stays down and cannot recover until the network
disruption is resolved.
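
The blocked promotion described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the resource agent's actual code: the function name `pick_bootstrap_node`, the node names and the dictionary standing in for the CIB attributes are all hypothetical. It shows why a decision that needs every node's sequence number can never complete while the peer is unreachable.

```python
from typing import Dict, List, Optional


def pick_bootstrap_node(expected_nodes: List[str],
                        advertised_seqnos: Dict[str, int]) -> Optional[str]:
    """Return the node to restart the DB from, or None if undecidable.

    Hypothetical model: a node may only be promoted once every expected
    node has advertised its DB sequence number (in the real cluster this
    information is stored in the CIB).
    """
    missing = [n for n in expected_nodes if n not in advertised_seqnos]
    if missing:
        # At least one peer has not reported: keep waiting.
        return None
    # All sequence numbers are known: the most advanced node wins.
    return max(expected_nodes, key=lambda n: advertised_seqnos[n])


nodes = ["ctrl-0", "ctrl-1"]  # hypothetical node names

# During the network split, only the local sequence number is visible.
print(pick_bootstrap_node(nodes, {"ctrl-0": 1042}))

# Once both nodes have reported, a decision becomes possible.
print(pick_bootstrap_node(nodes, {"ctrl-0": 1042, "ctrl-1": 1050}))
```

In this model the first call keeps returning `None` for as long as the split lasts, which matches the symptom reported here: the DB service stays down even though pacemaker itself has recovered.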

Tags:

OpenStack Infra (hudson-openstack) on 2020-11-05

Changed in tripleo:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2020-11-09: Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/758153
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5836bcc15b3e28160b13f25e1022a79002a71dd2
Submitter: Zuul
Branch: master

commit 5836bcc15b3e28160b13f25e1022a79002a71dd2
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

galera: expose 2-node mode for the galera resource

When deploying a 2-node HA overcloud, the galera resource
agent can be configured to enable a "2-node mode" heuristic,
that allows it to restart a galera node in the event of a
network split.

Make this resource agent's option available in puppet via
the new parameter "two_node_mode".

Closes-Bug: #1903051

Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
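
The kind of decision a "2-node mode" heuristic enables can be sketched as follows. This is a hypothetical model based only on the bug description (pacemaker regains quorum once it has fenced the peer); the names `ClusterView` and `may_restart_alone` and the exact conditions checked are assumptions, and the real resource agent may apply different or additional checks.

```python
from dataclasses import dataclass


@dataclass
class ClusterView:
    """Hypothetical snapshot of what the surviving node knows."""
    cluster_size: int           # number of nodes configured in the cluster
    partition_has_quorum: bool  # pacemaker's quorum state for this partition
    peer_is_fenced: bool        # whether the other node was fenced


def may_restart_alone(two_node_mode: bool, view: ClusterView) -> bool:
    """Decide whether galera may be restarted without the peer's seqno.

    Assumed rule: only when the option is enabled, the cluster has exactly
    two nodes, and pacemaker has both regained quorum and fenced the peer.
    """
    if not two_node_mode:
        return False
    if view.cluster_size != 2:
        return False
    return view.partition_has_quorum and view.peer_is_fenced


after_fencing = ClusterView(cluster_size=2,
                            partition_has_quorum=True,
                            peer_is_fenced=True)

print(may_restart_alone(False, after_fencing))  # option disabled
print(may_restart_alone(True, after_fencing))   # option enabled
```

Requiring the peer to be fenced in this sketch reflects the concern that restarting a database on both sides of a split would let the two copies diverge.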

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2020-11-09: Fix proposed to puppet-tripleo (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/761992

OpenStack Infra (hudson-openstack) wrote on 2020-11-13: Fix merged to puppet-tripleo (stable/victoria)

Reviewed: https://review.opendev.org/761992
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=24d0ff54f38a1dc1fcc59e8999b0b55989e12070
Submitter: Zuul
Branch: stable/victoria

commit 24d0ff54f38a1dc1fcc59e8999b0b55989e12070
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

galera: expose 2-node mode for the galera resource

Make this resource agent's option available in puppet via
the new parameter "two_node_mode".

Closes-Bug: #1903051

Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
(cherry picked from commit 5836bcc15b3e28160b13f25e1022a79002a71dd2)

tags:

added: in-stable-victoria

OpenStack Infra (hudson-openstack) wrote on 2020-11-13: Fix proposed to puppet-tripleo (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/762675

OpenStack Infra (hudson-openstack) wrote on 2020-11-16: Fix merged to puppet-tripleo (stable/ussuri)

Reviewed: https://review.opendev.org/762675
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=7e7aa969b78acd7196cbe993173e26033744fbe2
Submitter: Zuul
Branch: stable/ussuri

commit 7e7aa969b78acd7196cbe993173e26033744fbe2
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

galera: expose 2-node mode for the galera resource

Make this resource agent's option available in puppet via
the new parameter "two_node_mode".

Closes-Bug: #1903051

Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
(cherry picked from commit 5836bcc15b3e28160b13f25e1022a79002a71dd2)
(cherry picked from commit 24d0ff54f38a1dc1fcc59e8999b0b55989e12070)

tags:

added: in-stable-ussuri

OpenStack Infra (hudson-openstack) wrote on 2020-11-16: Fix proposed to puppet-tripleo (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/762837

OpenStack Infra (hudson-openstack) wrote on 2020-11-19: Fix merged to puppet-tripleo (stable/train)

Reviewed: https://review.opendev.org/762837
Committed: https://opendev.org/openstack/puppet-tripleo/commit/3e9b801d5d4843cb767478567a89aadbeb2d07c7
Submitter: Zuul
Branch: stable/train

commit 3e9b801d5d4843cb767478567a89aadbeb2d07c7
Author: Damien Ciabrini <email address hidden>
Date: Wed Oct 14 16:59:27 2020 +0200

galera: expose 2-node mode for the galera resource

Make this resource agent's option available in puppet via
the new parameter "two_node_mode".

Closes-Bug: #1903051

Change-Id: I543ee77ec38b6429989435122ae0c257d279e507
(cherry picked from commit 5836bcc15b3e28160b13f25e1022a79002a71dd2)
(cherry picked from commit 24d0ff54f38a1dc1fcc59e8999b0b55989e12070)
(cherry picked from commit 7e7aa969b78acd7196cbe993173e26033744fbe2)