no-quorum-policy=ignore regardless of cluster size is dangerous and may exacerbate split brain

Bug #1354452 reported by Gareth Woolridge
This bug affects 1 person
Affects: hacluster (Juju Charms Collection)
Status: Fix Released
Importance: High
Assigned to: Liam Young

Bug Description

We recently experienced a split-brain scenario in our HA environment where every node in the HA cluster grabbed the VIP after one of the instances crashed and its hardware restarted.

We run converged infrastructure, with the core OpenStack HA services deployed as 3 instances each under LXC across 3 physical nodes. This failure scenario was observed on all HA services when one physical node suffered a hardware-related reboot.

Running crm status on these nodes showed that the cluster was not quorate and that each node reported the other 2 nodes as offline.

Bouncing corosync and pacemaker on the HA nodes restored normal operation; we then analysed the logs for likely causes, without much success.

However, crm configure show reveals "no-quorum-policy=ignore" to be set across our HA clusters; this is confirmed to be set by the charm as part of configure_cluster.

An internet search suggests this setting is required for a 2-node cluster (otherwise services would stop whenever one node went down), but that it should not be set on larger clusters, where it is not safe:

"Setting no-quorum-policy="ignore" is required in 2-node Pacemaker clusters for the following reason: if quorum enforcement is enabled, and one of the two nodes fails, then the remaining node can not establish a majority of quorum votes necessary to run services, and thus it is unable to take over any resources. The appropriate workaround is to ignore loss of quorum in the cluster. This is safe and necessary only in 2-node clusters. Do not set this property in Pacemaker clusters with more than two nodes. "

source: http://docs.openstack.org/high-availability-guide/content/_setting_basic_cluster_properties.html
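
To make the arithmetic concrete: in our 3-node cluster a majority is 2 votes, so a single failed node leaves a quorate pair behind, while a node partitioned on its own loses quorum. With no-quorum-policy=ignore that isolated node carries on running resources regardless, which is presumably how every node ended up holding the VIP.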

We have manually set no-quorum-policy=stop for now on our 3-node cluster (via crm configure property no-quorum-policy=stop). Should the charm set this value appropriately depending on cluster size?
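
For what it's worth, a minimal sketch of the size-dependent selection we have in mind, assuming the charm shells out to crmsh and can count its peers (configure_quorum_policy and node_count are hypothetical names, not the charm's actual code):

    import subprocess

    def configure_quorum_policy(node_count):
        # A 2-node cluster can never hold a majority after one failure,
        # so quorum loss has to be ignored for the survivor to keep its
        # resources. With 3 or more nodes a majority can survive a
        # single failure, so the safe behaviour on quorum loss is to
        # stop resources on the non-quorate side.
        policy = 'ignore' if node_count < 3 else 'stop'
        subprocess.check_call(
            ['crm', 'configure', 'property',
             'no-quorum-policy={}'.format(policy)])

Called with node_count=3 this sets no-quorum-policy=stop, matching our manual workaround; on a 2-node deployment it preserves the current ignore behaviour.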

Tags: openstack

James Page (james-page)
Changed in hacluster (Juju Charms Collection):
importance: Undecided → High
status: New → Triaged
Liam Young (gnuoy)
Changed in hacluster (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
tags: added: openstack
James Page (james-page)
Changed in hacluster (Juju Charms Collection):
status: Triaged → Fix Committed
Changed in hacluster (Juju Charms Collection):
status: Fix Committed → Fix Released