Cluster fails when 2 controller nodes go down simultaneously | tripleo wallaby

Bug #1995156 reported by swogat pradhan

Bug Description

I have configured a 3-node pcs cluster for OpenStack.
To test HA, I issue the following commands, which block all traffic except SSH (port 22) and port 5016 and thereby isolate the node from the rest of the cluster:
iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT &&
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT &&
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 5016 -j ACCEPT &&
iptables -A INPUT -p udp -m state --state NEW -m udp --dport 5016 -j ACCEPT &&
iptables -A INPUT ! -i lo -j REJECT --reject-with icmp-host-prohibited &&
iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT &&
iptables -A OUTPUT -p tcp --sport 5016 -j ACCEPT &&
iptables -A OUTPUT -p udp --sport 5016 -j ACCEPT &&
iptables -A OUTPUT ! -o lo -j REJECT --reject-with icmp-host-prohibited

When I run these iptables commands on one node, that node is fenced and forced to reboot, and the cluster keeps working fine.
But when I run them on two of the controller nodes at the same time, the resource bundles fail and don't come back up.
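For what it's worth, fencing activity can be confirmed from a surviving node with Pacemaker's stonith_admin; a minimal check, assuming the standard Pacemaker tooling shipped on the overcloud controllers:

  # Show the last successful fencing operation per node ('*' = all nodes)
  stonith_admin --history '*'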

[root@overcloud-controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: overcloud-controller-1 (version 2.1.2-4.el8-ada5c3b36e2) - partition WITHOUT quorum
  * Last updated: Sat Oct 29 03:15:29 2022
  * Last change: Sat Oct 29 03:12:26 2022 by root via crm_resource on overcloud-controller-1
  * 19 nodes configured
  * 68 resource instances configured

Node List:
  * Node overcloud-controller-0: UNCLEAN (offline)
  * Node overcloud-controller-2: UNCLEAN (offline)
  * Online: [ overcloud-controller-1 ]

Full List of Resources:
  * ip- (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ip- (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * ip- (ocf::heartbeat:IPaddr2): Stopped
  * ip- (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ip- (ocf::heartbeat:IPaddr2): Stopped
  * ip- (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * Container bundle set: haproxy-bundle []:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started overcloud-controller-2 (UNCLEAN)
    * haproxy-bundle-podman-3 (ocf::heartbeat:podman): Stopped
  * Container bundle set: galera-bundle []:
    * galera-bundle-0 (ocf::heartbeat:galera): Stopped overcloud-controller-0 (UNCLEAN)
    * galera-bundle-1 (ocf::heartbeat:galera): Stopped
    * galera-bundle-2 (ocf::heartbeat:galera): Stopped overcloud-controller-2 (UNCLEAN)
    * galera-bundle-3 (ocf::heartbeat:galera): Stopped
  * Container bundle set: redis-bundle []:
    * redis-bundle-0 (ocf::heartbeat:redis): Stopped
    * redis-bundle-1 (ocf::heartbeat:redis): Stopped overcloud-controller-2 (UNCLEAN)
    * redis-bundle-2 (ocf::heartbeat:redis): Stopped overcloud-controller-0 (UNCLEAN)
    * redis-bundle-3 (ocf::heartbeat:redis): Stopped
  * Container bundle set: ovn-dbs-bundle []:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped overcloud-controller-2 (UNCLEAN)
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped overcloud-controller-0 (UNCLEAN)
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped
    * ovn-dbs-bundle-3 (ocf::ovn:ovndb-servers): Stopped
  * ip- (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * Container bundle: openstack-cinder-backup []:
    * openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
  * Container bundle: openstack-cinder-volume []:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Stopped
  * Container bundle set: rabbitmq-bundle []:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-2 (UNCLEAN)
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0 (UNCLEAN)
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
    * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Stopped
  * ip- (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ceph-nfs (systemd:ceph-nfs@pacemaker): Started overcloud-controller-0 (UNCLEAN)
  * Container bundle: openstack-manila-share []:
    * openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
  * stonith-fence_ipmilan-48d539a11820 (stonith:fence_ipmilan): Stopped
  * stonith-fence_ipmilan-48d539a1188c (stonith:fence_ipmilan): Started overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96349068 (stonith:fence_ipmilan): Started overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96348d30 (stonith:fence_ipmilan): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

It seems pcs requires more than half of the nodes to be alive for the cluster to work.
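For reference, with three voting nodes quorum is floor(3/2)+1 = 2 votes, so a single surviving node is inquorate. A minimal sketch of confirming this from the surviving node, assuming corosync's standard tooling is installed:

  # Vote and quorum state; with only 1 of 3 votes present, "Quorate" reads "No"
  corosync-quorumtool -s

  # The same view through pcs
  pcs quorum status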

Revision history for this message
Michele Baldessari (michele) wrote:

Correct, Pacemaker will shut down services on nodes without quorum. This is by design.
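For context, this behaviour is governed by Pacemaker's no-quorum-policy cluster property, which defaults to "stop". A minimal sketch of inspecting it, and of overriding it in a throwaway lab only, since running resources such as Galera without quorum risks split-brain:

  # Query the current policy (the Pacemaker default is "stop")
  crm_attribute --type crm_config --name no-quorum-policy --query

  # Lab experiments only: keep resources running in an inquorate partition
  pcs property set no-quorum-policy=ignore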

Changed in puppet-pacemaker:
status: New → Invalid
Revision history for this message
swogat pradhan (swogat) wrote:

Is there any configuration where, even if 2 of my nodes go down, the cluster stays up on the last node?
I have tried the following settings (see the sketch after this list):
    auto_tie_breaker: 1
    last_man_standing: 1
    wait_for_all: 1
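For context, these are votequorum options set in the quorum section of /etc/corosync/corosync.conf. A minimal sketch of that section, assuming corosync 3.x defaults; note that last_man_standing only recalculates quorum downward after each individual failure, once last_man_standing_window (default 10000 ms) has elapsed, so it helps with sequential failures but cannot ride out two of three nodes dying simultaneously:

  quorum {
      provider: corosync_votequorum
      # In an even split, grant quorum to the partition holding the
      # tie-breaker node (the lowest nodeid by default)
      auto_tie_breaker: 1
      # Recalculate expected_votes downward after a failure, but only once
      # the window below has passed -- sequential failures only
      last_man_standing: 1
      last_man_standing_window: 10000
      # Do not grant quorum at first cluster start until all nodes are seen
      wait_for_all: 1
  }

The file has to match on every node, and the quorum options take effect after corosync is restarted (for example, pcs cluster stop --all followed by pcs cluster start --all).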
