Cluster fails when 2 controller nodes go down simultaneously | tripleo wallaby

Bug #1995156 reported by swogat pradhan
This bug affects 1 person
Affects: puppet-pacemaker
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have configured a 3-node pcs cluster for OpenStack.
To test HA, I run the following commands:
iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT &&
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT &&
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 5016 -j ACCEPT &&
iptables -A INPUT -p udp -m state --state NEW -m udp --dport 5016 -j ACCEPT &&
iptables -A INPUT ! -i lo -j REJECT --reject-with icmp-host-prohibited &&
iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT &&
iptables -A OUTPUT -p tcp --sport 5016 -j ACCEPT &&
iptables -A OUTPUT -p udp --sport 5016 -j ACCEPT &&
iptables -A OUTPUT ! -o lo -j REJECT --reject-with icmp-host-prohibited

When I run these iptables commands on one node, it is fenced and forced to reboot, and the cluster keeps working.
But when I run them on two of the controller nodes, the resource bundles fail and don't come back up.

[root@overcloud-controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: overcloud-controller-1 (version 2.1.2-4.el8-ada5c3b36e2) - partition WITHOUT quorum
  * Last updated: Sat Oct 29 03:15:29 2022
  * Last change: Sat Oct 29 03:12:26 2022 by root via crm_resource on overcloud-controller-1
  * 19 nodes configured
  * 68 resource instances configured

Node List:
  * Node overcloud-controller-0: UNCLEAN (offline)
  * Node overcloud-controller-2: UNCLEAN (offline)
  * Online: [ overcloud-controller-1 ]

Full List of Resources:
  * ip-172.25.201.91 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ip-172.25.201.150 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * ip-172.25.201.206 (ocf::heartbeat:IPaddr2): Stopped
  * ip-172.25.201.250 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ip-172.25.202.50 (ocf::heartbeat:IPaddr2): Stopped
  * ip-172.25.202.90 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * Container bundle set: haproxy-bundle [172.25.201.68:8787/tripleomaster/openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started overcloud-controller-2 (UNCLEAN)
    * haproxy-bundle-podman-3 (ocf::heartbeat:podman): Stopped
  * Container bundle set: galera-bundle [172.25.201.68:8787/tripleomaster/openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Stopped overcloud-controller-0 (UNCLEAN)
    * galera-bundle-1 (ocf::heartbeat:galera): Stopped
    * galera-bundle-2 (ocf::heartbeat:galera): Stopped overcloud-controller-2 (UNCLEAN)
    * galera-bundle-3 (ocf::heartbeat:galera): Stopped
  * Container bundle set: redis-bundle [172.25.201.68:8787/tripleomaster/openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Stopped
    * redis-bundle-1 (ocf::heartbeat:redis): Stopped overcloud-controller-2 (UNCLEAN)
    * redis-bundle-2 (ocf::heartbeat:redis): Stopped overcloud-controller-0 (UNCLEAN)
    * redis-bundle-3 (ocf::heartbeat:redis): Stopped
  * Container bundle set: ovn-dbs-bundle [172.25.201.68:8787/tripleomaster/openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped overcloud-controller-2 (UNCLEAN)
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped overcloud-controller-0 (UNCLEAN)
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped
    * ovn-dbs-bundle-3 (ocf::ovn:ovndb-servers): Stopped
  * ip-172.25.201.208 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * Container bundle: openstack-cinder-backup [172.25.201.68:8787/tripleomaster/openstack-cinder-backup:pcmklatest]:
    * openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
  * Container bundle: openstack-cinder-volume [172.25.201.68:8787/tripleomaster/openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Stopped
  * Container bundle set: rabbitmq-bundle [172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-2 (UNCLEAN)
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0 (UNCLEAN)
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
    * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Stopped
  * ip-172.25.204.250 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ceph-nfs (systemd:ceph-nfs@pacemaker): Started overcloud-controller-0 (UNCLEAN)
  * Container bundle: openstack-manila-share [172.25.201.68:8787/tripleomaster/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
  * stonith-fence_ipmilan-48d539a11820 (stonith:fence_ipmilan): Stopped
  * stonith-fence_ipmilan-48d539a1188c (stonith:fence_ipmilan): Started overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96349068 (stonith:fence_ipmilan): Started overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96348d30 (stonith:fence_ipmilan): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

It seems that PCS requires more than half of the nodes to be alive for the cluster to work.

Michele Baldessari (michele) wrote :

Correct, pacemaker will shut down services on nodes without quorum. This is by design.
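
To illustrate the arithmetic behind this (a minimal sketch, not Pacemaker or corosync code): assuming three voting controller nodes with one vote each and the default majority rule, quorum needs floor(3/2) + 1 = 2 votes, so a single surviving node is never quorate. The helper names quorum_needed and has_quorum are made up for this example.

```shell
#!/bin/sh
# Illustrative only: simple-majority quorum, assuming one vote per node.
# quorum_needed and has_quorum are hypothetical helpers, not cluster tools.

quorum_needed() {
    # $1 = total expected votes; a majority is floor(n / 2) + 1
    echo $(( $1 / 2 + 1 ))
}

has_quorum() {
    # $1 = votes currently present, $2 = total expected votes
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

# Walk through the 3-controller case from this report.
for alive in 3 2 1; do
    if has_quorum "$alive" 3; then
        echo "$alive of 3 nodes up: quorate"
    else
        echo "$alive of 3 nodes up: NOT quorate"
    fi
done
```

With one node isolated, the remaining two still hold 2 of 3 votes, which matches the first test in the report. With two nodes isolated, the survivor holds 1 of 3 votes, which matches the "partition WITHOUT quorum" line in the pcs status output.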

Changed in puppet-pacemaker:
status: New → Invalid
swogat pradhan (swogat) wrote :

Is there any configuration that keeps the cluster up on the last node even if two of my nodes go down?
I have tried the following settings:
    auto_tie_breaker: 1
    last_man_standing: 1
    wait_for_all: 1
