There are cases when masakari-hostmonitor will recognize online nodes as offline and send (in)appropriate notifications to Masakari

Bug #1878548 reported by Daisuke Suzuki
This bug affects 2 people
Affects                      Status         Importance  Assigned to        Milestone
Ubuntu Cloud Archive         Fix Released   Undecided   Unassigned
  Ussuri                     Fix Committed  High        Unassigned
  Victoria                   Fix Committed  High        Unassigned
  Wallaby                    Fix Committed  High        Unassigned
masakari-monitors            Fix Released   High        Radosław Piliszek
  Ussuri                     Fix Released   High        Radosław Piliszek
  Victoria                   Fix Released   High        Radosław Piliszek
  Wallaby                    Fix Released   High        Radosław Piliszek
  Xena                       Fix Released   High        Radosław Piliszek
masakari-monitors (Ubuntu)   Fix Released   Undecided   Unassigned
  Focal                      Fix Committed  High        Unassigned

Bug Description

[Issue]
ComputeNodes are managed by pacemaker_remote in my environment.
When one ComputeNode is isolated in the network, the masakari-hostmonitors on the other ComputeNodes send a failure notification about the isolated ComputeNode to masakari-api.
At the same time, the masakari-hostmonitor on the isolated ComputeNode sees the other ComputeNodes as offline, so it sends failure notifications about ComputeNodes that are actually online.
As a result, masakari-engine runs the recovery procedure against online ComputeNodes.

[Cause]
The current masakari-hostmonitor can't determine whether or not it is isolated in the network if ComputeNodes are managed by pacemaker_remote.

masakari-hostmonitor with pacemaker (not remote) will wait until it is killed if it is isolated in the network. This is implemented in the following code.
<https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/hostmonitor/host_handler/handle_host.py#L398-L402>

But masakari-hostmonitor with pacemaker_remote does not check whether it is isolated.
<https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/hostmonitor/host_handler/handle_host.py#L93-L95>

[Solution]
The ComputeNode managed by pacemaker_remote should recognize itself as offline when it is isolated.
The state monitoring process should be skipped in that case.
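
A minimal sketch of this proposal, not the actual masakari-monitors code: before comparing node states, the monitor looks at how the local node is reported and skips the cycle if it is not online. The function name `nodes_to_report` and the dictionary shape are hypothetical and exist only for illustration.

```python
import socket


def nodes_to_report(node_states, local_name=None):
    """Return the names of nodes that should be reported as failed.

    node_states maps a node name to "online" or "offline", as seen from
    the local node (hypothetical data shape, for illustration only).
    """
    local_name = local_name or socket.gethostname()

    # If the local node is missing or offline in its own view, it is
    # probably the isolated one, so its view of the others is unreliable.
    if node_states.get(local_name) != "online":
        return []

    return sorted(
        name
        for name, state in node_states.items()
        if name != local_name and state == "offline"
    )
```

This guard only helps if the isolated node actually sees itself as offline; a node that still reports itself as online would pass the check.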

See comment #11 for how yoctozepto managed to reproduce something similar to the described.

Changed in masakari-monitors:
assignee: nobody → Daisuke Suzuki (suzuki-di)
status: New → In Progress
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

_check_host_status_by_crmadmin [1] is the proper safeguard.
Hostmonitor should be treated as a Pacemaker proxy, so it should run on Pacemaker nodes (not remotes).
I guess this needs documenting, and the functionality should be disabled on non-Pacemaker nodes altogether.
There is no benefit to running hostmonitors on remotes; it can only result in more resource waste and less stability.

[1] https://opendev.org/openstack/masakari-monitors/src/commit/b02c6b6931c0256f4ce6d7167c97ebb849ff3453/masakarimonitors/hostmonitor/host_handler/handle_host.py#L414-L418

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I could not reproduce. An isolated remote node fails its crm_mon invocation and prevents hostmonitor from acting at all.

crm_mon logs:

Error: cluster is not available on this node

pacemaker_remoted logs:

warning: Cannot proxy request from uid 0 gid 0 because not connected to cluster
  error: Error in connection setup (/dev/shm/qb-1050-23133-15-M2JrMP/qb): Remote I/O error (121)

Changed in masakari-monitors:
status: In Progress → Incomplete
Revision history for this message
Daisuke Suzuki (suzuki-di) wrote :

> I could not reproduce. An isolated remote node fails its crm_mon invocation and prevents hostmonitor from acting at all.

You need to install crmsh on the remote node.
If crmsh is installed on the remote node, hostmonitor can execute the crm_mon command and monitor each remote node's status.
Then, I think you can reproduce this issue.
Please let me know your opinion on this.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Hah, crmsh is not available on CentOS 8.

Thanks for the info. I'll try PoCing on Ubuntu 20.04.

It's intriguing that crmsh would change the cluster behaviour (that's pretty dangerous).

Could you share the exact instructions on how you set up your cluster to achieve this?

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Ping. Daisuke?

Revision history for this message
Daisuke Suzuki (suzuki-di) wrote :

In our environment, we set up the cluster by the following steps.

1. Prepare Controller Node (1 or more) and Compute Node (3 or more).
2. Install corosync, pacemaker, crmsh, masakari-api, masakari-engine on the Controller Node.
3. Install pacemaker_remote, crmsh, masakari-hostmonitor[1] on Compute Nodes.
4. Manage pacemaker_remote cluster on Compute Nodes by the pacemaker on the Controller Node.[2]

[1]
In our environment, we deployed a masakari-hostmonitor on Compute Nodes. But you can also deploy it on Controller Nodes.

[2]
In order to manage the pacemaker_remote cluster of Compute Nodes, define a remote RA for each Compute Node in crm.

These are the pacemaker-remote RA settings. You should define them for every Compute Node managed by the pacemaker-remote cluster.

-----
primitive <defined name of remote node> ocf:pacemaker:remote \
  params reconnect_interval=10 server=<Host name or IP address of the Compute Node> \
  op migrate_from interval=0s timeout=60000 \
  op migrate_to interval=0s timeout=60000 \
  op monitor interval=20s timeout=180000 \
  op reload interval=0s timeout=60000 \
  op start interval=0s timeout=60000 \
  op stop interval=0s timeout=60000
-----
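
Since the same definition has to be repeated for every Compute Node, it can be generated. The following is only a sketch that fills the template above; the node name `compute1` and the address `192.0.2.11` are made-up placeholders, and the output is meant to be reviewed before it is given to crm.

```python
# Operation settings copied from the RA definition above.
OPS = [
    "migrate_from interval=0s timeout=60000",
    "migrate_to interval=0s timeout=60000",
    "monitor interval=20s timeout=180000",
    "reload interval=0s timeout=60000",
    "start interval=0s timeout=60000",
    "stop interval=0s timeout=60000",
]


def remote_primitive(name, server):
    """Render one ocf:pacemaker:remote primitive in crm configure syntax."""
    lines = [
        f"primitive {name} ocf:pacemaker:remote",
        f"  params reconnect_interval=10 server={server}",
    ]
    lines += [f"  op {op}" for op in OPS]
    # crm continues a definition on the next line after a trailing backslash.
    return " \\\n".join(lines)


if __name__ == "__main__":
    # Hypothetical inventory: remote node name -> host name or IP address.
    for node, address in {"compute1": "192.0.2.11"}.items():
        print(remote_primitive(node, address))
```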

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Thanks.

I would like the crmsh invocations as well (never used it in fact, only pcs and raw).

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Hmm, it seems the link to the patch is not present in the thread so hereby I am posting it now:

https://review.opendev.org/c/openstack/masakari-monitors/+/729206

My primary concern with the patch is that it adds extra complexity and the deployment discussed here is simply not recommended (also due to performance and stability reasons) - hostmonitors should be placed on cluster nodes, not remotes.
Finally, I was unable (at the time) to spin up a local reproducer. I suppose this is strongly related to the usage of crmsh doing its extra magic.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I invite you to join our Masakari meeting to discuss this: https://wiki.openstack.org/wiki/Meetings/Masakari

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I was unable to reproduce this (tried again with crmsh to be 100% fair). There must be something really peculiar about your Pacemaker+Corosync config. crmsh does not affect the outcome here. The isolated node is unable to provide the crm_mon output and is thus unable to act with hostmonitor. The way pacemaker-remote is wired, it cannot know cluster info if it is not online in that cluster. If that is not the case for you, I would suspect there is some specific config (or perhaps Pacemaker/Corosync version?) in place that misbehaves.

All in all, I will be *deprecating support for running hostmonitors on remotes*. The reasoning is simple: it brings no benefits, only complexities. The cluster has to be contacted. The hostmonitors act like controller services, proxying the Pacemaker info into Masakari.

Changed in masakari-monitors:
status: Incomplete → Invalid
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

OK, I managed to reproduce this (or close to this) issue... but in a different setup than what I understood.
Do note the scenario is very artificial so it is very unlikely to happen in real life (but something similar could still...).

Here is the setup in which I managed to make Masakari trash all the hosts:

3 controllers (all APIs, DBs, Masakari Engine and Masakari hostmonitor)
some computes, all running pacemaker_remote

hostmonitors configured to monitor only remotes (as all computes are remotes here)

Blocking corosync traffic to one of the controllers makes it become isolated and lose quorum and think all the other nodes are offline. Hostmonitor is happy to tell Masakari all remote nodes are offline...

Changed in masakari-monitors:
status: Invalid → Triaged
importance: Undecided → High
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Also, the proposed fix does not help in the situation I described. The local node is always 'online' (not to mention that non-remotes are filtered out when restricting to remotes). What's more, the "non-restricted" version is broken as well, as it does not react properly to the lack of quorum...

(And, finally, I have found a bunch of other, perhaps lesser, issues with the hostmonitor. All thanks to deep debugging and code analysis to review feature patches.)

summary: - There are cases when masakari-hostmonitor will recognize online
- ComputeNodes as offline if ComputeNodes are managed by pacemaker_remote
+ There are cases when masakari-hostmonitor will recognize online nodes as
+ offline and send (in)appropriate notifications to Masakari
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I will make this a priority of mine for the next cycle.

description: updated
Changed in masakari-monitors:
assignee: Daisuke Suzuki (suzuki-di) → Radosław Piliszek (yoctozepto)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (master)
Changed in masakari-monitors:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on masakari-monitors (master)

Change abandoned by "Radosław Piliszek <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/729206
Reason: obsoleted by https://review.opendev.org/c/openstack/masakari-monitors/+/808821

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (master)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/808821
Committed: https://opendev.org/openstack/masakari-monitors/commit/c2d9a4f9cb050faf5b863acb0b5225195a8c6fe8
Submitter: "Zuul (22348)"
Branch: master

commit c2d9a4f9cb050faf5b863acb0b5225195a8c6fe8
Author: Radosław Piliszek <email address hidden>
Date: Mon Sep 13 19:27:52 2021 +0000

    Fix hostmonitor to respect quorum

    Both cibadmin-based and crm_mon-based host status queryings were
    affected, allowing partitioned cluster to tell Masakari to
    evacuate hosts from the other partition (which nota bene include
    all remotes if applicable).

    Closes-Bug: #1878548
    Change-Id: I0b1ca8a011ee4da162a2c3a986c1dab9a3d38190
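
The quorum rule described in this commit message can be sketched as follows. This is an illustration, not the merged patch: the function name `offline_remotes` is hypothetical, the sample XML is hand-written, and the element and attribute names (`current_dc`, `with_quorum`, `node`, `online`, `type`) are assumed from crm_mon XML output and should be checked against the Pacemaker version in use.

```python
import xml.etree.ElementTree as ET


def offline_remotes(crm_mon_xml):
    """Return offline remote nodes, or None if this partition lacks quorum."""
    root = ET.fromstring(crm_mon_xml.strip())

    # Assumption: the summary carries <current_dc with_quorum="true|false"/>.
    current_dc = root.find("./summary/current_dc")
    if current_dc is None or current_dc.get("with_quorum") != "true":
        # Without quorum the local view cannot be trusted, so report nothing.
        return None

    return [
        node.get("name")
        for node in root.findall("./nodes/node")
        if node.get("type") == "remote" and node.get("online") == "false"
    ]


# Hand-written sample: the view from a controller cut off from its peers.
ISOLATED_VIEW = """
<crm_mon>
  <summary><current_dc present="true" with_quorum="false"/></summary>
  <nodes>
    <node name="controller1" online="true" type="member"/>
    <node name="controller2" online="false" type="member"/>
    <node name="compute1" online="false" type="remote"/>
  </nodes>
</crm_mon>
"""
```

With the sample above the function returns None, so nothing would be sent to Masakari even though the partitioned controller believes every remote is offline.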

Changed in masakari-monitors:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/808868

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/808869

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to masakari-monitors (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/masakari-monitors/+/808970

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 12.0.0.0rc1

This issue was fixed in the openstack/masakari-monitors 12.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/808868
Committed: https://opendev.org/openstack/masakari-monitors/commit/d090789f1247772cc37b0b7238d67f20ab644540
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit d090789f1247772cc37b0b7238d67f20ab644540
Author: Radosław Piliszek <email address hidden>
Date: Mon Sep 13 19:27:52 2021 +0000

    Fix hostmonitor to respect quorum

    Both cibadmin-based and crm_mon-based host status queryings were
    affected, allowing partitioned cluster to tell Masakari to
    evacuate hosts from the other partition (which nota bene include
    all remotes if applicable).

    Closes-Bug: #1878548
    Change-Id: I0b1ca8a011ee4da162a2c3a986c1dab9a3d38190
    (cherry picked from commit c2d9a4f9cb050faf5b863acb0b5225195a8c6fe8)

tags: added: in-stable-wallaby
tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/808869
Committed: https://opendev.org/openstack/masakari-monitors/commit/d47629f4171637bdcdfe85efe857b6aa85f839f6
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit d47629f4171637bdcdfe85efe857b6aa85f839f6
Author: Radosław Piliszek <email address hidden>
Date: Mon Sep 13 19:27:52 2021 +0000

    Fix hostmonitor to respect quorum

    Both cibadmin-based and crm_mon-based host status queryings were
    affected, allowing partitioned cluster to tell Masakari to
    evacuate hosts from the other partition (which nota bene include
    all remotes if applicable).

    Closes-Bug: #1878548
    Change-Id: I0b1ca8a011ee4da162a2c3a986c1dab9a3d38190
    (cherry picked from commit c2d9a4f9cb050faf5b863acb0b5225195a8c6fe8)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to masakari-monitors (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/808970
Committed: https://opendev.org/openstack/masakari-monitors/commit/c599293b39d29835811ff3e3c202ade3fe2c128e
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit c599293b39d29835811ff3e3c202ade3fe2c128e
Author: Radosław Piliszek <email address hidden>
Date: Mon Sep 13 19:27:52 2021 +0000

    Fix hostmonitor to respect quorum

    Both cibadmin-based and crm_mon-based host status queryings were
    affected, allowing partitioned cluster to tell Masakari to
    evacuate hosts from the other partition (which nota bene include
    all remotes if applicable).

    Closes-Bug: #1878548
    Change-Id: I0b1ca8a011ee4da162a2c3a986c1dab9a3d38190
    (cherry picked from commit c2d9a4f9cb050faf5b863acb0b5225195a8c6fe8)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 9.0.3

This issue was fixed in the openstack/masakari-monitors 9.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 10.0.2

This issue was fixed in the openstack/masakari-monitors 10.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/masakari-monitors 11.0.2

This issue was fixed in the openstack/masakari-monitors 11.0.2 release.

Changed in masakari-monitors (Ubuntu):
status: New → Fix Released
Changed in masakari-monitors (Ubuntu Focal):
status: New → Triaged
importance: Undecided → High
Changed in cloud-archive:
status: New → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Ubuntu SRU Details:

[Impact]
See bug description.

[Test Case]
See comment #11.

[Regression Potential]
It is important that our testing confirms that state monitoring is skipped for compute nodes that are managed by pacemaker_remote when they are offline and isolated. This fix has been in upstream masakari-monitors for over a year now, and masakari-monitors in Xena+ (Impish) has carried it for a while.

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Daisuke, or anyone else affected,

Accepted masakari-monitors into wallaby-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:wallaby-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-wallaby-needed to verification-wallaby-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-wallaby-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-wallaby-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Daisuke, or anyone else affected,

Accepted masakari-monitors into victoria-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:victoria-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-victoria-needed to verification-victoria-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-victoria-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-victoria-needed
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Daisuke, or anyone else affected,

Accepted masakari-monitors into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/masakari-monitors/9.0.0-0ubuntu0.20.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in masakari-monitors (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Daisuke, or anyone else affected,

Accepted masakari-monitors into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed