commit a981e0df31805b2cc3feb0a795e5d6cb2cd70c88
Author: sue <sugar-2008@163.com>
Date: Wed Jun 2 16:38:05 2021 +0800
Fix hostmonitor hanging forever after certain exceptions
The hostmonitor, like other Masakari monitors, starts as an
Oslo service (based on eventlet). The main thread is supposed
to run a loop that has an internal wait mechanism (instead of
reusing periodic_tasks from oslo_service). However, an unexpected
exception could break the loop: it never ran again, yet the
process stayed alive (because oslo_service did not stop). The
example in the bug report is the Masakari API (and/or Keystone
API) being unavailable before notification sending. This exception is
not caught early because SendNotification._make_client is
called outside of the try block (unlike the actual notification
sending). The exception bubbles up and stops the main loop,
leaving a useless hostmonitor process. The user is unaware
unless they notice the logs are no longer growing.
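
The failure mode described above can be sketched in plain Python. This is
only an illustration, not the masakari-monitors code: `ApiUnavailable`,
`make_client`, `send_notification` and `run_unguarded` are hypothetical
names, and `make_client` merely stands in for
SendNotification._make_client failing while the API is unreachable.

```python
class ApiUnavailable(Exception):
    """Hypothetical stand-in for a Masakari/Keystone connection failure."""


def make_client():
    # Stand-in for client creation; in this sketch it always fails.
    raise ApiUnavailable("API not reachable")


def send_notification(event):
    # Client creation sits outside the try block, so its failure
    # is not handled here and propagates to the caller.
    client = make_client()
    try:
        client.notify(event)
    except Exception as exc:
        print("notification failed:", exc)


def run_unguarded(iterations):
    """Unprotected main loop: the first exception ends it for good."""
    completed = 0
    for _ in range(iterations):
        send_notification({"host": "compute-1"})
        completed += 1
    return completed
```

Calling `run_unguarded(3)` raises `ApiUnavailable` on the first iteration,
so no later iteration ever runs. In a long-lived service whose process
outlives the loop, that looks like a monitor that has silently stopped.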
While the general design begs for a revamp (we might get away
with that by using Consul in the first place), the easy fix is
to prevent exceptions breaking the loop completely so that the
hostmonitor can continue to work and try to regain health.
At the very least it will keep posting ERROR messages in the log,
which are more likely to be spotted than a lack of logs
(which is, unfortunately, less commonly considered an alerting
situation).
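
A minimal sketch of the idea behind the fix, assuming a simplified loop
rather than the real hostmonitor code: every iteration is wrapped so that
an unexpected exception is logged at ERROR level and the loop carries on.
The names `run_guarded`, `check_once`, `wait` and `flaky` are hypothetical
and exist only for this illustration.

```python
import logging

LOG = logging.getLogger("hostmonitor-sketch")


def run_guarded(check_once, iterations, wait=lambda: None):
    """Run check_once repeatedly; log and continue on any exception."""
    failures = 0
    for _ in range(iterations):
        try:
            check_once()
        except Exception:
            # Logged at ERROR level with a traceback; the loop survives.
            LOG.exception("Exception caught in monitoring loop")
            failures += 1
        wait()  # placeholder for the loop's internal wait mechanism
    return failures


calls = []


def flaky():
    # Fails on the first two calls, then succeeds (API comes back).
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("API not reachable")
```

With these definitions, `run_guarded(flaky, 5)` runs all five iterations
and reports two failures, whereas an unguarded loop would have stopped at
the first one. Catching `Exception` this broadly is a deliberate trade-off:
it keeps the monitor alive and the log growing, at the cost of possibly
retrying errors that will never resolve on their own.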
This change also fixes, adapts and robustifies the two relevant
unit tests.
Reviewed: https://review.opendev.org/c/openstack/masakari-monitors/+/802348
Committed: https://opendev.org/openstack/masakari-monitors/commit/a981e0df31805b2cc3feb0a795e5d6cb2cd70c88
Submitter: "Zuul (22348)"
Branch: stable/victoria
Closes-Bug: #1930361
Co-Authored-By: Radosław Piliszek <email address hidden>
Change-Id: I7e3447dcddc7998e3e3c30f4f0019d91a99c79ce
(cherry picked from commit e7154f3d77ee4c06eec603a850ec941668eb602f)