units stuck in insufficient peers even though there are peers

Bug #1915045 reported by Jason Hobbs
This bug affects 1 person
Affects: OpenStack HA Cluster Charm
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

In an HA deploy, we have 3 units of hacluster-octavia, but 2 say there aren't enough peers and remain blocked forever:

octavia/0* blocked executing 0/lxd/8 10.244.40.104 9876/tcp 'shared-db' missing, 'amqp' missing, 'identity-service' incomplete, 'sdn-subordinate' missing, Awaiting end-user execution of `configure-resources` action to create required resources
  hacluster-octavia/0* blocked executing 10.244.40.104 Insufficient peer units for ha cluster (require 3)
  logrotated/21 active executing 10.244.40.104 (config-changed) Unit is ready.
  neutron-openvswitch-octavia/0* maintenance executing 10.244.40.104 (config-changed) Configuring ovs
  public-policy-routing/11 active executing 10.244.40.104 (start) Unit is ready
octavia/1 blocked executing 2/lxd/8 10.244.41.66 9876/tcp 'shared-db' incomplete, 'amqp' missing, 'identity-service' missing, 'sdn-subordinate' missing, Awaiting leader to create required resources
  hacluster-octavia/2 active executing 10.244.41.66 (leader-settings-changed) Unit is ready and clustered
  logrotated/61 active executing 10.244.41.66 (config-changed) Unit is ready.
  neutron-openvswitch-octavia/2 maintenance executing 10.244.41.66 (config-changed) Configuring ovs
  public-policy-routing/46 active executing 10.244.41.66 (config-changed) Unit is ready
octavia/2 blocked executing 4/lxd/9 10.244.41.26 9876/tcp 'shared-db' missing, 'amqp' missing, 'identity-service' incomplete, 'sdn-subordinate' missing, Awaiting leader to create required resources
  hacluster-octavia/1 blocked executing 10.244.41.26 (config-changed) Insufficient peer units for ha cluster (require 3)
  logrotated/58 active executing 10.244.41.26 (config-changed) Unit is ready.
  neutron-openvswitch-octavia/1 maintenance executing 10.244.41.26 (config-changed) Configuring ovs
  public-policy-routing/44 active executing 10.244.41.26 (start) Unit is ready

Example test run:
https://solutions.qa.canonical.com/testruns/testRun/3675e02f-09b9-4695-9696-ae6ae7f4921d

tags: added: cdo-qa foundations-engine
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is blocking solutions QA release testing; sub'd to field high

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Hi and thanks for reporting! This is strange, because the only way this can happen is if the number of peer units is smaller than the config option `cluster_count` [0], which you have correctly set to 3. related_units() is called in order to determine that count; however, it is decorated with `@cached` [1]. I'm wondering if this caching could be the culprit (see the sketch after the links below).

[0] https://github.com/openstack/charm-hacluster/blob/master/hooks/utils.py#L1338
[1] https://github.com/openstack/charm-hacluster/blob/master/charmhelpers/core/hookenv.py#L539
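
For reference, a minimal sketch of the kind of check being discussed; the function names and the assumption that the peer relation is called 'hanode' are illustrative, not the charm's exact code:

from charmhelpers.core.hookenv import config, related_units, relation_ids

def expected_peer_count():
    # cluster_count is the config option mentioned above (defaults to 3)
    return int(config('cluster_count') or 3)

def current_peer_count():
    peers = 0
    for rid in relation_ids('hanode'):    # assumed peer relation name
        # related_units() is the @cached call referenced in [1]; within a
        # single hook execution it keeps returning the first (memoised)
        # result, so a stale or short list here would undercount the peers.
        peers += len(related_units(rid))
    return peers + 1                      # count the local unit as well

def sufficient_peers():
    return current_peer_count() >= expected_peer_count()

If the memoised related_units() result were captured before all peers had joined, sufficient_peers() would keep returning False for the rest of that hook run even though the peers exist.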

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This is a strange one, to be sure. I suspect we'd need to have a poke at a live (failed) system to see what is going on with the relation data for it to get into this state. Is it possible to give the OpenStack charms team a nudge when this next occurs so we can take a look, please?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1915045] Re: units stuck in insufficient peers even though there are peers

In general, we're running automated tests all the time.

Why isn't this relation data being logged? Is there a way to turn on
logging for it?

We can try to reproduce manually depending on time availability.
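
As a rough sketch of what could be added for debugging (not an existing charm option; the 'hanode' relation name and the helper name are assumptions), something like this inside the charm would dump the peer relation data to the unit log, visible via juju debug-log; recent juju versions can also show relation data out of band with `juju show-unit <unit>`:

from charmhelpers.core.hookenv import log, related_units, relation_get, relation_ids

def log_peer_relation_data():
    # Dump every peer's relation settings to the unit log for inspection
    for rid in relation_ids('hanode'):           # assumed peer relation name
        for unit in related_units(rid):
            settings = relation_get(rid=rid, unit=unit)
            log('hanode {} {}: {}'.format(rid, unit, settings), level='DEBUG')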


Changed in charm-hacluster:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack hacluster charm because there has been no activity for 60 days.]

Changed in charm-hacluster:
status: Incomplete → Expired
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We have another occurrence: https://solutions.qa.canonical.com/testruns/testRun/ab348755-9e31-4220-9ac8-3a7680883b57 . The crashdumps are here: https://oil-jenkins.canonical.com/artifacts/ab348755-9e31-4220-9ac8-3a7680883b57/index.html.
I suspect it is an issue with juju relation data not being propagated completely.

Changed in charm-hacluster:
status: Expired → New
Revision history for this message
Moises Emilio Benzan Mora (moisesbenzan) wrote :
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

re: comment #7, the crashdump's juju-status.txt does have hacluster-placement saying 'Insufficient ..', but the actual units themselves seem fine:

placement/0 active idle 0/lxd/12 10.246.167.188 8778/tcp Unit is ready
  filebeat/70 active idle 10.246.167.188 Filebeat ready.
  hacluster-placement/2 active idle 10.246.167.188 Unit is ready and clustered
  landscape-client/66 active idle 10.246.167.188 System successfully registered
  logrotated/66 active idle 10.246.167.188 Unit is ready.
  nrpe/79 active idle 10.246.167.188 icmp,5666/tcp Ready
  placement-mysql-router/2 active idle 10.246.167.188 Unit is ready
  prometheus-grok-exporter/63 active idle 10.246.167.188 9144/tcp Unit is ready
  public-policy-routing/42 active idle 10.246.167.188 Unit is ready
  telegraf/70 active idle 10.246.167.188 9103/tcp Monitoring placement/0 (source version/commit 23.01-8-...)
  ubuntu-advantage/67 active idle 10.246.167.188 Attached (esm-apps,esm-infra)
placement/1* active idle 1/lxd/11 10.246.165.31 8778/tcp Unit is ready
  filebeat/52 active idle 10.246.165.31 Filebeat ready.
  hacluster-placement/1 active idle 10.246.165.31 Unit is ready and clustered
  landscape-client/50 active idle 10.246.165.31 System successfully registered
  logrotated/51 active idle 10.246.165.31 Unit is ready.
  nrpe/67 active idle 10.246.165.31 icmp,5666/tcp Ready
  placement-mysql-router/1 active idle 10.246.165.31 Unit is ready
  prometheus-grok-exporter/53 active idle 10.246.165.31 9144/tcp Unit is ready
  public-policy-routing/32 active idle 10.246.165.31 Unit is ready
  telegraf/51 active idle 10.246.165.31 9103/tcp Monitoring placement/1 (source version/commit 23.01-8-...)
  ubuntu-advantage/56 active idle 10.246.165.31 Attached (esm-apps,esm-infra)
placement/2 active idle 2/lxd/11 10.246.167.191 8778/tcp Unit is ready
  filebeat/54 active idle 10.246.167.191 Filebeat re...

