units stuck in insufficient peers even though there are peers

Bug #1915045 reported by Jason Hobbs
This bug affects 1 person
Affects: OpenStack HA Cluster Charm
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

In an HA deploy, we have 3 units of hacluster-octavia, but 2 say there aren't enough peers and remain blocked forever:

octavia/0* blocked executing 0/lxd/8 10.244.40.104 9876/tcp 'shared-db' missing, 'amqp' missing, 'identity-service' incomplete, 'sdn-subordinate' missing, Awaiting end-user execution of `configure-resources` action to create required resources
  hacluster-octavia/0* blocked executing 10.244.40.104 Insufficient peer units for ha cluster (require 3)
  logrotated/21 active executing 10.244.40.104 (config-changed) Unit is ready.
  neutron-openvswitch-octavia/0* maintenance executing 10.244.40.104 (config-changed) Configuring ovs
  public-policy-routing/11 active executing 10.244.40.104 (start) Unit is ready
octavia/1 blocked executing 2/lxd/8 10.244.41.66 9876/tcp 'shared-db' incomplete, 'amqp' missing, 'identity-service' missing, 'sdn-subordinate' missing, Awaiting leader to create required resources
  hacluster-octavia/2 active executing 10.244.41.66 (leader-settings-changed) Unit is ready and clustered
  logrotated/61 active executing 10.244.41.66 (config-changed) Unit is ready.
  neutron-openvswitch-octavia/2 maintenance executing 10.244.41.66 (config-changed) Configuring ovs
  public-policy-routing/46 active executing 10.244.41.66 (config-changed) Unit is ready
octavia/2 blocked executing 4/lxd/9 10.244.41.26 9876/tcp 'shared-db' missing, 'amqp' missing, 'identity-service' incomplete, 'sdn-subordinate' missing, Awaiting leader to create required resources
  hacluster-octavia/1 blocked executing 10.244.41.26 (config-changed) Insufficient peer units for ha cluster (require 3)
  logrotated/58 active executing 10.244.41.26 (config-changed) Unit is ready.
  neutron-openvswitch-octavia/1 maintenance executing 10.244.41.26 (config-changed) Configuring ovs
  public-policy-routing/44 active executing 10.244.41.26 (start) Unit is ready

Example test run:
https://solutions.qa.canonical.com/testruns/testRun/3675e02f-09b9-4695-9696-ae6ae7f4921d

tags: added: cdo-qa foundations-engine
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is blocking solutions QA release testing; sub'd to field high

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Hi and thanks for reporting! This is strange, because the only way this can happen is if the number of peer units is smaller than the config option `cluster_count` [0], which you have correctly set to 3. related_units() is called in order to determine that count; however, it is decorated with `@cached` [1]. I'm wondering if this caching could be the culprit (see the sketch after the links below).

[0] https://github.com/openstack/charm-hacluster/blob/master/hooks/utils.py#L1338
[1] https://github.com/openstack/charm-hacluster/blob/master/charmhelpers/core/hookenv.py#L539
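
For reference, a minimal sketch of the kind of check being discussed; the function names and the assumption that the peer relation is called 'hanode' are illustrative, not the charm's exact code:

from charmhelpers.core.hookenv import config, related_units, relation_ids

def expected_peer_count():
    # cluster_count is the config option mentioned above (defaults to 3)
    return int(config('cluster_count') or 3)

def current_peer_count():
    peers = 0
    for rid in relation_ids('hanode'):    # assumed peer relation name
        # related_units() is the @cached call referenced in [1]; within a
        # single hook execution it keeps returning the first (memoised)
        # result, so a stale or short list here would undercount the peers.
        peers += len(related_units(rid))
    return peers + 1                      # count the local unit as well

def sufficient_peers():
    return current_peer_count() >= expected_peer_count()

If the memoised related_units() result were captured before all peers had joined, sufficient_peers() would keep returning False for the rest of that hook run even though the peers exist.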

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This is a strange one, to be sure. I suspect we'd need to have a poke at a live (failed) system to see what is going on with the relation data for it to get into this state. Is it possible to give the OpenStack charms team a nudge when this next occurs so we can take a look, please?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1915045] Re: units stuck in insufficient peers even though there are peers

In general, we're running automated tests all the time.

Why isn't this relation data being logged? Is there a way to turn on
logging for it?

We can try to reproduce manually depending on time availability.
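
As a rough sketch of what could be added for debugging (not an existing charm option; the 'hanode' relation name and the helper name are assumptions), something like this inside the charm would dump the peer relation data to the unit log, visible via juju debug-log; recent juju versions can also show relation data out of band with `juju show-unit <unit>`:

from charmhelpers.core.hookenv import log, related_units, relation_get, relation_ids

def log_peer_relation_data():
    # Dump every peer's relation settings to the unit log for inspection
    for rid in relation_ids('hanode'):           # assumed peer relation name
        for unit in related_units(rid):
            settings = relation_get(rid=rid, unit=unit)
            log('hanode {} {}: {}'.format(rid, unit, settings), level='DEBUG')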


Changed in charm-hacluster:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack hacluster charm because there has been no activity for 60 days.]

Changed in charm-hacluster:
status: Incomplete → Expired
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We have another occurrence: https://solutions.qa.canonical.com/testruns/testRun/ab348755-9e31-4220-9ac8-3a7680883b57 . The crashdumps are here: https://oil-jenkins.canonical.com/artifacts/ab348755-9e31-4220-9ac8-3a7680883b57/index.html.
I suspect it is an issue with juju relation data not being propagated completely.

Changed in charm-hacluster:
status: Expired → New
Revision history for this message
Moises Emilio Benzan Mora (moisesbenzan) wrote :
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

re: comment #7, the crashdump's juju-status.txt does have hacluster-placement saying 'Insufficient ..', but the actual units themselves seem fine:

placement/0 active idle 0/lxd/12 10.246.167.188 8778/tcp Unit is ready
  filebeat/70 active idle 10.246.167.188 Filebeat ready.
  hacluster-placement/2 active idle 10.246.167.188 Unit is ready and clustered
  landscape-client/66 active idle 10.246.167.188 System successfully registered
  logrotated/66 active idle 10.246.167.188 Unit is ready.
  nrpe/79 active idle 10.246.167.188 icmp,5666/tcp Ready
  placement-mysql-router/2 active idle 10.246.167.188 Unit is ready
  prometheus-grok-exporter/63 active idle 10.246.167.188 9144/tcp Unit is ready
  public-policy-routing/42 active idle 10.246.167.188 Unit is ready
  telegraf/70 active idle 10.246.167.188 9103/tcp Monitoring placement/0 (source version/commit 23.01-8-...)
  ubuntu-advantage/67 active idle 10.246.167.188 Attached (esm-apps,esm-infra)
placement/1* active idle 1/lxd/11 10.246.165.31 8778/tcp Unit is ready
  filebeat/52 active idle 10.246.165.31 Filebeat ready.
  hacluster-placement/1 active idle 10.246.165.31 Unit is ready and clustered
  landscape-client/50 active idle 10.246.165.31 System successfully registered
  logrotated/51 active idle 10.246.165.31 Unit is ready.
  nrpe/67 active idle 10.246.165.31 icmp,5666/tcp Ready
  placement-mysql-router/1 active idle 10.246.165.31 Unit is ready
  prometheus-grok-exporter/53 active idle 10.246.165.31 9144/tcp Unit is ready
  public-policy-routing/32 active idle 10.246.165.31 Unit is ready
  telegraf/51 active idle 10.246.165.31 9103/tcp Monitoring placement/1 (source version/commit 23.01-8-...)
  ubuntu-advantage/56 active idle 10.246.165.31 Attached (esm-apps,esm-infra)
placement/2 active idle 2/lxd/11 10.246.167.191 8778/tcp Unit is ready
  filebeat/54 active idle 10.246.167.191 Filebeat re...

