Canonical Juju

private-address not refreshed in relation-data after binding change

Bug #1961448 reported by Rodrigo Barbieri on 2022-02-18

This bug affects 2 people

Affects                     Status        Importance  Assigned to       Milestone
Canonical Juju              Incomplete    High        Joseph Phillips
OpenStack HA Cluster Charm  Fix Released  Undecided   Rodrigo Barbieri  OpenStack HA Cluster Charm 22.04

Bug Description

Juju version used: 2.9.12

On a fully functional deployment, hacluster has the correct binding for the hanode endpoint (i.e. one matching the IP assigned to the unit). Changing the binding to an incorrect one (by running juju bind hacluster <wrong_binding> --force) causes network-get to fail, as expected, and the hanode-relation-changed hook fails with it. Because network-get fails under the incorrect binding, the private-address property disappears from the relation-data, so the IP cannot be written to the ring0_addr properties in corosync.conf.

Setting the binding back to the correct one (through juju bind hacluster <correct_binding>) restores the network-get functionality, but it does not restore the missing private-address property to the relation-data. The hanode-relation-changed hook failure therefore persists, and ring0_addr still cannot be written to corosync.conf because the private-address property is not found in the relation-data.

How can I force a refresh of the relation-data so that it re-reads its parameters from network-get?

As I understand it, the properties private-address, ingress-address and egress-subnets are "essential" properties that are present in every endpoint, as long as the network-get command is successful.

Is something, such as a hook error or a blocked state, preventing the relation-data from being refreshed or from re-querying network-get?

Things I have tried:

1) First I tried clearing the errors caused by the wrong binding change until the status was back to active/idle, before invoking "juju bind hacluster <correct_binding>", by:

a) juju resolved --no-retry
b) writing ring0_addr values in corosync.conf manually

Still, changing the binding to the correct one resulted in errors due to the lack of private-address property.

2) With the correct binding now set, I then tried to refresh the property and overcome the errors in several ways:

a) juju resolved --no-retry
b) writing ring0_addr values in corosync.conf manually
c) setting the private-address properties manually through relation-set
d) restarting jujud
e) restarting the lxd container

None of those worked. Despite my having set the property manually, the code at [0] still read "None" from the private-address properties in the relation-data, as if they weren't set.

[0] https://github.com/juju/charm-helpers/blob/446cbfdad83e15b5cfd20f862d3c3b5b1956b998/charmhelpers/contrib/hahelpers/cluster.py#L187

Rodrigo Barbieri (rodrigo-barbieri2010) on 2022-02-21

description:

updated

Joseph Phillips (manadart) on 2022-02-22

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
assignee:	nobody → Joseph Phillips (manadart)
milestone:	none → 2.9.26

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2022-02-22:

Quick update: I repeated my tests, this time also running relation-set for the ingress-address and egress-subnets properties. The logs still showed "None" being read from the relation. I kept retrying "juju resolved --no-retry" and eventually saw the properties being read successfully. A later juju config command flipping the debug value broke it again, but it healed itself afterwards.

So right now it seems the most consistent workaround is to apply the properties manually through relation-set and insist on "juju resolved --no-retry" until it finally works. Still, a bugfix is needed to force the network-get to be invoked and update the properties.
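The manual workaround described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the JSON below is an assumed shape for the output of network-get <binding> --format json with made-up addresses, the relation id "hanode:1" is hypothetical, and reusing the ingress address as private-address is an assumption rather than confirmed Juju behaviour. The helper only builds the relation-set command line; it does not run any hook tool.

```python
import json

# Assumed shape of `network-get hanode --format json` output.
# All addresses here are invented for illustration.
SAMPLE_NETWORK_GET = json.dumps({
    "bind-addresses": [
        {"interface-name": "eth0",
         "addresses": [{"address": "10.0.0.11", "cidr": "10.0.0.0/24"}]}
    ],
    "egress-subnets": ["10.0.0.11/32"],
    "ingress-addresses": ["10.0.0.11"],
})


def relation_set_args(network_get_json, relation_id):
    """Build a relation-set argv from network-get JSON (hypothetical helper)."""
    info = json.loads(network_get_json)
    ingress = info["ingress-addresses"][0]
    egress = ",".join(info["egress-subnets"])
    return [
        "relation-set", "-r", relation_id,
        # Assumption: private-address mirrors the first ingress address.
        "private-address={}".format(ingress),
        "ingress-address={}".format(ingress),
        "egress-subnets={}".format(egress),
    ]


# "hanode:1" is a hypothetical relation id; list the real ones with
# `relation-ids hanode` inside a hook context.
print(" ".join(relation_set_args(SAMPLE_NETWORK_GET, "hanode:1")))
```

The resulting command would still need to be run inside a hook context on each affected unit, followed by "juju resolved --no-retry" as described above.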

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2022-02-23 (last edit on 2022-03-14):

Joseph Phillips (manadart) on 2022-03-06

Changed in juju:
status:	Triaged → Fix Committed

Joseph Phillips (manadart) on 2022-03-06

Changed in juju:
status:	Fix Committed → Triaged

Canonical Juju QA Bot (juju-qa-bot) on 2022-03-09

Changed in juju:
milestone:	2.9.26 → 2.9.27

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2022-03-14:

Hi @Joseph, could you please post the PR link here? Thanks in advance.

Canonical Juju QA Bot (juju-qa-bot) on 2022-03-18

Changed in juju:
milestone:	2.9.27 → 2.9.28

Rodrigo Barbieri (rodrigo-barbieri2010) on 2022-03-25

tags:

added: sts

Joseph Phillips (manadart) on 2022-03-30

Changed in juju:
status:	Triaged → Incomplete

Joseph Phillips (manadart) wrote on 2022-03-30:

As we discussed, the logic exists to update network relation data upon rebind.

I tried to reproduce this and got the expected behaviour on MAAS.

Spaces:
https://pastebin.canonical.com/p/H2dVddFqnb/

I deployed mariadb bound to space-default, and related it to mediawiki. Relation data looked like this in the DB:
https://pastebin.canonical.com/p/XPqNzc5YxK/

I rebound mariadb:
https://pastebin.canonical.com/p/fqXBxRGqbX/

Relation data changed as expected:
https://pastebin.canonical.com/p/ShVHpWtRsd/

This behaviour is triggered by the agent itself in the config-changed event following rebind. This *could* be blocked if the charm was in an error state requiring resolution, but apart from that I'd need more to go on. The happy path appears to work as designed.

Canonical Juju QA Bot (juju-qa-bot) on 2022-03-30

Changed in juju:
milestone:	2.9.28 → 2.9.29

Rodrigo Barbieri (rodrigo-barbieri2010) on 2022-03-30

description:

updated

Joseph Phillips (manadart) wrote on 2022-04-01:

I ran the same steps as above with 2.9.12 and got the same result.

Changed in juju:
milestone:	2.9.29 → none

OpenStack Infra (hudson-openstack) wrote on 2022-04-06: Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-hacluster/+/836887

Changed in charm-hacluster:
status:	New → In Progress

Felipe Reyes (freyes) on 2022-04-12

Changed in charm-hacluster:
assignee:	nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
milestone:	none → 22.04

OpenStack Infra (hudson-openstack) wrote on 2022-04-12: Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/836887
Committed: https://opendev.org/openstack/charm-hacluster/commit/d54de3d3464352ca07e4b9d9f6a5c8350464b29b
Submitter: "Zuul (22348)"
Branch: master

commit d54de3d3464352ca07e4b9d9f6a5c8350464b29b
Author: Rodrigo Barbieri <email address hidden>
Date: Wed Apr 6 18:42:13 2022 -0300

Prevent errors when private-address=None

    Whenever a peer returns None as its IP, it results in
    misconfiguration in corosync.conf, which results in
    a series of cascading hook errors that are difficult to
    sort out.

    More specifically, this usually happens when network-get
    does not work for the current binding. The main problem
    is that when changing bindings, a hook fires before the
    network-get data is updated. This hook fails and prevents
    the network-get from being re-read.

    This patch changes the code behavior to ignore None IP
    entries, therefore gracefully exiting and deferring further
    configuration due to insufficient number of peers when that
    happens, so that a later hook can successfully read the IP
    from the relation and set the IPs correctly in corosync.

Closes-bug: #1961448
Change-Id: I5ed140a17e184fcf6954d0f66e25f74564bd281c
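
The behaviour described in this commit message can be illustrated with a minimal sketch. It is not the code from the actual patch: the function names, the unit names, the addresses and the min_peers parameter are all hypothetical, and the real charm logic may differ. The idea shown is only that peers reporting None are filtered out, and that configuration is deferred when too few usable addresses remain.

```python
def usable_peer_ips(peer_ips):
    """Drop peers whose private-address is missing (None or empty)."""
    return {unit: ip for unit, ip in peer_ips.items() if ip}


def ready_to_configure(peer_ips, min_peers):
    """Return True only when enough peers have published an address."""
    return len(usable_peer_ips(peer_ips)) >= min_peers


# Hypothetical relation data: one peer has not published its address yet.
peers = {"hacluster/1": "10.0.0.12", "hacluster/2": None}

if ready_to_configure(peers, min_peers=2):
    print("write ring0_addr entries:", sorted(usable_peer_ips(peers).values()))
else:
    # Exit gracefully so a later hook can retry once the address appears.
    print("deferring corosync configuration: waiting for peer addresses")
```

With this shape, a hook that runs before the relation data is refreshed returns cleanly instead of writing a None address into corosync.conf.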

Changed in charm-hacluster:
status:	In Progress → Fix Committed

Alex Kavanagh (ajkavanagh) on 2022-05-10

Changed in charm-hacluster:
status:	Fix Committed → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2022-05-12: Fix proposed to charm-hacluster (stable/focal)

Fix proposed to branch: stable/focal
Review: https://review.opendev.org/c/openstack/charm-hacluster/+/841588

OpenStack Infra (hudson-openstack) wrote on 2022-05-13: Fix merged to charm-hacluster (stable/focal)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/841588
Committed: https://opendev.org/openstack/charm-hacluster/commit/07b7e5e367bde8d15ea7a2c1b631038c73158217
Submitter: "Zuul (22348)"
Branch: stable/focal

commit 07b7e5e367bde8d15ea7a2c1b631038c73158217
Author: Rodrigo Barbieri <email address hidden>
Date: Wed Apr 6 18:42:13 2022 -0300

Prevent errors when private-address=None

    Whenever a peer returns None as its IP, it results in
    misconfiguration in corosync.conf, which results in
    a series of cascading hook errors that are difficult to
    sort out.

    Closes-bug: #1961448
    Change-Id: I5ed140a17e184fcf6954d0f66e25f74564bd281c
    (cherry picked from commit d54de3d3464352ca07e4b9d9f6a5c8350464b29b)

tags:

added: in-stable-focal
