Health manager port not deleted on unit removal

Bug #1915512 reported by James Vaughn
Affects: OpenStack Octavia Charm
Status: Fix Released
Importance: High
Assigned to: Hua Zhang

Bug Description

When removing an Octavia unit, the corresponding health manager Neutron port is not removed.

Implication:

- Amphora heartbeats may be sent to a non-existent address (corresponding to a no-longer-existent Octavia unit), resulting in missed heartbeats and amphorae being failed over unnecessarily.

Steps to reproduce:

- Deploy an OpenStack environment with Octavia (or, for HA, 3 octavia units with the hacluster charm)
- If running an HA environment, add a unit
- Delete an Octavia unit
- Check `openstack port list`; you'll see the "octavia-health-manager-octavia-N-listen-port" port still present
- Check `juju run --unit octavia/leader leader-get controller-ip-port-list`; you'll still see the old IPs. This value is set by querying Neutron, and it is used to configure "[health_manager]/controller-ip-port-list" in /etc/octavia/octavia.conf (see the example below).
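
For illustration, the rendered option looks something like this (the addresses are made-up examples; 5555 is Octavia's default health manager port). A stale address belonging to a removed unit is what causes the missed heartbeats:

[health_manager]
controller_ip_port_list = 10.10.0.10:5555,10.10.0.11:5555,10.10.0.12:5555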

Workaround:

- Remove the unit
- Delete the port *after* removing the unit. Otherwise, Juju may recreate the port.
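
For example, assuming the unit being removed is octavia/3 (hypothetical), the workaround looks like:

$ juju remove-unit octavia/3
$ juju status octavia   # wait until octavia/3 is fully gone
$ openstack port delete octavia-health-manager-octavia-3-listen-port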

I don't see any code that performs this removal (unless I'm mistaken), so if this is intentional behaviour, I guess this is a feature request.

Tags: scaleback sts
Revision history for this message
Laszlo Angyal (angyallaszlo) wrote :

The side effect of this behavior is really serious: a newly created amphora gets the list of IPs of the controller units; since some of those IPs are "ghost" IPs, the heartbeat messages never arrive, so the control plane decides to kill the (innocent) amphora, causing a continuous failover.

tags: added: sts
Changed in charm-octavia:
milestone: none → 21.04
importance: Undecided → High
Changed in charm-octavia:
assignee: nobody → Ponnuvel Palaniyappan (pponnuvel)
Changed in charm-octavia:
status: New → Confirmed
tags: added: scaleback
Hua Zhang (zhhuabj)
Changed in charm-octavia:
assignee: Ponnuvel Palaniyappan (pponnuvel) → Hua Zhang (zhhuabj)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-octavia/+/787700

Changed in charm-octavia:
status: Confirmed → In Progress
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi, I found that Juju has no pre-removal hook to clean up the hm port resources before a unit is removed, so I proposed another way to do it; please see my patch above for the details. Does anyone have a better suggestion? Meanwhile, I will also continue to complete the test code and self-testing. Thanks.

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1915512] Re: Health manager port not deleted on unit removal

On Fri, 2021-04-23 at 10:24 +0000, Hua Zhang wrote:
> Hi, I found juju has no pre hook to clean up hm port resources before
> removing unit, so I proposed another way to do it, pls see my above
> patch for the details. Does anyone have any better suggestions?
> Meanwhile I will also continue to complete the test code and
> self-test next. thanks.

Since juju 2.8.0 there is a JUJU_DEPARTING_UNIT environment variable to
figure out which unit is being removed, so the clean-up should happen
when `local_unit() == departing_unit()`. The fix could take advantage of
this, and perhaps also provide an action that operators running
juju < 2.8.0 could use to clean up the ports.
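
A minimal sketch of that check, assuming a charm-helpers version that provides hookenv.departing_unit() (see the PR linked below); delete_hm_port() is a hypothetical helper, not an existing function:

from charmhelpers.core import hookenv

def maybe_cleanup_hm_port():
    # departing_unit() returns the value of JUJU_DEPARTING_UNIT, e.g.
    # 'octavia/5' while a unit is being removed, or None otherwise.
    departing = hookenv.departing_unit()
    if departing and hookenv.local_unit() == departing:
        # This unit is the one being removed: delete its own
        # health manager port (hypothetical helper).
        delete_hm_port()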

Revision history for this message
Felipe Reyes (freyes) wrote :

More details on departing_unit() at
https://github.com/juju/charm-helpers/pull/563

Revision history for this message
Hua Zhang (zhhuabj) wrote :

@Felipe, thanks for your suggestion. I've implemented it in patchset 3, but there are still the following problems.

We have two operations, delete_hm_port and update_neutron_ip_list, and both
need departing_unit_name, which can only be obtained on the departing unit
(suppose the departing unit is octavia/7 and the leader unit is octavia/1).

So octavia/7 should run the cluster-departed hook to obtain
departing_unit_name:

@reactive.when_not('cluster.connected')
def cluster_departed():
    if hookenv.local_unit() == hookenv.departing_unit():
        # run delete_hm_port here

and octavia/1 should run the cluster-relation-changed and/or
cluster-relation-{broken,departed} hooks to run the existing
update_neutron_ip_list:

...
@reactive.when_not('cluster.connected')
def update_controller_ip_port_list():
    # run existing update_neutron_ip_list here

Now the questions are:

1. We can use relation_set() to pass departing_unit_name from octavia/7 to
the leader unit octavia/1 (see the sketch below).

2. But how do we pass an event from octavia/7 to octavia/1 to trigger
update_controller_ip_port_list after cluster_departed has run? It seems the
proposal above cannot guarantee the order: first run cluster_departed, then
run update_controller_ip_port_list.

Do you have any ideas for a better solution? Thanks.
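
A minimal sketch of point 1, assuming the cluster relation data is still writable while the departing unit's hooks run (the relation key name is made up):

import charms.reactive as reactive
from charmhelpers.core import hookenv

@reactive.when_not('cluster.connected')
def cluster_departed():
    if hookenv.local_unit() == hookenv.departing_unit():
        # Publish our own name on the cluster relation so the leader
        # can later read which unit went away.
        for rid in hookenv.relation_ids('cluster'):
            hookenv.relation_set(
                relation_id=rid,
                relation_settings={'departing-unit': hookenv.local_unit()})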

Changed in charm-octavia:
milestone: 21.04 → none
Revision history for this message
Hua Zhang (zhhuabj) wrote :

The leader unit octavia/0 also has the env variable JUJU_DEPARTING_UNIT=octavia/5 when running 'juju remove-unit octavia/5'

$ juju debug-hooks octavia/0 cluster-relation-joined cluster-relation-changed cluster-relation-broken cluster-relation-departed
...
root@juju-7781dd-octavia-8:/var/lib/juju/agents/unit-octavia-0/charm# set |grep JUJU |grep 'octavia\/'
JUJU_CONTEXT_ID=octavia/0-cluster-relation-departed-3345772523281443755
JUJU_DEPARTING_UNIT=octavia/5
JUJU_REMOTE_UNIT=octavia/5
JUJU_UNIT_NAME=octavia/0

So I implemented patchset 4 [1] to avoid passing departing_unit_name from the departing unit to the leader unit [2], but my self-test shows that the following code is never run:

reactive.flags.register_trigger(
    when_not='cluster.connected', set_flag='unit.departed')

...
@reactive.when('unit.departed')
def update_controller_ip_port_list():
    ...

because I can always see the flag cluster.connected

$ juju run --unit octavia/0 -- 'charms.reactive -p get_flags' |grep cluster
 'charms.openstack.do-default-cluster.available',
 'cluster.available',
 'cluster.connected',

After these setbacks, I feel it may be a good choice to use an 'action' to implement this logic (a rough sketch follows below). I hope to hear your thoughts.

[1] https://review.opendev.org/c/openstack/charm-octavia/+/787700/4/src/reactive/octavia_handlers.py#47
[2] https://bugs.launchpad.net/charm-octavia/+bug/1915512/comments/6
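
For reference, such an action might look roughly like this (the action name, its parameter and delete_hm_port() are hypothetical; only the charm-helpers hookenv calls are real):

#!/usr/bin/env python3
# hypothetical actions/delete-stale-hm-port script
from charmhelpers.core import hookenv

def main():
    port_name = hookenv.action_get('port-name')  # hypothetical parameter
    try:
        delete_hm_port(port_name)  # hypothetical helper wrapping Neutron
        hookenv.action_set({'deleted': port_name})
    except Exception as exc:
        hookenv.action_fail(str(exc))

if __name__ == '__main__':
    main()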

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Summary of some patches so far.

1. Patchset 4 [1]: running hookenv.departing_unit() in the leader unit.

reactive.flags.register_trigger(
    when_not='cluster.connected', set_flag='unit.departed')

...
@reactive.when('unit.departed')
def update_controller_ip_port_list():
    departing_unit = ch_core.hookenv.departing_unit()
    if departing_unit:
        # do the clean-up stuff here
        reactive.clear_flag('unit.departed')

The problem is that the flag 'cluster.connected' is always there, even after running 'juju remove-unit xxx'.
I was told that the unit being removed is the one that gets that event; the leader unit is not leaving the cluster relation, so it will retain that flag.

See comment #7 for more details - https://bugs.launchpad.net/charm-octavia/+bug/1915512/comments/7

2. So do we have to run hookenv.departing_unit() on the departing unit instead? That's what patchset 3 [2] does.

Yes, we can successfully get departing_unit_name by running hookenv.departing_unit() on the departing unit:

@reactive.when_not('cluster.connected')
def cluster_departed():
    if hookenv.local_unit() == hookenv.departing_unit():

But how do we pass departing_unit_name from the departing unit to the leader unit, given that update_controller_ip_port_list, which runs on the leader, needs this parameter? And how do we guarantee the order: first run cluster_departed, then run update_controller_ip_port_list?

See comment #6 for more details - https://bugs.launchpad.net/charm-octavia/+bug/1915512/comments/6

3. Building the list of departed unit names in the leader unit itself, to bypass problem 2. That's what patchset 2 [3] does:

for u in ch_core.hookenv.iter_units_for_relation_name('cluster'):
    running_units.append(u.unit.replace('/', '-'))
for unit_name in db_units:
    if unit_name not in running_units:
        missing_units.append(unit_name)

But this approach also has a problem: when running 'juju remove-unit octavia/3', octavia/3 still appears in the output of iter_units_for_relation_name.

So we may have to cope with it through an action or something like that.

[1] https://review.opendev.org/c/openstack/charm-octavia/+/787700/4/src/reactive/octavia_handlers.py
[2] https://review.opendev.org/c/openstack/charm-octavia/+/787700/3/src/reactive/octavia_handlers.py
[3] https://review.opendev.org/c/openstack/charm-octavia/+/787700/2/src/reactive/octavia_handlers.py

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I just removed the following 'unit.departed' flag and its trigger, and then update_controller_ip_port_list can be triggered after removing a unit:

reactive.flags.register_trigger(
    when_not='cluster.connected', set_flag='unit.departed')

@reactive.when('unit.departed')
def update_controller_ip_port_list():

This is self-test data for patchset5 - https://paste.ubuntu.com/p/rkJW5NKcKY/

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Octavia charm patch - https://review.opendev.org/c/openstack/charm-octavia/+/787700
Zaza functional test patch - https://github.com/openstack-charmers/zaza-openstack-tests/pull/600
Self-test results for the two patches - https://paste.ubuntu.com/p/7HmTvmxC48/

NOTE: the Zaza functional test patch depends on the Octavia charm patch, so CI for the Zaza functional test patch will not succeed until the Octavia charm patch is merged.

Changed in charm-octavia:
milestone: none → 21.10
Changed in charm-octavia:
status: In Progress → Fix Committed
milestone: 21.10 → 22.04
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/c/openstack/charm-octavia/+/787700
Committed: https://opendev.org/openstack/charm-octavia/commit/cc292d72e2cb47366e093490c5319503f86d93ec
Submitter: "Zuul (22348)"
Branch: master

commit cc292d72e2cb47366e093490c5319503f86d93ec
Author: Zhang Hua <email address hidden>
Date: Sat May 1 08:32:21 2021 +0800

    Delete hm port on unit removal

    JUJU_DEPARTING_UNIT will be set in leader unit as well when removing one
    unit, so update_controller_ip_port_list can use it to delete hm port and
    update neutron ip list. Besides, cluster.{connected,available} will be
    set by interface-openstack-ha when relation-{broken,departed} is changed,
    so update_controller_ip_port_list will also have a chance to run.
    In other words, as long as update_controller_ip_port_list is triggered
    for various reasons, the cleanup will be completed.

    and pin cffi to 1.14.6 and pyparsing<3.0.0 for python < 3.6

    Closes-Bug: 1915512
    Func-Test-Pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/600
    Change-Id: I88c61b8d2d0b573df7df071ed7978e83b6803c5c
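
A minimal sketch of the idea this commit message describes (not the literal merged code; the decorator set and both helper functions are illustrative assumptions):

import charms.reactive as reactive
import charmhelpers.core as ch_core

@reactive.when('leadership.is_leader')
@reactive.when('cluster.connected')
def update_controller_ip_port_list():
    # JUJU_DEPARTING_UNIT is set on the leader too while a unit is
    # being removed, so the leader can clean up from here.
    departing_unit = ch_core.hookenv.departing_unit()
    if departing_unit:
        # e.g. 'octavia/5' -> 'octavia-health-manager-octavia-5-listen-port'
        port_name = ('octavia-health-manager-{}-listen-port'
                     .format(departing_unit.replace('/', '-')))
        delete_hm_port(port_name)  # hypothetical helper
    # Re-query Neutron and refresh [health_manager]/controller_ip_port_list.
    update_neutron_ip_list()  # hypothetical helper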

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (stable/21.10)

Fix proposed to branch: stable/21.10
Review: https://review.opendev.org/c/openstack/charm-octavia/+/819246

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (stable/21.10)

Reviewed: https://review.opendev.org/c/openstack/charm-octavia/+/819246
Committed: https://opendev.org/openstack/charm-octavia/commit/d2143cefc774ef22c7058746cacc99adb94c2b13
Submitter: "Zuul (22348)"
Branch: stable/21.10

commit d2143cefc774ef22c7058746cacc99adb94c2b13
Author: Zhang Hua <email address hidden>
Date: Sat May 1 08:32:21 2021 +0800

    Delete hm port on unit removal

    JUJU_DEPARTING_UNIT will be set in leader unit as well when removing one
    unit, so update_controller_ip_port_list can use it to delete hm port and
    update neutron ip list. Besides, cluster.{connected,available} will be
    set by interface-openstack-ha when relation-{broken,departed} is changed,
    so update_controller_ip_port_list will also have a chance to run.
    In other words, as long as update_controller_ip_port_list is triggered
    for various reasons, the cleanup will be completed.

    NOTE: also resolve cherry-pick conflict with the commit f235f68f,
    remove the pyparsing pin as it is already updated in
    src/build.lock, and remove Func-Test-Pr

    Closes-Bug: 1915512
    Change-Id: I88c61b8d2d0b573df7df071ed7978e83b6803c5c
    (cherry picked from commit cc292d72e2cb47366e093490c5319503f86d93ec)
    Signed-off-by: Zhang Hua <email address hidden>

Changed in charm-octavia:
status: Fix Committed → Fix Released