LB status stays OFFLINE until config-changed is run explicitly after the initial deployment

Bug #1961088 reported by Nobuto Murata
This bug affects 3 people
Affects                   Status          Importance   Assigned to        Milestone
OpenStack Octavia Charm   Fix Committed   High         Samuel Walladge
Yoga                      In Progress     Undecided    Unassigned
Zed                       In Progress     Undecided    Unassigned

Bug Description

focal-ussuri

Octavia Amphora LBs get stuck at OFFLINE although the actual data plane is working. There seems to be an issue somewhere in updating the LB status in the control plane. However, after running the config-changed hook explicitly, everything starts working.

Might be related:
https://bugs.launchpad.net/charm-octavia/+bug/1953639

Restarting Octavia-related services didn't do the trick; running the whole config-changed hook was required.

$ date; openstack loadbalancer list --format yaml
Wed Feb 16 15:15:16 UTC 2022
- id: 092cfccd-cb9f-46b1-961f-e92215c72447
  name: openstack-integrator-6b38f382a2fc-kubernetes-master
  operating_status: OFFLINE
  project_id: ef8e573ad99d45a2b9cf1725a355c0f3
  provider: amphora
  provisioning_status: ACTIVE
  vip_address: 10.5.5.14

^^^ operating_status: OFFLINE

But actual traffic to the backend through the floating IP (192.168.151.61) in front of the LB works:

$ openstack floating ip list --fixed-ip 10.5.5.14 --format yaml
- Fixed IP Address: 10.5.5.14
  Floating IP Address: 192.168.151.61
  Floating Network: b81a6386-d544-4502-8087-901f3012576b
  ID: 86189424-7cfc-48b9-9a1f-ff864b24ef49
  Port: 872da267-4a7d-42a6-b0b3-bd35c14701e6
  Project: ef8e573ad99d45a2b9cf1725a355c0f3

$ kubectl -v6 get nodes
I0216 15:21:03.347246 94802 loader.go:372] Config loaded from file: /home/ubuntu/.kube/config
I0216 15:21:03.376692 94802 round_trippers.go:553] GET https://192.168.151.61:443/api/v1/nodes?limit=500 200 OK in 21 milliseconds
NAME                             STATUS   ROLES    AGE   VERSION
juju-82a2fc-k8s-on-openstack-1   Ready    <none>   37m   v1.23.3

Tried to restart some services:
# date; systemctl restart octavia-worker.service
Wed Feb 16 15:29:19 UTC 2022
# date; systemctl restart octavia-health-manager.service
Wed Feb 16 15:31:37 UTC 2022
# date; systemctl restart octavia-housekeeping.service
Wed Feb 16 15:33:48 UTC 2022
# date; systemctl restart apache2.service
Wed Feb 16 15:35:48 UTC 2022
-> no luck

Running the config-changed hook explicitly:

$ juju run --unit octavia/0 -- hooks/config-changed
Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Fetched 336 kB in 2s (157 kB/s)
Reading package lists...
Adding user systemd-network to group octavia
inactive
octavia-api (enabled by site administrator)
active
active
active
active
Synchronizing state of octavia-api.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable octavia-api
Unit /etc/systemd/system/octavia-api.service is masked, ignoring.

Then, LB status gets ONLINE.

$ date; openstack loadbalancer list --format yaml
Wed Feb 16 15:43:15 UTC 2022
- id: 092cfccd-cb9f-46b1-961f-e92215c72447
  name: openstack-integrator-6b38f382a2fc-kubernetes-master
  operating_status: ONLINE
  project_id: ef8e573ad99d45a2b9cf1725a355c0f3
  provider: amphora
  provisioning_status: ACTIVE
  vip_address: 10.5.5.14

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

It's somewhat reliably reproducible on a test env, and we've seen this symptom in a customer engagement. While the data plane is forwarding traffic, automated testing such as Tempest relies on operating_status=ONLINE to determine whether the LB is ready before proceeding with subsequent operations/tests.

Nobuto Murata (nobuto)
description: updated
Revision history for this message
Nobuto Murata (nobuto) wrote :

In another deployment with focal-wallaby, I tried to find out what actually changed after running the config-changed hook. There is a one-line diff in octavia.conf. Not 100% sure it's the root cause though.

# git diff --no-index /etc/octavia.bak/ /etc/octavia/
diff --git a/etc/octavia.bak/octavia.conf b/etc/octavia/octavia.conf
index 7a54623..9651f07 100644
--- a/etc/octavia.bak/octavia.conf
+++ b/etc/octavia/octavia.conf
@@ -20,6 +20,7 @@ enabled_provider_agents = ovn

 [health_manager]
+bind_ip = fc00:675d:4035:73b3:f816:3eff:fe7a:e0b0
 controller_ip_port_list = fc00:675d:4035:73b3:f816:3eff:fe7a:e0b0:5555
 heartbeat_key = 20e8350e-950f-40fb-bb36-b9259e355523
 health_update_threads = 4

Revision history for this message
Nobuto Murata (nobuto) wrote :

> [health_manager]
> +bind_ip = fc00:675d:4035:73b3:f816:3eff:fe7a:e0b0
> controller_ip_port_list = fc00:675d:4035:73b3:f816:3eff:fe7a:e0b0:5555

This is weird because in the initial deployment controller_ip_port_list had the exact, expected IPv6 address, but bind_ip wasn't filled with the same one.

Looks like controller_ip_port_list is taken from the Neutron API, while bind_ip comes from the actual unit.

$ openstack port list --tag charm-octavia --format yaml
- Fixed IP Addresses:
  - ip_address: fc00:675d:4035:73b3:f816:3eff:fe7a:e0b0
    subnet_id: aac5a632-6fb0-4d2a-a8eb-483dd0963c8a
  ID: 393092c8-3391-4505-b982-e1c0cfec48a9
  MAC Address: fa:16:3e:7a:e0:b0
  Name: octavia-health-manager-octavia-0-listen-port
  Status: DOWN

Revision history for this message
DUFOUR Olivier (odufourc) wrote :

So far this is something I was able to reproduce systematically on a Focal-Wallaby deployment, but it happens on Focal-Xena as well.

Running the "config-changed" hook updates the configuration as mentioned by Nobuto and fixes the issue as well.

Changed in charm-octavia:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Changed in charm-octavia:
assignee: Aurelien Lourot (aurelien-lourot) → Dmitrii Shcherbakov (dmitriis)
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Looks like bind_ip is set to 127.0.0.1 by default:

https://github.com/openstack/octavia/blob/3c5bbbc189efe9f3ed4caae04a23d7a13718d368/octavia/common/config.py#L286-L288
    cfg.IPOpt('bind_ip', default='127.0.0.1',
              help=_('IP address the controller will listen on for '
                     'heart beats')),

It is used in UDPStatusGetter

https://github.com/openstack/octavia/blob/3c5bbbc189efe9f3ed4caae04a23d7a13718d368/octavia/amphorae/drivers/health/heartbeat_udp.py#L40-L48
class UDPStatusGetter(object):
    """This class defines methods that will gather heartbeats
    The heartbeats are transmitted via UDP and this class will bind to a port
    and absorb them
    """
    def __init__(self):
        self.key = cfg.CONF.health_manager.heartbeat_key
        self.ip = cfg.CONF.health_manager.bind_ip

which is used by the Octavia health manager:

https://github.com/openstack/octavia/blob/3c5bbbc189efe9f3ed4caae04a23d7a13718d368/octavia/cmd/health_manager.py#L42-L45

At first glance, not having `bind_ip` set to a non-default value seems relevant; I'll keep looking into it.
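
To illustrate the consequence (a minimal, hypothetical sketch, not Octavia's actual code): a UDP listener bound to the default 127.0.0.1 only ever sees loopback traffic, so heartbeats the amphorae send to the unit's management address from controller_ip_port_list never reach the health manager, and the operating status is never refreshed.

import socket

# Hypothetical illustration only: a health-manager-style UDP listener
# left on the default bind_ip.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 5555))  # default bind_ip from config.py above

# An amphora sends its heartbeat to the address advertised in
# controller_ip_port_list (the unit's management IPv6 address), e.g.
#   sendto(heartbeat, ("fc00:...:e0b0", 5555))
# That datagram arrives on the management interface, not on loopback,
# so recvfrom() below never sees it and the LB stays OFFLINE.
listener.settimeout(5)
try:
    data, addr = listener.recvfrom(4096)
except socket.timeout:
    print("no heartbeats received on loopback")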

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

On a fresh octavia charm deployment I got a test failure, and not all units had bind_ip set:

$ juju download octavia
Fetching charm "octavia" using "stable" channel and base "amd64/ubuntu/20.04"
Install the "octavia" charm with:
    juju deploy ./octavia_ac92496.charm

# ...

tox -e func-target focal-ussuri

# ...
--2022-03-15 14:55:13-- http://10.5.150.251/
Connecting to 10.5.150.251:80... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2022-03-15 14:55:13 ERROR 503: Service Unavailable.

2022-03-15 14:55:13 [INFO] ERROR
2022-03-15 14:55:13 [INFO] ======================================================================
2022-03-15 14:55:13 [INFO] ERROR: test_create_loadbalancer (zaza.openstack.charm_tests.octavia.tests.LBAASv2Test)
2022-03-15 14:55:13 [INFO] Create load balancer.
2022-03-15 14:55:13 [INFO] ----------------------------------------------------------------------
2022-03-15 14:55:13 [INFO] Traceback (most recent call last):
2022-03-15 14:55:13 [INFO] File "/tmp/octavia/.tox/func-target/lib/python3.8/site-packages/zaza/openstack/charm_tests/octavia/tests.py", line 448, in test_create_loadbalancer
2022-03-15 14:55:13 [INFO] assert snippet in self._get_payload(lb_fp['floating_ip_address'])
2022-03-15 14:55:13 [INFO] File "/tmp/octavia/.tox/func-target/lib/python3.8/site-packages/tenacity/__init__.py", line 324, in wrapped_f
2022-03-15 14:55:13 [INFO] return self(f, *args, **kw)
2022-03-15 14:55:13 [INFO] File "/tmp/octavia/.tox/func-target/lib/python3.8/site-packages/tenacity/__init__.py", line 404, in __call__
2022-03-15 14:55:13 [INFO] do = self.iter(retry_state=retry_state)
2022-03-15 14:55:13 [INFO] File "/tmp/octavia/.tox/func-target/lib/python3.8/site-packages/tenacity/__init__.py", line 360, in iter
2022-03-15 14:55:13 [INFO] raise retry_exc.reraise()
2022-03-15 14:55:13 [INFO] File "/tmp/octavia/.tox/func-target/lib/python3.8/site-packages/tenacity/__init__.py", line 193, in reraise
2022-03-15 14:55:13 [INFO] raise self.last_attempt.result()
2022-03-15 14:55:13 [INFO] File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
2022-03-15 14:55:13 [INFO] return self.__get_result()
2022-03-15 14:55:13 [INFO] File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
2022-03-15 14:55:13 [INFO] raise self._exception
2022-03-15 14:55:13 [INFO] File "/tmp/octavia/.tox/func-target/lib/python3.8/site-packages/tenacity/__init__.py", line 407, in __call__
2022-03-15 14:55:13 [INFO] result = fn(*args, **kwargs)
2022-03-15 14:55:13 [INFO] File "/tmp/octavia/.tox/func-target/lib/python3.8/site-packages/zaza/openstack/charm_tests/octavia/tests.py", line 382, in _get_payload
2022-03-15 14:55:13 [INFO] return subprocess.check_output(
2022-03-15 14:55:13 [INFO] File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
2022-03-15 14:55:13 [INFO] return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2022-03-15 14:55:13 [INFO] File "/usr/lib/python3.8/subprocess.py", line 516, in run
2022-03-15 14:55:13 [INFO] raise CalledProcessError(retcode, process.args,
2022-03-15 14:55:13 [INFO] subprocess.Ca...


Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Looking at the reactive flags, octavia/0 has charm.octavia.action_setup_hm_port and leadership.is_leader set, which differs from octavia/1 and octavia/2 (those two have identical sets of flags):

https://paste.ubuntu.com/p/5D6hjRS8tT/

`juju show-status-log` outputs for 3 units:
https://paste.ubuntu.com/p/qTJMdZfTWX/

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Made another redeployment and `bind_ip` is present on all units:

$ juju run --application octavia 'sudo grep bind_ip /etc/octavia/octavia.conf'
- Stdout: |
    bind_ip = fc00:3c11:711:a792:f816:3eff:fef6:8df
  UnitId: octavia/0
- Stdout: |
    bind_ip = fc00:3c11:711:a792:f816:3eff:fe88:1160
  UnitId: octavia/1
- Stdout: |
    bind_ip = fc00:3c11:711:a792:f816:3eff:fef7:9187
  UnitId: octavia/2

However, I still got an error for the LB operating state.

$ openstack loadbalancer list
+--------------------------------------+----------------+----------------------------------+-------------+---------------------+------------------+----------+
| id | name | project_id | vip_address | provisioning_status | operating_status | provider |
+--------------------------------------+----------------+----------------------------------+-------------+---------------------+------------------+----------+
| 8c8628df-bd60-4173-968f-3bae954a1102 | zaza-amphora-0 | 0a149219a9794eba93ed14380bdc25af | 192.168.0.3 | ACTIVE | ERROR | amphora |
+--------------------------------------+----------------+----------------------------------+-------------+---------------------+------------------+----------+

So maybe my reproducer isn't exactly the same as the one Nobuto has.

Changed in charm-octavia:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Nobuto Murata (nobuto) wrote :

It would be nice if this issue could be tracked down.

Changed in charm-octavia:
assignee: Dmitrii Shcherbakov (dmitriis) → nobody
tags: added: cdo-qa cdo-tempest
Revision history for this message
Vern Hart (vern) wrote :

I just encountered this bug. Focal Ussuri. Running the config-changed hook resolved it.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

I wonder if we should just set `bind_ip` to '::' and stop trying to lock it down to the load balancer instance management network. This would make things much simpler and less error-prone.

The low-level socket calls made to set up the listener [0] will bind the socket to both IPv4 and IPv6 when given a v6 address, and use the IPv4-mapped-on-IPv6 address type to allow the program to work with both [1] (see the sketch after the references below).

0: https://github.com/openstack/octavia/blob/3c5bbbc189efe9f3ed4caae04a23d7a13718d368/octavia/amphorae/drivers/health/heartbeat_udp.py#L62-L86
1: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man7/ipv6.7?id=902460e768b5c19d693f7ef87979b1613800cddf#n130
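
A minimal sketch of that behaviour (plain Python socket calls under the Linux default of dual-stack sockets; not Octavia's code): a UDP socket bound to '::' receives IPv6 heartbeats natively and IPv4 ones as IPv4-mapped addresses such as ::ffff:10.5.5.20.

import socket

# Sketch only: dual-stack UDP listener on the IPv6 wildcard address.
sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
# Explicitly allow IPv4-mapped addresses (0 is the Linux default unless
# net.ipv6.bindv6only is set).
sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
sock.bind(("::", 5555))

# Blocks until any heartbeat arrives; an IPv4 sender shows up as
# '::ffff:a.b.c.d', an IPv6 sender keeps its native address. Either way
# the charm no longer has to compute a per-unit bind_ip.
data, (addr, port, *_rest) = sock.recvfrom(4096)
print(addr, len(data))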

Revision history for this message
Nobuto Murata (nobuto) wrote :

> if we should just set `bind_ip` to '::'

This sounds reasonable to me, although I'm curious why the issue happened in the first place, in terms of the gap between enabling the link and getting the IPv6 RA (I had another issue along these lines: https://bugs.launchpad.net/charm-ovn-chassis/+bug/1965883).

Anyway, it sounds like a path forward to overcome the current ongoing issue, for which we almost always need to apply a workaround.

Revision history for this message
Samuel Walladge (swalladge) wrote (last edit):

As for why `bind_ip` doesn't get set initially: is it possible that the initial configuration write happens before the interfaces are set up? The function that gets the health_manager_bind_ip can return None if an exception happens while retrieving the interfaces, or if no interfaces are found:
https://opendev.org/openstack/charm-octavia/src/commit/975502813f51f022887508e7d2c5840308c198d8/src/lib/charm/openstack/octavia.py#L111-L115

This kind of race condition would explain why bind_ip is not set on deployment, but then becomes set when a config change is triggered later (and thus this function is re-evaluated).
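
To make the suspected race concrete, here is a rough, hypothetical sketch of the shape of such a lookup (illustrative names only, not the charm's actual implementation):

import netifaces

def health_manager_bind_ip(hm_iface='o-hm0'):
    # Hypothetical: return the unit's global IPv6 address on the health
    # manager port, or None if it is not there yet.
    try:
        addrs = netifaces.ifaddresses(hm_iface).get(netifaces.AF_INET6, [])
    except ValueError:
        return None  # interface does not exist yet
    for addr in addrs:
        if not addr['addr'].startswith('fe80'):
            return addr['addr'].split('%')[0]
    return None  # interface is up but no global address assigned yet

# If octavia.conf is rendered at this point, before the RA has assigned
# the global IPv6 address, bind_ip is simply omitted and Octavia falls
# back to 127.0.0.1. A later config-changed hook re-evaluates the lookup
# and writes the real value, which matches the behaviour in this bug.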

It totally makes sense, by the way, for the LB to display as OFFLINE while still working in this case: if the health manager is defaulting to listening on 127.0.0.1, it cannot be reached by whatever is checking the health. So the missing `bind_ip` is definitely related, at least.

> if we should just set `bind_ip` to '::'

Two questions about this:

- Will this introduce a security issue? Is it OK for the health manager to be listening on all interfaces?

- Do we want to always use this interface, or only use this as a fallback if health_manager_bind_ip() fails to get the interfaces?

Changed in charm-octavia:
assignee: nobody → Samuel Walladge (swalladge)
Changed in charm-octavia:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/charm-octavia/+/866819

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/charm-octavia/+/866821

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/charm-octavia/+/869646

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-octavia (stable/ussuri)

Change abandoned by "Samuel Walladge <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/charm-octavia/+/866819
Reason: low priority for ussuri

Revision history for this message
Bartłomiej Poniecki-Klotz (barteus) wrote :

We got hit by this issue during a customer deployment of Kubernetes on top of OpenStack.

The OpenStack version is stable/yoga

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/c/openstack/charm-octavia/+/858608
Committed: https://opendev.org/openstack/charm-octavia/commit/d8d8963667cb620370e2c424dbe668460ab99ff1
Submitter: "Zuul (22348)"
Branch: master

commit d8d8963667cb620370e2c424dbe668460ab99ff1
Author: Samuel Walladge <email address hidden>
Date: Wed Sep 21 14:36:17 2022 +0930

    Wait for management interface IP to be assigned

    There can be a delay between the interface being created,
    and an IP address getting assigned,
    which previously caused a race condition where
    the config could be rendered before the IP address was ready
    resulting in the health manager bind_ip to be empty.

    This ensures that the IP address will be ready before continuing,
    which will ensure that the config rendering will not happen until ready,
    and the configure-resources action will only return once it's all done.

    Closes-Bug: #1961088
    Change-Id: I2cae5f0e307c8cd14f1831f3416d890ad604b705
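
For illustration only, a rough sketch of the "wait until the IP is assigned" approach described above, using hypothetical helper names and timings (the actual change is in the linked review):

import netifaces
import tenacity

def health_manager_bind_ip(hm_iface='o-hm0'):
    # Hypothetical lookup: the unit's global IPv6 address on the health
    # manager port, or None if it has not been assigned yet.
    try:
        addrs = netifaces.ifaddresses(hm_iface).get(netifaces.AF_INET6, [])
    except ValueError:
        return None
    globals_ = [a['addr'] for a in addrs if not a['addr'].startswith('fe80')]
    return globals_[0].split('%')[0] if globals_ else None

@tenacity.retry(
    retry=tenacity.retry_if_result(lambda ip: ip is None),
    wait=tenacity.wait_fixed(5),
    stop=tenacity.stop_after_delay(300),
)
def wait_for_hm_bind_ip():
    # Keep retrying until the management port actually has an address.
    return health_manager_bind_ip()

# Only render octavia.conf (and let configure-resources return) once this
# yields a real address instead of None.
bind_ip = wait_for_hm_bind_ip()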

Changed in charm-octavia:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-octavia (stable/yoga)

Change abandoned by "Samuel Walladge <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/charm-octavia/+/866821

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/charm-octavia/+/897163

Revision history for this message
Nobuto Murata (nobuto) wrote :

Just for the record, I tried the latest/edge channel with the fix and it worked for me. Previously, this testbed was missing bind_ip after the initial configure-resources action.

====
+ juju run octavia/leader --wait=10m configure-resources
Running operation 9 with 1 task
  - task 10 on unit-octavia-0

Waiting for task 10...
Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists...
Adding user systemd-network to group octavia
inactive
Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists...
Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists...
octavia-api (enabled by site administrator)
active
active
active
active
active
ovs-vsctl: no row "o-hm0" in table Interface
Synchronizing state of octavia-api.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable octavia-api
Unit /etc/systemd/system/octavia-api.service is masked, ignoring.
Warning: Stopping systemd-networkd.service, but it can still be activated by:
  systemd-networkd.socket
+ juju exec --application octavia -- grep bind_ip /etc/octavia/octavia.conf
bind_ip = fc00:423c:c5f5:ce3b:f816:3eff:fea4:5d11

====
