Example of the failure: https://71d2302875cffcacbcb7-bd54a9781d6bc663ca8af93b25749dfd.ssl.cf5.rackcdn.com/823300/1/gate/neutron-functional-with-uwsgi/1938908/testr_results.html
Stacktrace:
ft1.53: neutron.tests.functional.agent.l3.extensions.qos.test_fip_qos_extension.TestL3AgentFipQosExtensionDVR.test_dvr_ha_router_failover_without_gw
testtools.testresult.real._StringException: Traceback (most recent call last):
File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 718, in wait_until_true
eventlet.sleep(sleep)
File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/greenthread.py", line 36, in sleep
hub.switch()
File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
return self.greenlet.switch()
eventlet.timeout.Timeout: 60 seconds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
return f(self, *args, **kwargs)
File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
return f(self, *args, **kwargs)
File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1694, in test_dvr_ha_router_failover_without_gw
self._test_dvr_ha_router_failover(enable_gw=False, vrrp_id=12)
File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1680, in _test_dvr_ha_router_failover
utils.wait_until_true(lambda: primary.ha_state == 'backup')
File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 723, in wait_until_true
raise WaitTimeout(_("Timed out after %d seconds") % timeout)
neutron.common.utils.WaitTimeout: Timed out after 60 seconds
From the logs of the failed test I can only see that the router on one of the "agents" transitioned properly, first to backup and then to primary:
2022-01-04 11:04:57.973 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to backup on agent agent2
2022-01-04 11:05:07.184 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to primary on agent agent2
but the router on the second agent did not:
2022-01-04 11:04:59.956 73811 DEBUG neutron.agent.l3.ha [-] Current transition state of router 6652fbd8-2612-48a4-92fb-1b972c20b012: backup; Initial state was: primary _enqueue_state_change /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/l3/ha.py:158
In the journal log I see something like:
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Netlink reports ha-597350ae-19 down
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) Entering FAULT STATE
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) sent 0 priority
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) removing VIPs.
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Deassigned address fe80::1034:56ff:fe78:2bcc from interface ha-597350ae-19
I'm not sure whether that is really the main reason the test failed, but we will probably need to add more logging to the L3 HA functional tests and investigate further when similar failures happen again.
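As a rough illustration of the kind of extra logging that might help here, below is a minimal sketch of a polling helper that records every state it observes and includes the last one in the timeout message. It is not the neutron implementation: `wait_for_state`, its parameters and the local `WaitTimeout` class are hypothetical names made up for this example, and it uses plain `time.sleep` instead of `eventlet.sleep` so that it runs on its own.

```python
import logging
import time

LOG = logging.getLogger(__name__)


class WaitTimeout(Exception):
    """Hypothetical stand-in for neutron.common.utils.WaitTimeout."""


def wait_for_state(get_state, expected, timeout=60, sleep=1):
    """Poll get_state() until it returns `expected`.

    Every change of the observed value is logged, and the last observed
    value is included in the exception raised on timeout.
    """
    deadline = time.monotonic() + timeout
    last_seen = object()  # sentinel: the first observation is always logged
    while True:
        current = get_state()
        if current != last_seen:
            LOG.debug("Observed state %r (waiting for %r)", current, expected)
            last_seen = current
        if current == expected:
            return current
        if time.monotonic() >= deadline:
            raise WaitTimeout(
                "Timed out after %d seconds; last state was %r, expected %r"
                % (timeout, current, expected))
        time.sleep(sleep)
```

In the failing test, the call could then look something like `wait_for_state(lambda: primary.ha_state, 'backup')`, so that a timeout would at least report which state the router was stuck in.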
Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824098