Functional tests for HA routers fail because the router transitions to the FAULT state

Bug #1956958 reported by Slawek Kaplonski
This bug affects 1 person
Affects: neutron
Status: Confirmed
Importance: Critical
Assigned to: Slawek Kaplonski
Milestone: (none)

Bug Description

Example of the failure: https://71d2302875cffcacbcb7-bd54a9781d6bc663ca8af93b25749dfd.ssl.cf5.rackcdn.com/823300/1/gate/neutron-functional-with-uwsgi/1938908/testr_results.html

Stacktrace:

ft1.53: neutron.tests.functional.agent.l3.extensions.qos.test_fip_qos_extension.TestL3AgentFipQosExtensionDVR.test_dvr_ha_router_failover_without_gw
testtools.testresult.real._StringException: Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 718, in wait_until_true
    eventlet.sleep(sleep)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/greenthread.py", line 36, in sleep
    hub.switch()
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 60 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
    return f(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", line 183, in func
    return f(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1694, in test_dvr_ha_router_failover_without_gw
    self._test_dvr_ha_router_failover(enable_gw=False, vrrp_id=12)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/agent/l3/test_dvr_router.py", line 1680, in _test_dvr_ha_router_failover
    utils.wait_until_true(lambda: primary.ha_state == 'backup')
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/common/utils.py", line 723, in wait_until_true
    raise WaitTimeout(_("Timed out after %d seconds") % timeout)
neutron.common.utils.WaitTimeout: Timed out after 60 seconds
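
The second traceback shows the test polling `primary.ha_state` through `utils.wait_until_true` until the 60-second timeout fires. A minimal standard-library sketch of that polling pattern follows. It is not neutron's implementation (which relies on eventlet); the `FakeRouter` class and the parameter defaults are assumptions made for illustration only.

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate does not become true in time."""


def wait_until_true(predicate, timeout=60, sleep=1):
    """Poll predicate() until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)


class FakeRouter:
    """Hypothetical stand-in for a router stuck in one HA state."""
    ha_state = "primary"


if __name__ == "__main__":
    router = FakeRouter()
    try:
        # Mirrors the failing wait: the state never becomes 'backup'.
        wait_until_true(lambda: router.ha_state == "backup",
                        timeout=1, sleep=0.1)
    except WaitTimeout as exc:
        print(exc)
```

A router that never leaves its current state exhausts the deadline and surfaces as `WaitTimeout`, which is the failure reported by the test.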

From the logs of the failed test, I can only see that the router on one of the "agents" transitioned properly, first to backup and then to primary:

2022-01-04 11:04:57.973 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to backup on agent agent2
2022-01-04 11:05:07.184 73811 INFO neutron.agent.l3.ha [-] Router 12724de0-0899-4f11-b034-0776f8d5a46c transitioned to primary on agent agent2

but the router on the second agent did not:

2022-01-04 11:04:59.956 73811 DEBUG neutron.agent.l3.ha [-] Current transition state of router 6652fbd8-2612-48a4-92fb-1b972c20b012: backup; Initial state was: primary _enqueue_state_change /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/l3/ha.py:158

In the journal log I see something like:

sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Netlink reports ha-597350ae-19 down
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) Entering FAULT STATE
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) sent 0 priority
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: (VR_12) removing VIPs.
sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 Keepalived_vrrp[113555]: Deassigned address fe80::1034:56ff:fe78:2bcc from interface ha-597350ae-19

I'm not sure whether that is really the main reason the test failed, but we will probably need to add more logging to the L3 HA functional tests and investigate further when similar failures happen again.
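
When comparing the journal with the agent log, it can help to pull only the VRRP state changes out of the keepalived lines. The following is a small sketch that does this with a regular expression; the pattern is inferred from the journal lines quoted in this report and may not cover every keepalived message format, and the function name is made up for this example.

```python
import re

# Matches keepalived state changes such as "(VR_12) Entering FAULT STATE".
STATE_RE = re.compile(
    r"\((?P<instance>VR_\d+)\) Entering (?P<state>\w+) STATE")


def vrrp_transitions(journal_lines):
    """Return (instance, state) tuples for each state change found."""
    transitions = []
    for line in journal_lines:
        match = STATE_RE.search(line)
        if match:
            transitions.append(
                (match.group("instance"), match.group("state")))
    return transitions


# Two of the journal lines quoted in this report.
SAMPLE = [
    "sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 "
    "Keepalived_vrrp[113555]: Netlink reports ha-597350ae-19 down",
    "sty 04 12:04:58 ubuntu-focal-ovh-bhs1-0027878805 "
    "Keepalived_vrrp[113555]: (VR_12) Entering FAULT STATE",
]

if __name__ == "__main__":
    print(vrrp_transitions(SAMPLE))  # [('VR_12', 'FAULT')]
```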

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824098

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/824098
Committed: https://opendev.org/openstack/neutron/commit/f3e836217c2dd84b96938d292a3fe94f346ccdc8
Submitter: "Zuul (22348)"
Branch: master

commit f3e836217c2dd84b96938d292a3fe94f346ccdc8
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jan 11 09:56:32 2022 +0100

    [Functional] Add extra logs to the L3 HA router transitions

    This patch adds extra logs to log current and expected ha state of the
    routers on various fake agents during the functional tests, log on which
    agent router is "primary" and where it is "backup" and when failover of
    the router should happen.
    Those logs should allow us better understand what happens during those
    functional tests and why some of them are failing from time to time.

    Related-Bug: #1956958
    Change-Id: I567036470b7256275f67e8ef3546ed780c81b5ae

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/824886

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/824886
Committed: https://opendev.org/openstack/neutron/commit/a43972ab9f7806a15ec1664129bf2c24073ede99
Submitter: "Zuul (22348)"
Branch: stable/xena

commit a43972ab9f7806a15ec1664129bf2c24073ede99
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jan 11 09:56:32 2022 +0100

    [Functional] Add extra logs to the L3 HA router transitions

    This patch adds extra logs to log current and expected ha state of the
    routers on various fake agents during the functional tests, log on which
    agent router is "primary" and where it is "backup" and when failover of
    the router should happen.
    Those logs should allow us better understand what happens during those
    functional tests and why some of them are failing from time to time.

    Related-Bug: #1956958
    Change-Id: I567036470b7256275f67e8ef3546ed780c81b5ae
    (cherry picked from commit f3e836217c2dd84b96938d292a3fe94f346ccdc8)

tags: added: in-stable-xena
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Today I checked the logs from the failed functional job https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_41c/830623/1/check/neutron-functional-with-uwsgi/41cadb3/testr_results.html. It seems to me that the problem is somewhere in the ip_monitor thread, because in journal.log I can see that keepalived changed the router's state to master:

Feb 23 15:08:39 ubuntu-focal-inmotion-iad3-0028579424 Keepalived_vrrp[235597]: (VR_1) Receive advertisement timeout
Feb 23 15:08:39 ubuntu-focal-inmotion-iad3-0028579424 Keepalived_vrrp[235597]: (VR_1) Entering MASTER STATE
Feb 23 15:08:39 ubuntu-focal-inmotion-iad3-0028579424 Keepalived_vrrp[235597]: (VR_1) setting VIPs.
Feb 23 15:08:39 ubuntu-focal-inmotion-iad3-0028579424 Keepalived_vrrp[235597]: (VR_1) setting E-VIPs.

But this was never noticed by the neutron-keepalived-state-change-monitor, so the router was not switched to "primary" on this agent and the test failed.
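
To illustrate why a missed address event leaves the agent on a stale state even though keepalived has already moved to MASTER, here is a rough sketch that derives an HA state from a stream of address events. The `AddressEvent` type, the device name, the VIP value, and the function name are all hypothetical; this is not the neutron-keepalived-state-change-monitor code, only a model of the idea that the state follows VIP add/remove events on the HA interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressEvent:
    """Hypothetical event: an address was added to or removed from a device."""
    device: str
    cidr: str
    added: bool


def ha_state_from_events(events, ha_device, vip_cidr, initial="backup"):
    """Return the last HA state implied by VIP add/remove events."""
    state = initial
    for event in events:
        if event.device != ha_device or event.cidr != vip_cidr:
            continue  # unrelated address change, ignore it
        state = "primary" if event.added else "backup"
    return state


if __name__ == "__main__":
    vip = "169.254.0.1/24"  # assumed VIP, for illustration only
    seen = [AddressEvent("ha-example", vip, added=True)]
    print(ha_state_from_events(seen, "ha-example", vip))  # primary
    # If the "added" event is never delivered, the state stays stale.
    print(ha_state_from_events([], "ha-example", vip))    # backup
```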

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/833434

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/833434
Committed: https://opendev.org/openstack/neutron/commit/4024168a0572519c0dd01cae0fe3e869a01ebf5a
Submitter: "Zuul (22348)"
Branch: master

commit 4024168a0572519c0dd01cae0fe3e869a01ebf5a
Author: Slawek Kaplonski <email address hidden>
Date: Fri Mar 11 17:03:24 2022 +0100

    Add extra logs to the ip_monitor class

    Those extra logs should tell more about what IP addresses are
    added/removed in the qrouter namespace by the keepalived process and
    hopefully help us understand failures in functional CI job,
    like are described in the related bug.

    Related-bug: #1956958
    Change-Id: I5e924922baffbf2e059f243b115ff799e8432a56

Changed in neutron:
importance: High → Critical
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

A patch that should hopefully improve this: https://review.opendev.org/c/openstack/neutron/+/836140

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/844314

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/844314
Committed: https://opendev.org/openstack/neutron/commit/d13da77107fd9e9166b891409376a67210f7f48b
Submitter: "Zuul (22348)"
Branch: master

commit d13da77107fd9e9166b891409376a67210f7f48b
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jun 1 17:15:42 2022 +0200

    Mark functional L3ha tests as unstable

    All functional tests which uses wait_until_ha_router_has_state() method
    are now marked as unstable so in case of timeout while waiting for
    router's state transition, job will not fail.

    Related-Bug: #1956958
    Change-Id: I0e5d08c1a9dc475c7b138c4934ef0331a4339a4c
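
The effect of marking a test as unstable can be sketched as a decorator that converts a failure into a skip, so a timeout no longer fails the job. The version below is a simplified illustration built on the standard `unittest` module; it is not copied from neutron's test base, and the example test class and message wording are assumptions.

```python
import functools
import unittest


def unstable_test(reason):
    """Turn any exception raised by the decorated method into a skip."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except Exception as exc:
                # skipTest raises unittest.SkipTest, so the runner
                # records a skip instead of a failure or error.
                self.skipTest("marked as unstable because of %s, "
                              "failure was: %s" % (reason, exc))
        return wrapper
    return decorator


class ExampleTest(unittest.TestCase):
    """Hypothetical test that always times out."""

    @unstable_test("bug 1956958")
    def test_failover(self):
        raise TimeoutError("Timed out after 60 seconds")


if __name__ == "__main__":
    unittest.main()
```

The same wrapper can decorate a helper method on the test class rather than the test itself, as long as the first argument is the `TestCase` instance. The trade-off is that real regressions in the decorated code path are also reported as skips.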

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/848585

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/848585
Committed: https://opendev.org/openstack/neutron/commit/e39011c73396af68fe5002cdbe4c6b3fe8b7cf23
Submitter: "Zuul (22348)"
Branch: master

commit e39011c73396af68fe5002cdbe4c6b3fe8b7cf23
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 4 11:09:23 2022 +0200

    Use common wait_until_ha_router_has_state method everywhere

    In the L3 functional tests framework module there is already helper
    method called wait_until_ha_router_has_state which should be used to
    wait for desired HA router's state.
    This method has proper debug logging added so debugging issues in CI is
    easier when it's used.
    It is also decorated with unstable_test decorator to skip tests when
    router will fail to transition to desired state (see related bug for
    details).

    In some tests this method wasn't used so we couldn't benefit from the
    logging and unstable_test decorator there. Now it should be unified and
    used everywhere in the same way.

    Related-Bug: #1956958
    Change-Id: I9d79b123bb20ded327208d84a14d4f8d2e505087

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/864985

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/864986

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/864985
Committed: https://opendev.org/openstack/neutron/commit/de89581ace06c74b090bf25f3408a3d3dccf56fb
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit de89581ace06c74b090bf25f3408a3d3dccf56fb
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jun 1 17:15:42 2022 +0200

    Mark functional L3ha tests as unstable

    All functional tests which uses wait_until_ha_router_has_state() method
    are now marked as unstable so in case of timeout while waiting for
    router's state transition, job will not fail.

    Related-Bug: #1956958
    Change-Id: I0e5d08c1a9dc475c7b138c4934ef0331a4339a4c
    (cherry picked from commit d13da77107fd9e9166b891409376a67210f7f48b)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/864986
Committed: https://opendev.org/openstack/neutron/commit/a0363177754f28b3c9353cac65f07c5e6d487d1e
Submitter: "Zuul (22348)"
Branch: stable/xena

commit a0363177754f28b3c9353cac65f07c5e6d487d1e
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jun 1 17:15:42 2022 +0200

    Mark functional L3ha tests as unstable

    All functional tests which uses wait_until_ha_router_has_state() method
    are now marked as unstable so in case of timeout while waiting for
    router's state transition, job will not fail.

    Related-Bug: #1956958
    Change-Id: I0e5d08c1a9dc475c7b138c4934ef0331a4339a4c
    (cherry picked from commit d13da77107fd9e9166b891409376a67210f7f48b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/872007

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/872098

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/872008

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/872009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/872099

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872008
Committed: https://opendev.org/openstack/neutron/commit/6ef9d235d2f9ba764f96d32741fffc28c80076f2
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 6ef9d235d2f9ba764f96d32741fffc28c80076f2
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 4 11:09:23 2022 +0200

    Use common wait_until_ha_router_has_state method everywhere

    In the L3 functional tests framework module there is already helper
    method called wait_until_ha_router_has_state which should be used to
    wait for desired HA router's state.
    This method has proper debug logging added so debugging issues in CI is
    easier when it's used.
    It is also decorated with unstable_test decorator to skip tests when
    router will fail to transition to desired state (see related bug for
    details).

    In some tests this method wasn't used so we couldn't benefit from the
    logging and unstable_test decorator there. Now it should be unified and
    used everywhere in the same way.

    Related-Bug: #1956958
    Change-Id: I9d79b123bb20ded327208d84a14d4f8d2e505087
    (cherry picked from commit e39011c73396af68fe5002cdbe4c6b3fe8b7cf23)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872009
Committed: https://opendev.org/openstack/neutron/commit/fc62d1ea8ef5024cd1b851dab2f765ba425279c3
Submitter: "Zuul (22348)"
Branch: stable/xena

commit fc62d1ea8ef5024cd1b851dab2f765ba425279c3
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 4 11:09:23 2022 +0200

    Use common wait_until_ha_router_has_state method everywhere

    In the L3 functional tests framework module there is already helper
    method called wait_until_ha_router_has_state which should be used to
    wait for desired HA router's state.
    This method has proper debug logging added so debugging issues in CI is
    easier when it's used.
    It is also decorated with unstable_test decorator to skip tests when
    router will fail to transition to desired state (see related bug for
    details).

    In some tests this method wasn't used so we couldn't benefit from the
    logging and unstable_test decorator there. Now it should be unified and
    used everywhere in the same way.

    Related-Bug: #1956958
    Change-Id: I9d79b123bb20ded327208d84a14d4f8d2e505087
    (cherry picked from commit e39011c73396af68fe5002cdbe4c6b3fe8b7cf23)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872007
Committed: https://opendev.org/openstack/neutron/commit/40a3347276a453f8d930a2cd8437d86fa9548b5d
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 40a3347276a453f8d930a2cd8437d86fa9548b5d
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jan 11 09:56:32 2022 +0100

    [Functional] Add extra logs to the L3 HA router transitions

    This patch adds extra logs to log current and expected ha state of the
    routers on various fake agents during the functional tests, log on which
    agent router is "primary" and where it is "backup" and when failover of
    the router should happen.
    Those logs should allow us better understand what happens during those
    functional tests and why some of them are failing from time to time.

    Related-Bug: #1956958
    Change-Id: I567036470b7256275f67e8ef3546ed780c81b5ae
    (cherry picked from commit f3e836217c2dd84b96938d292a3fe94f346ccdc8)
    (cherry picked from commit a43972ab9f7806a15ec1664129bf2c24073ede99)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872098
Committed: https://opendev.org/openstack/neutron/commit/858caf578ad13f359f46c29d6b48e42b2ddaab79
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 858caf578ad13f359f46c29d6b48e42b2ddaab79
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jun 1 17:15:42 2022 +0200

    Mark functional L3ha tests as unstable

    All functional tests which uses wait_until_ha_router_has_state() method
    are now marked as unstable so in case of timeout while waiting for
    router's state transition, job will not fail.

    Related-Bug: #1956958
    Change-Id: I0e5d08c1a9dc475c7b138c4934ef0331a4339a4c
    (cherry picked from commit d13da77107fd9e9166b891409376a67210f7f48b)
    (cherry picked from commit a0363177754f28b3c9353cac65f07c5e6d487d1e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872099
Committed: https://opendev.org/openstack/neutron/commit/4213c8da94a6acd32eb492c98d8c145c44f0bf8f
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 4213c8da94a6acd32eb492c98d8c145c44f0bf8f
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 4 11:09:23 2022 +0200

    Use common wait_until_ha_router_has_state method everywhere

    In the L3 functional tests framework module there is already helper
    method called wait_until_ha_router_has_state which should be used to
    wait for desired HA router's state.
    This method has proper debug logging added so debugging issues in CI is
    easier when it's used.
    It is also decorated with unstable_test decorator to skip tests when
    router will fail to transition to desired state (see related bug for
    details).

    In some tests this method wasn't used so we couldn't benefit from the
    logging and unstable_test decorator there. Now it should be unified and
    used everywhere in the same way.

    Related-Bug: #1956958
    Change-Id: I9d79b123bb20ded327208d84a14d4f8d2e505087
    (cherry picked from commit e39011c73396af68fe5002cdbe4c6b3fe8b7cf23)
    (cherry picked from commit fc62d1ea8ef5024cd1b851dab2f765ba425279c3)
