DVR HA router gets stuck in backup state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Medium | Unassigned |
Bug Description
We found an issue where a newly created HA DVR router gets stuck in the backup state and never transitions to the primary state.
Preconditions:
1) there is no router on the specific external network yet
2) a router goes through a quick create->delete cycle; the next creation of a router can then get stuck in the backup state
The reason for this behavior is a fip-ns that is not removed on the agent while the floatingip_
Below is a demo with which I managed to reproduce this behavior on a single-node devstack setup.
Create a router and quickly delete it while the l3 agent is processing the external GW addition:
[root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process
[root@devstack ~]# count_fip_
[root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw
[root@devstack ~]# fip_requests=
waiting before deletion...
waiting before deletion...
[root@devstack ~]#
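The exact helper commands above were truncated when this report was filed; a minimal sketch of the same idea, assuming $r_id from the first command and the same public external network, could look like this:
[root@devstack ~]# # hedged sketch, not the original script: attach the gateway, then delete the router
[root@devstack ~]# # while the l3 agent is still processing the gateway addition
[root@devstack ~]# openstack router set "$r_id" --external-gateway public
[root@devstack ~]# echo "waiting before deletion..."; sleep 2
[root@devstack ~]# openstack router delete "$r_id"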
As a result, the fip-ns is not deleted even though the floatingip_
[root@devstack ~]# ip netns
fip-8d4bc2d5-
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
<empty>
[root@devstack ~]#
Now re-create the router together with the external GW:
[root@devstack ~]# openstack router create r1 --ha --distributed --external-gateway public
In the logs, one can see a traceback showing that the creation of this router failed initially, followed by a successful creation:
ERROR neutron. [traceback truncated, ~30 lines]
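The traceback lines above are truncated; on a systemd-based devstack the full text can usually be recovered from the l3 agent journal. The unit name below is an assumption and may differ (e.g. devstack@neutron-l3):
[root@devstack ~]# # hedged: pull the recent l3 agent errors with some context; adjust the unit name to your setup
[root@devstack ~]# journalctl -u devstack@q-l3 --since "10 minutes ago" | grep -B 2 -A 20 "ERROR neutron"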
The result is the following state:
[root@devstack ~]# ip netns
fip-8d4bc2d5-
qrouter-
snat-1f384e52-
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
+------
| ID | Device Owner | Status |
+------
| 17679644-
| b489f216-
+------
[root@devstack ~]#
[root@devstack ~]# cat /opt/stack/
backup
[root@devstack ~]# stat /opt/stack/
...
Access: 2023-01-19 11:10:10.715245690 -0500
Modify: 2023-01-19 11:10:18.976208238 -0500
Change: 2023-01-19 11:10:18.976208238 -0500
Birth: 2023-01-19 11:10:10.715245690 -0500
[root@devstack ~]# stat /var/run/
...
Access: 2023-01-19 11:10:18.499210400 -0500
Modify: 2023-01-19 11:10:18.499210400 -0500
Change: 2023-01-19 11:10:18.499210400 -0500
Birth: -
[root@devstack ~]#
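The file paths above are truncated; a hedged way to locate the same files on a devstack node, assuming the default state path under /opt/stack/data and that r_id is set to the UUID of the re-created router, is:
[root@devstack ~]# # hedged: list the router-related state files and their modification times
[root@devstack ~]# find /opt/stack/data /var/run -path "*${r_id}*" 2>/dev/null | xargs -r stat -c '%n  %y'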
From the timestamps we can see that the keepalived monitoring started before the snat-ns was re-created after the unsuccessful first attempt to create the router.
So it looks like the keepalived monitoring is still bound to the deleted snat-ns that was created during the previous, unsuccessful attempt to create the router.
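A hedged way to support this on the node, assuming the monitor runs as a separate neutron-keepalived-state-change process as on this devstack, is to compare the monitor's start time with the namespace timestamps and check which namespace the keepalived processes are attached to:
[root@devstack ~]# # hedged: show the state-change monitor processes with their start times
[root@devstack ~]# ps -o pid,lstart,cmd -p "$(pgrep -d, -f keepalived-state-change)"
[root@devstack ~]# # hedged: a keepalived process whose namespace was deleted will not resolve here
[root@devstack ~]# for p in $(pgrep keepalived); do echo -n "$p: "; ip netns identify "$p"; done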
Adding an external GW and removing the router is a race condition, so it is not always possible to get a 100% reproduction. To reproduce it every time, just add a small sleep with the following patch:
[root@devstack neutron]# git diff
diff --git a/neutron/
index 6e37c09511.
--- a/neutron/
+++ b/neutron/
@@ -837,6 +837,8 @@ class DvrLocalRouter(
+ import time
+ time.sleep(5)
def update_
[root@devstack neutron]#
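After applying a delay like this, the l3 agent has to be restarted so the patched code is loaded before re-running the reproduction steps above. The systemd unit name is an assumption and may differ between devstack versions (e.g. devstack@neutron-l3):
[root@devstack neutron]# # hedged: reload the patched l3 agent, then repeat the create/delete sequence from the top
[root@devstack neutron]# systemctl restart devstack@q-l3
[root@devstack neutron]# systemctl is-active devstack@q-l3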
description: updated
tags: added: l3-ha
Changed in neutron:
importance: Undecided → Medium
Hi, thanks for the detailed reproduction and explanation.
I tried to reproduce the issue with a fresh devstack (I tried it several times), but for me everything was OK. Could you give some more details?