[dvr+l3ha] north-south traffic not working when VM and main router are not on the same host
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Triaged | High | Unassigned |
Bug Description
Some newly created VMs cannot reach "outside" resources (e.g. apt repositories) in an l3ha + dvr env. The problem is easy to reproduce whenever the VM and the main router are not on the same host: the 'apt update' command cannot run inside the VM, so north-south traffic is broken.
Here are steps to easily reproduce it.
1, set up a wallaby or ussuri vrrp + dvr env (it works on train, but not on ussuri or wallaby)
2, create a test vm, query host by: nova show <VM> |grep host
3, query main router by: neutron l3-agent-
4, make sure VM and main router are not on the same host
5, on the main router host, this command will fail: ip netns exec snat-xxx ping <VM-IP> -c1
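Steps 2 to 4 boil down to reading a host name out of a CLI table and comparing two names. A minimal sketch of that check follows; the helper names table_value and hosts_differ are made up for illustration, and the field name OS-EXT-SRV-ATTR:host is an assumption about what 'nova show' prints for the compute host on a given deployment.

```shell
# Sketch only: helper names are hypothetical, not part of any OpenStack CLI.

# table_value FIELD: read a "| Field | Value |" style table on stdin and
# print the value of the first row whose first column equals FIELD.
table_value() {
    awk -F'|' -v field="$1" '
        NF >= 3 {
            key = $2; val = $3
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim padding around the value
            if (key == field) { print val; exit }
        }'
}

# hosts_differ A B: succeed (exit 0) when the two host names are not equal,
# i.e. when the reproduction condition from step 4 holds.
hosts_differ() {
    [ "$1" != "$2" ]
}
```

For example, vm_host="$(nova show <VM> | table_value OS-EXT-SRV-ATTR:host)" would capture the VM's host, and hosts_differ "$vm_host" "$router_host" would then confirm the setup matches step 4, with router_host taken from the output of step 3.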
I did some bisecting and found:
15.3.4 (bionic-train) - no problem
1c2e10f859 - no problem
16.4.0 (bionic-ussuri) - has problem
16.0.0-0ubuntu3 - has problem, and also has the multiple active routers problem
16.0.0~
16.1.0 (focal) - has problem, and also has the multiple active routers problem
16.2.0 (focal) - has problem
16.3.0 (focal) - has problem
16.4.0 (focal-ussuri) - has problem
focal-wallaby - has problem
Because I often hit the multiple standby issue with some commit ids (e.g. 14dd3e95ca), I can't continue bisecting.
I also used 'ovs-appctl ofproto/trace' and tcpdump for debugging; the results are as follows.
train - works
sg-xxx -> vm - https:/
tcpdump on sg-xxx - https:/
tcpdump on vm's tap - https:/
tcpdump on qr-xxx - https:/
ussuri - does not work
sg-xxx -> vm - https:/
tcpdump on sg-xxx - https:/
tcpdump on vm's tap - https:/
tcpdump on qr-xxx - https:/
It looks like the VM can't get an ARP reply for the sg-xxx interface.
description: | updated |
tags: | added: sts |
summary: |
- north-south traffic not working when VM and main router are not on the same host
+ [dvr+l3ha] north-south traffic not working when VM and main router are not on the same host |
I think I'm able to reproduce this on master (neutron commit ae4d8a0c20). I used a two-host ml2/ovs devstack setup:
devstack0 - all in one
local.conf excerpt:
[[local|localrc]]
Q_DVR_MODE=dvr_snat
[[post-config|/etc/neutron/neutron.conf]]
[DEFAULT]
router_distributed = True
l3_ha = True
l3_ha_net_cidr = 169.254.192.0/18
max_l3_agents_per_router = 2
[[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]]
enable_distributed_routing = True
l2_population = True
[[post-config|/etc/neutron/l3_agent.ini]]
[DEFAULT]
agent_mode = dvr_snat
ha_vrrp_auth_password = password
ha_vrrp_health_check_interval = 0
devstack0a - compute
local.conf excerpt:
[[local|localrc]]
Q_DVR_MODE=dvr
[[post-config|/etc/neutron/neutron.conf]]
[DEFAULT]
router_distributed = True
[[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]]
[agent]
enable_distributed_routing = True
l2_population = True
[[post-config|/etc/neutron/l3_agent.ini]]
[DEFAULT]
agent_mode = dvr
Then opened up the default security group totally:
project_id="$( openstack project show "$OS_PROJECT_NAME" | awk '/ id / { print $4 }' )"
default_sg_id="$( neutron security-group-list --tenant-id "$project_id" | awk '/ default / { print $2 }' )"
openstack security group rule list "$default_sg_id"
openstack security group rule list "$default_sg_id" | egrep -w None | egrep -wv 'None.*None.*None' | awk '{ print $2 }' | xargs -r openstack security group rule delete
neutron security-group-rule-create --direction ingress --ethertype IPv4 "$default_sg_id"
neutron security-group-rule-create --direction ingress --ethertype IPv6 "$default_sg_id"
openstack security group rule list "$default_sg_id"
devstack's default router1 was indeed in dvr+l3ha mode:
$ openstack router show router1 -f table -c ha -c distributed
+-------------+-------+
| Field       | Value |
+-------------+-------+
| distributed | True  |
| ha          | True  |
+-------------+-------+
Booted a vm on the connected private network:
$ openstack server create --image cirros-0.5.2-x86_64-disk --flavor cirros256 --nic net-id=private --availability-zone :devstack0a vm0 --wait
Took its address and pinged it:
$ openstack server show vm0 -f yaml -c addresses
$ sudo ip netns exec snat-$( openstack router show router1 -f value -c id ) ping -c3 10.0.0.55
And got no response.
While pinging on the relevant subnet's sg interface tcpdump got this:
$ sudo ip netns exec snat-$( openstack router show router1 -f value -c id ) tcpdump -i sg-7a37d0b0-e6 -n -vvv
tcpdump: listening on sg-7a37d0b0-e6, link-type EN10MB (Ethernet), capture size 262144 bytes
^C13:03:57.204512 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.0.55 tell 10.0.0.45, length 28
13:03:58.228329 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.0.55 tell 10.0.0.45, length 28
13:03:59.252240 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.0.55 tell 10.0.0.45, length 28
13:04:00.276460 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.0.55 tell 10.0.0.45, length 28
13:04:01.300116 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.0.55 tell 10.0.0.45, length 28
5 packets captured
5 packets received by filter
0 packets dropped by kernel
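The capture above shows ARP requests for the VM's address with no replies. When comparing captures across releases, that pattern can be counted rather than eyeballed. A minimal sketch, assuming tcpdump's usual text format for ARP lines ("Request who-has <ip> tell ..." and "Reply <ip> is-at ..."); the function name arp_summary is made up for illustration.

```shell
# Sketch only: arp_summary is a hypothetical helper, not a tcpdump feature.

# arp_summary IP: read tcpdump text output on stdin and print
# "requests=N replies=M" for ARP requests asking for IP and ARP replies
# announcing IP.
arp_summary() {
    awk -v ip="$1" '
        # The trailing space after the address stops 10.0.0.5 matching 10.0.0.55.
        / ARP, / && index($0, "Request who-has " ip " ") { req++ }
        / ARP, / && index($0, "Reply " ip " is-at ")      { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }'
}
```

Fed the capture above with 10.0.0.55 as the argument, it would print requests=5 replies=0. A working setup should show at least one reply.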