ovs flooding packets, not learning MAC addresses
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Undecided | Unassigned |
neutron (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Hi,
I'm running OpenStack Rocky on Ubuntu 18.04, with dvr_snat and L3HA, and using the openvswitch firewall driver. openvswitch version 2.10.0-
I was doing load testing by creating a bunch of instances, and noticed that the network throughput available to instances dropped dramatically as I created more VMs. In other words, with 2 VMs on my cloud I had pretty good bandwidth, but with 100 (idle) VMs, bandwidth became ridiculously low.
Investigating the problem, I noticed that ovs was flooding traffic: all instances on a hypervisor were receiving all the traffic destined to any VM on another hypervisor.
For example, with vmA1 and vmA2 on hypervisor A and vmB1 on hypervisor B, TCP traffic between vmA1 and vmB1 could be seen on vmA2.
Digging more into this, I think I located the problem in the ovs MAC learning process, more specifically on br-int (using "sudo ovs-appctl fdb/show br-int").
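To make the check above repeatable, here is a minimal Python sketch that parses text in the layout I understand "ovs-appctl fdb/show" to print (a header line, then port, VLAN, MAC and age columns) and reports whether a given MAC has been learned. The column layout is an assumption, and the sample table, MAC addresses and port numbers below are made up for illustration; they are not output from the affected system.

```python
import re

# Matches a colon-separated MAC address such as fa:16:3e:aa:aa:01.
MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$")


def parse_fdb(text):
    """Return {mac: (port, vlan, age)} from fdb/show-style text.

    Assumes four whitespace-separated columns: port, VLAN, MAC, age.
    Lines that do not fit that shape (such as the header) are skipped.
    """
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not MAC_RE.match(fields[2].lower()):
            continue
        port, vlan, mac, age = fields
        # The port is kept as a string because it is not always numeric.
        table[mac.lower()] = (port, int(vlan), int(age))
    return table


def is_learned(text, mac):
    """True if the MAC appears in the parsed forwarding table."""
    return mac.lower() in parse_fdb(text)


# Hypothetical sample data, for illustration only.
SAMPLE = """\
 port  VLAN  MAC                Age
    5     1  fa:16:3e:aa:aa:01    3
    6     1  fa:16:3e:aa:aa:02   12
"""

if __name__ == "__main__":
    print(is_learned(SAMPLE, "fa:16:3e:aa:aa:01"))
    print(is_learned(SAMPLE, "fa:16:3e:bb:bb:01"))
```

On a real hypervisor, the text would come from running the fdb/show command against br-int and passing its output to `is_learned` along with the MAC of the remote VM.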
Traffic flow from vmA1 to vmB1, on hypervisor A, looks like: tap (on br-int), patch-tun (on br-int), patch-int (on br-tun), vxlan to hypervisor B.
So when traffic comes back (the other way around), the MAC address of vmB1 should be learned on br-int, on the patch-tun port, and that is not the case. As a result, whenever vmA1 sends traffic to vmB1, at some point it reaches the "NORMAL" action, and since the destination MAC has not been learned, the traffic gets flooded: see ofproto/trace https:/
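The link between "MAC not learned" and "traffic flooded" can be shown with a toy model of a learning switch. This is a simplified sketch of generic L2 learning, not OVS's actual implementation of the "NORMAL" action. The port names mirror the ones above, the short MAC labels are placeholders, and the `learn` flag is a hypothetical stand-in for whatever prevents the return traffic from updating the table in my setup.

```python
class LearningSwitch:
    """Toy L2 learning switch: learn source MACs, flood unknown destinations."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac, learn=True):
        """Process one frame and return the list of output ports."""
        if learn:
            # Normal behaviour: remember which port the source lives behind.
            self.fdb[src_mac] = in_port
        out_port = self.fdb.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress one.
            return sorted(self.ports - {in_port})
        # Known destination: forward to one port (or drop if it is the ingress).
        return [] if out_port == in_port else [out_port]


if __name__ == "__main__":
    sw = LearningSwitch(["tap-vmA1", "tap-vmA2", "patch-tun"])

    # vmA1 -> vmB1 before anything is known about vmB1: flooded.
    print(sw.receive("tap-vmA1", "A1", "B1"))

    # Reply from vmB1 arrives on patch-tun but is NOT learned (the reported case).
    sw.receive("patch-tun", "B1", "A1", learn=False)
    print(sw.receive("tap-vmA1", "A1", "B1"))  # still flooded, vmA2 sees it

    # Reply is learned (the expected case): traffic now goes to patch-tun only.
    sw.receive("patch-tun", "B1", "A1")
    print(sw.receive("tap-vmA1", "A1", "B1"))
```

In this model, as long as the reply from vmB1 never updates the table, every frame from vmA1 to vmB1 is also delivered to tap-vmA2, which matches what I see on vmA2.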
Looking further, it would appear that ovs learns a MAC address only from broadcast ARP requests, and not from ARP requests sent to a unicast destination MAC address (which is what Linux uses after a successful broadcast ARP request): https:/
Once the MAC is learned, there's no more flooding: https:/
Flooding has security consequences (VMs can see traffic not destined to them - although only traffic for VMs in the same neutron network), and performance consequences, so it should be avoided.
Thanks
An additional data point: MAC learning appears to work fine for subnets not attached to a router. As soon as I attach the subnet to a router, the bad behaviour starts.