Activity log for bug #1794569

Date Who What changed Old value New value Message
2018-09-26 17:01:42 Peter Slovak bug added bug
2018-09-26 17:57:42 Peter Slovak description

Old value:

Neutron version: 9.4.1 (EOL, but bug may still be present)
Network scenario: Openvswitch with DVR
Openvswitch version: 2.6.1
OpenStack installation version: Newton
Operating system: Ubuntu 16.04.5 LTS
Kernel: 4.4.0-135 x86_64

Symptoms:
Instances whose default gateway is a DVR interface (10.10.255.1 in our case) occasionally lose connectivity to non-local networks. Meaning, any packet that had to pass through the local virtual router is dropped. Sometimes this behavior lasts for a few milliseconds, sometimes tens of seconds. Since floating-ip traffic is a subset of those cases, north-south connectivity breaks too.

Steps to reproduce:
- Use DVR routing mode
- Configure at least one static route in the virtual router, whose next hop is NOT an address managed by Neutron (e.g. a physical interface on a VPN gateway; in our case 10.2.0.0/24 with next-hop 10.10.0.254)
- Have an instance plugged into a Flat or VLAN network, use the virtual router as the default gateway
- Try to reach a host inside the statically-routed network from within the instance

Possible explanation:
Distributed routers get their ARP caches populated by neutron-l3-agent at its startup. The agent takes all the ports in a given subnet and fills in their IP-to-MAC mappings inside the qrouter- namespace, as permanent entries (meaning they won't expire from the cache). However, if Neutron doesn't manage an IP (as is the case with our static route's next-hop 10.10.0.254), a permanent record isn't created, naturally.

So when we try to reach a host in the statically-routed network (e.g. 10.2.0.10) from inside the instance, the packet goes to the default gateway (10.10.255.1). After it arrives at the qrouter- namespace, there is a static route for this host pointing to 10.10.0.254 as next-hop. However, qrouter- doesn't have its MAC address, so it sends out an ARP request with the source MAC of the distributed router's qr- interface.

And that's the problem. Since ARP requests are usually broadcasts, they land on pretty much every hypervisor in the network within the same VLAN. Combined with the fact that qr- interfaces in a given qrouter- namespace have the same MAC address on every host, this leads to a disaster: every integration bridge will receive that ARP request on the port that connects it to the Flat/VLAN network and learn that the qr- interface's MAC address is actually there - not on the qr- port also attached to br-int. From this moment on, packets from instances that need to pass via qrouter- are forwarded to the Flat/VLAN network interface, circumventing the qrouter- namespace. This is especially problematic with traffic that needs to be SNAT-ed on its way out.

Workarounds:
- The workaround that we used is creating stub Neutron ports for next-hop addresses, with correct MACs. After restarting neutron-l3-agents, they got populated into the qrouter- ARP cache as permanent entries.
- Another workaround might consist of using ebtables/arptables on hypervisors to block incoming ARP requests from qrouters.

Possible long-term sloution:
Maybe it would help if ancillary bridges (those connecting Flat/VLAN network interfaces to br-int) contained an OVS flow that drops ARP requests with source MAC addresses of qr- interfaces originating from the physical interface. Since their IPs and MACs are well defined (their device_owner is "network:router_interface_distributed"), it shouldn't be a problem setting these flows up. However I'm not sure of the shortcomings of this approach.

New value:

Neutron version: 9.4.1 (EOL, but bug may still be present)
Network scenario: Openvswitch with DVR
Openvswitch version: 2.6.1
OpenStack installation version: Newton
Operating system: Ubuntu 16.04.5 LTS
Kernel: 4.4.0-135 x86_64

Symptoms:
Instances whose default gateway is a DVR interface (10.10.255.1 in our case) occasionally lose connectivity to non-local networks. Meaning, any packet that had to pass through the local virtual router is dropped. Sometimes this behavior lasts for a few milliseconds, sometimes tens of seconds. Since floating-ip traffic is a subset of those cases, north-south connectivity breaks too.

Steps to reproduce:
- Use DVR routing mode
- Configure at least one static route in the virtual router, whose next hop is NOT an address managed by Neutron (e.g. a physical interface on a VPN gateway; in our case 10.2.0.0/24 with next-hop 10.10.0.254)
- Have an instance plugged into a Flat or VLAN network, use the virtual router as the default gateway
- Try to reach a host inside the statically-routed network from within the instance

Possible explanation:
Distributed routers get their ARP caches populated by neutron-l3-agent at its startup. The agent takes all the ports in a given subnet and fills in their IP-to-MAC mappings inside the qrouter- namespace, as permanent entries (meaning they won't expire from the cache). However, if Neutron doesn't manage an IP (as is the case with our static route's next-hop 10.10.0.254), a permanent record isn't created, naturally.

So when we try to reach a host in the statically-routed network (e.g. 10.2.0.10) from inside the instance, the packet goes to the default gateway (10.10.255.1). After it arrives at the qrouter- namespace, there is a static route for this host pointing to 10.10.0.254 as next-hop. However, qrouter- doesn't have its MAC address, so it sends out an ARP request with the source MAC of the distributed router's qr- interface.

And that's the problem. Since ARP requests are usually broadcasts, they land on pretty much every hypervisor in the network within the same VLAN. Combined with the fact that qr- interfaces in a given qrouter- namespace have the same MAC address on every host, this leads to a disaster: every integration bridge will receive that ARP request on the port that connects it to the Flat/VLAN network and learn that the qr- interface's MAC address is actually there - not on the qr- port also attached to br-int. From this moment on, packets from instances that need to pass via qrouter- are forwarded to the Flat/VLAN network interface, circumventing the qrouter- namespace. This is especially problematic with traffic that needs to be SNAT-ed on its way out.

Workarounds:
- The workaround that we used is creating stub Neutron ports for next-hop addresses, with correct MACs. After restarting neutron-l3-agents, they got populated into the qrouter- ARP cache as permanent entries.
- The next option is setting the static route in the instances' routing tables instead of the virtual router. This way it's the instance that makes ARP discovery and not the qrouter- namespace.
- Another workaround might consist of using ebtables/arptables on hypervisors to block incoming ARP requests from qrouters.

Possible long-term solution:
Maybe it would help if ancillary bridges (those connecting Flat/VLAN network interfaces to br-int) contained an OVS flow that drops ARP requests with source MAC addresses of qr- interfaces originating from the physical interface. Since their IPs and MACs are well defined (their device_owner is "network:router_interface_distributed"), it shouldn't be a problem setting these flows up. However I'm not sure of the shortcomings of this approach.
2018-09-26 19:56:05 Nate Johnston neutron: status: New -> Invalid
2018-09-27 03:59:40 Swaminathan Vasudevan tags: drop dvr route static traffic -> drop dvr l3-dvr-backlog route static traffic
2019-10-03 18:28:27 Peter Slovak tags: drop dvr l3-dvr-backlog route static traffic -> arp drop dvr fip floatingip l3-dvr-backlog route static traffic vlan
2019-10-03 20:03:34 Peter Slovak neutron: status: Invalid -> New
2019-10-03 20:04:37 Peter Slovak description

Old value:

Neutron version: 9.4.1 (EOL, but bug may still be present)
Network scenario: Openvswitch with DVR
Openvswitch version: 2.6.1
OpenStack installation version: Newton
Operating system: Ubuntu 16.04.5 LTS
Kernel: 4.4.0-135 x86_64

Symptoms:
Instances whose default gateway is a DVR interface (10.10.255.1 in our case) occasionally lose connectivity to non-local networks. Meaning, any packet that had to pass through the local virtual router is dropped. Sometimes this behavior lasts for a few milliseconds, sometimes tens of seconds. Since floating-ip traffic is a subset of those cases, north-south connectivity breaks too.

Steps to reproduce:
- Use DVR routing mode
- Configure at least one static route in the virtual router, whose next hop is NOT an address managed by Neutron (e.g. a physical interface on a VPN gateway; in our case 10.2.0.0/24 with next-hop 10.10.0.254)
- Have an instance plugged into a Flat or VLAN network, use the virtual router as the default gateway
- Try to reach a host inside the statically-routed network from within the instance

Possible explanation:
Distributed routers get their ARP caches populated by neutron-l3-agent at its startup. The agent takes all the ports in a given subnet and fills in their IP-to-MAC mappings inside the qrouter- namespace, as permanent entries (meaning they won't expire from the cache). However, if Neutron doesn't manage an IP (as is the case with our static route's next-hop 10.10.0.254), a permanent record isn't created, naturally.

So when we try to reach a host in the statically-routed network (e.g. 10.2.0.10) from inside the instance, the packet goes to the default gateway (10.10.255.1). After it arrives at the qrouter- namespace, there is a static route for this host pointing to 10.10.0.254 as next-hop. However, qrouter- doesn't have its MAC address, so it sends out an ARP request with the source MAC of the distributed router's qr- interface.

And that's the problem. Since ARP requests are usually broadcasts, they land on pretty much every hypervisor in the network within the same VLAN. Combined with the fact that qr- interfaces in a given qrouter- namespace have the same MAC address on every host, this leads to a disaster: every integration bridge will receive that ARP request on the port that connects it to the Flat/VLAN network and learn that the qr- interface's MAC address is actually there - not on the qr- port also attached to br-int. From this moment on, packets from instances that need to pass via qrouter- are forwarded to the Flat/VLAN network interface, circumventing the qrouter- namespace. This is especially problematic with traffic that needs to be SNAT-ed on its way out.

Workarounds:
- The workaround that we used is creating stub Neutron ports for next-hop addresses, with correct MACs. After restarting neutron-l3-agents, they got populated into the qrouter- ARP cache as permanent entries.
- The next option is setting the static route in the instances' routing tables instead of the virtual router. This way it's the instance that makes ARP discovery and not the qrouter- namespace.
- Another workaround might consist of using ebtables/arptables on hypervisors to block incoming ARP requests from qrouters.

Possible long-term solution:
Maybe it would help if ancillary bridges (those connecting Flat/VLAN network interfaces to br-int) contained an OVS flow that drops ARP requests with source MAC addresses of qr- interfaces originating from the physical interface. Since their IPs and MACs are well defined (their device_owner is "network:router_interface_distributed"), it shouldn't be a problem setting these flows up. However I'm not sure of the shortcomings of this approach.

New value:

Neutron version: 10.0.7
Network scenario: Openvswitch with DVR
Openvswitch version: 2.6.1
OpenStack installation version: Ocata
Operating system: Ubuntu 16.04.5 LTS
Kernel: 4.4.0-135 x86_64

Symptoms:
Instances whose default gateway is a DVR interface (10.10.255.1 in our case) occasionally lose connectivity to non-local networks. Meaning, any packet that had to pass through the local virtual router is dropped. Sometimes this behavior lasts for a few milliseconds, sometimes tens of seconds. Since floating-ip traffic is a subset of those cases, north-south connectivity breaks too.

Steps to reproduce:
- Use DVR routing mode
- Configure at least one static route in the virtual router, whose next hop is NOT an address managed by Neutron (e.g. a physical interface on a VPN gateway; in our case 10.2.0.0/24 with next-hop 10.10.0.254)
- Have an instance plugged into a Flat or VLAN network, use the virtual router as the default gateway
- Try to reach a host inside the statically-routed network from within the instance

Possible explanation:
Distributed routers get their ARP caches populated by neutron-l3-agent at its startup. The agent takes all the ports in a given subnet and fills in their IP-to-MAC mappings inside the qrouter- namespace, as permanent entries (meaning they won't expire from the cache). However, if Neutron doesn't manage an IP (as is the case with our static route's next-hop 10.10.0.254), a permanent record isn't created, naturally.

So when we try to reach a host in the statically-routed network (e.g. 10.2.0.10) from inside the instance, the packet goes to the default gateway (10.10.255.1). After it arrives at the qrouter- namespace, there is a static route for this host pointing to 10.10.0.254 as next-hop. However, qrouter- doesn't have its MAC address, so it sends out an ARP request with the source MAC of the distributed router's qr- interface.

And that's the problem. Since ARP requests are usually broadcasts, they land on pretty much every hypervisor in the network within the same VLAN. Combined with the fact that qr- interfaces in a given qrouter- namespace have the same MAC address on every host, this leads to a disaster: every integration bridge will receive that ARP request on the port that connects it to the Flat/VLAN network and learn that the qr- interface's MAC address is actually there - not on the qr- port also attached to br-int. From this moment on, packets from instances that need to pass via qrouter- are forwarded to the Flat/VLAN network interface, circumventing the qrouter- namespace. This is especially problematic with traffic that needs to be SNAT-ed on its way out.

Workarounds:
- The workaround that we used is creating stub Neutron ports for next-hop addresses, with correct MACs. After restarting neutron-l3-agents, they got populated into the qrouter- ARP cache as permanent entries.
- The next option is setting the static route in the instances' routing tables instead of the virtual router. This way it's the instance that makes ARP discovery and not the qrouter- namespace.
- Another workaround might consist of using ebtables/arptables on hypervisors to block incoming ARP requests from qrouters.

Possible long-term solution:
Maybe it would help if ancillary bridges (those connecting Flat/VLAN network interfaces to br-int) contained an OVS flow that drops ARP requests with source MAC addresses of qr- interfaces originating from the physical interface. Since their IPs and MACs are well defined (their device_owner is "network:router_interface_distributed"), it shouldn't be a problem setting these flows up. However I'm not sure of the shortcomings of this approach.
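
Illustration (not part of the original report): the MAC-learning failure described under "Possible explanation", and the effect of the drop rule suggested under "Possible long-term solution", can be sketched with a toy learning bridge in plain Python. This is only a model of the behavior as the report describes it, not Neutron or Open vSwitch code. The class LearningBridge, the port names "qr-port" and "phys-port", and the MAC address are hypothetical names chosen for the example.

```python
# Toy model of a MAC-learning bridge; not Neutron or Open vSwitch code.
# All names and addresses below are made up for illustration.

QR_MAC = "fa:16:3e:00:00:01"  # hypothetical MAC shared by the qr- interface on every host


class LearningBridge:
    """Learns which port a source MAC was last seen on, like a plain L2 switch."""

    def __init__(self, drop_from_phys=()):
        self.fdb = {}  # forwarding table: MAC -> port name
        # Source MACs whose frames are dropped when they arrive on the physical port.
        self.drop_from_phys = set(drop_from_phys)

    def receive(self, in_port, src_mac):
        """Process an incoming frame; return True if it was accepted and learned."""
        if in_port == "phys-port" and src_mac in self.drop_from_phys:
            return False  # dropped before learning, as the proposed flow would do
        self.fdb[src_mac] = in_port
        return True

    def lookup(self, dst_mac):
        """Return the port a frame to dst_mac would be sent to (None means flood)."""
        return self.fdb.get(dst_mac)


def simulate(bridge):
    # The local qr- interface is first seen on its own port.
    bridge.receive("qr-port", QR_MAC)
    # An ARP request broadcast by another host's qrouter- namespace arrives
    # from the physical network carrying the very same source MAC.
    bridge.receive("phys-port", QR_MAC)
    # Where would instance traffic addressed to the gateway MAC go now?
    return bridge.lookup(QR_MAC)


poisoned = simulate(LearningBridge())
protected = simulate(LearningBridge(drop_from_phys=[QR_MAC]))
print("without drop rule:", poisoned)
print("with drop rule:   ", protected)
```

Without the drop rule the lookup returns the physical port, which matches the reported symptom of traffic bypassing the qrouter- namespace. With the rule in place the entry for the local qr- port survives. A real deployment would match on ARP specifically and on the actual qr- MACs, which this sketch does not attempt.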