OVN Router sending ARP instead of sending traffic to the gateway

Bug #1881041 reported by Brendan Shephard
This bug affects 5 people
Affects: neutron
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Summary:

When a VM has a Floating IP, any attempt to reach a routed (off-subnet) network results in an ARP request being sent for the destination itself, instead of the traffic being sent to the gateway.

Description:
I have two VMs:

$ openstack server list -f yaml
- Flavor: ''
  ID: f875fc7c-f743-4234-8ccb-c03f6ae66289
  Image: Fedora_32
  Name: fedora_no_fip
  Networks: infra_external=172.20.10.201
  Status: ACTIVE
- Flavor: ''
  ID: 4dd45015-9ad6-4388-b458-3128cbdd784b
  Image: Fedora_32
  Name: fedora_test
  Networks: infra_internal=192.168.10.102, 172.20.10.107
  Status: ACTIVE

The one without the FIP can reach anything fine. For example, ping 1.1.1.1:
[root@overcloud-novacompute-1 ~]# tcpdump -i any host 172.20.10.201 -nevvv
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
22:52:13.970470 P fa:16:3e:47:ee:dd ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 64, id 59289, offset 0, flags [DF], proto ICMP (1), length 84)
    172.20.10.201 > 1.1.1.1: ICMP echo request, id 1, seq 36, length 64
22:52:13.978619 P 00:e0:67:15:cc:2f ethertype 802.1Q (0x8100), length 104: vlan 4, p 0, ethertype IPv4, (tos 0x0, ttl 56, id 38296, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 172.20.10.201: ICMP echo reply, id 1, seq 36, length 64

But, when I try the same from the VM with the Floating IP, I can see that an ARP is being sent for 1.1.1.1:
[root@overcloud-novacompute-1 ~]# tcpdump -i any host 172.20.10.107 -nevvv
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
22:55:42.779383 B fa:16:3e:d7:80:3a ethertype 802.1Q (0x8100), length 48: vlan 4, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 172.20.10.107, length 28
22:55:42.779476 Out fa:16:3e:d7:80:3a ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 172.20.10.107, length 28
22:55:42.779510 Out fa:16:3e:d7:80:3a ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 172.20.10.107, length 28
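
A capture like the one above can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the tcpdump line format shown above and the 172.20.0.0/16 CIDR of infra_external_subnet, and the helper name `off_subnet_arp` is hypothetical. It flags ARP requests whose target lies outside the external subnet, which is the symptom described here.

```python
import ipaddress
import re

# CIDR of infra_external_subnet, taken from the report.
EXTERNAL_SUBNET = ipaddress.ip_network("172.20.0.0/16")

# Matches the "who-has <target> tell <sender>" part of a tcpdump ARP line.
ARP_REQUEST = re.compile(r"Request who-has (\S+) tell ([^\s,]+)")


def off_subnet_arp(line, subnet=EXTERNAL_SUBNET):
    """Return (target, sender) if the line is an ARP request for an
    address outside the subnet, otherwise None."""
    match = ARP_REQUEST.search(line)
    if match is None:
        return None
    target, sender = match.groups()
    if ipaddress.ip_address(target) in subnet:
        return None  # ARP for an on-link address is normal
    return target, sender


sample = (
    "22:55:42.779476 Out fa:16:3e:d7:80:3a ethertype ARP (0x0806), length 44: "
    "Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 172.20.10.107, length 28"
)
print(off_subnet_arp(sample))
```

An ARP request for 172.20.0.254 would return None, since the gateway is on-link.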

The router has the gateway network set:
$ openstack router show infra_r1 -f yaml
admin_state_up: true
availability_zone_hints: null
availability_zones: null
created_at: '2020-05-27T11:43:43Z'
description: ''
external_gateway_info:
  enable_snat: true
  external_fixed_ips:
  - ip_address: 172.20.10.118
    subnet_id: bf21b56a-65c4-49fb-b345-b804c0429167
  network_id: 2561f8db-e1c8-4185-9056-0883686a8a53
flavor_id: null
id: 15c1b81d-b833-4d34-b622-4c6a0bd6c0d7
interfaces_info:
- ip_address: 192.168.10.1
  port_id: 65a28088-761c-461c-912c-7d0a3781ab6b
  subnet_id: 27382151-dbcc-4356-a080-47e181414e0b
location:
  cloud: ''
  project:
    domain_id: null
    domain_name: Default
    id: 0e446e02e899455193635c877772fae7
    name: admin
  region_name: regionOne
  zone: null
name: infra_r1
project_id: 0e446e02e899455193635c877772fae7
revision_number: 3
routes: []
status: ACTIVE
tags: []
updated_at: '2020-05-27T11:44:05Z'

Reproducer for me has been:
1. Deploy OpenStack with OVN DVR (Using TripleO, so the settings by default here: https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/environments/services/neutron-ovn-dvr-ha.yaml)
2. Create an external network that is a VLAN:
$ openstack network show infra_external -f yaml
admin_state_up: true
availability_zone_hints: []
availability_zones: []
created_at: '2020-05-27T11:43:24Z'
description: ''
dns_domain: ''
id: 2561f8db-e1c8-4185-9056-0883686a8a53
ipv4_address_scope: null
ipv6_address_scope: null
is_default: false
is_vlan_transparent: null
location:
  cloud: ''
  project:
    domain_id: null
    domain_name: Default
    id: 0e446e02e899455193635c877772fae7
    name: admin
  region_name: regionOne
  zone: null
mtu: 9000
name: infra_external
port_security_enabled: true
project_id: 0e446e02e899455193635c877772fae7
provider:network_type: vlan
provider:physical_network: datacentre
provider:segmentation_id: 4
qos_policy_id: null
revision_number: 2
router:external: true
segments: null
shared: false
status: ACTIVE
subnets:
- bf21b56a-65c4-49fb-b345-b804c0429167
tags: []
updated_at: '2020-05-27T11:43:30Z'

3. Subnet with the corresponding details:
$ openstack subnet show infra_external_subnet -f yaml
allocation_pools:
- end: 172.20.10.250
  start: 172.20.10.70
cidr: 172.20.0.0/16
created_at: '2020-05-27T11:43:30Z'
description: ''
dns_nameservers:
- 8.8.8.8
dns_publish_fixed_ip: null
enable_dhcp: true
gateway_ip: 172.20.0.254
host_routes: []
id: bf21b56a-65c4-49fb-b345-b804c0429167
ip_version: 4
ipv6_address_mode: null
ipv6_ra_mode: null
location:
  cloud: ''
  project:
    domain_id: null
    domain_name: Default
    id: 0e446e02e899455193635c877772fae7
    name: admin
  region_name: regionOne
  zone: null
name: infra_external_subnet
network_id: 2561f8db-e1c8-4185-9056-0883686a8a53
prefix_length: null
project_id: 0e446e02e899455193635c877772fae7
revision_number: 0
segment_id: null
service_types: []
subnetpool_id: null
tags: []
updated_at: '2020-05-27T11:43:30Z'

4. Internal network and a router, with the infra_external network set as the gateway (output provided earlier)

5. Create two VMs, one with a FIP and one directly attached to infra_external

6. Try to ping anything that would need to be routed by the gateway for infra_external_subnet:
gateway_ip: 172.20.0.254

I can ping that gateway fine. The issue only appears when the traffic needs to be routed by 172.20.0.254.

Versions:
$ cat /etc/redhat-release
CentOS Linux release 8.1.1911 (Core)

# rpm -qa | grep ovn
ovn-20.03.0-2.el8.x86_64
puppet-ovn-17.0.0-0.20200515234945.1d4c0ad.el8.noarch
ovn-host-20.03.0-2.el8.x86_64

$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-12.2.1-0.20200504123937.29a7fb8.el8.noarch

For the containers, I'm just using current-tripleo, but let me know if there is something else specific that I can get for you:
# podman image list | egrep 'ovn|neutron'
docker.io/tripleomaster/centos-binary-nova-novncproxy current-tripleo 544acd4346da 9 days ago 1.22 GB
docker.io/tripleomaster/centos-binary-neutron-server current-tripleo f19e459a94fd 9 days ago 1.19 GB
docker.io/tripleomaster/centos-binary-ovn-northd current-tripleo 8291433d7448 9 days ago 852 MB
docker.io/tripleomaster/centos-binary-ovn-northd pcmklatest 8291433d7448 9 days ago 852 MB
docker.io/tripleomaster/centos-binary-ovn-controller current-tripleo e8efc9a55bb2 9 days ago 734 MB

I'll share some ovn-trace outputs in the comments. This is getting a bit lengthy.

Expected Results:
OVN shouldn't send an ARP for a routed network.
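
To make the expected behaviour concrete, here is a minimal sketch of the usual next-hop decision, using the CIDR and gateway_ip of infra_external_subnet from this report. It illustrates ordinary IP forwarding, not OVN's actual implementation, and the function name `arp_target` is hypothetical.

```python
import ipaddress

SUBNET = ipaddress.ip_network("172.20.0.0/16")   # cidr of infra_external_subnet
GATEWAY = ipaddress.ip_address("172.20.0.254")   # gateway_ip of infra_external_subnet


def arp_target(destination, subnet=SUBNET, gateway=GATEWAY):
    """Return the address a sender on the subnet should resolve via ARP:
    the destination itself if it is on-link, otherwise the gateway."""
    dst = ipaddress.ip_address(destination)
    return dst if dst in subnet else gateway


for destination in ("172.20.10.201", "172.20.0.254", "1.1.1.1"):
    print(destination, "->", arp_target(destination))
```

Under this logic, traffic to 1.1.1.1 should trigger an ARP for 172.20.0.254, not for 1.1.1.1 as seen in the capture.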

Severity for me is not very high. It's just a home lab, but if there is a wider issue it could be a problem.

Tags: ovn pc1
Brendan Shephard (bshephar) wrote :

Two logical switches, one for each network:

()[root@overcloud-controller-0 /]# ovn-nbctl ls-list
e5bcc681-9bec-42b7-bedf-12ce8e9611de (neutron-2561f8db-e1c8-4185-9056-0883686a8a53)
0304d31c-f512-43bc-949e-4d45f754082c (neutron-9d4c5e96-bba6-4716-adb2-3d6c2ddd3903)

()[root@overcloud-controller-0 /]# ovn-nbctl show e5bcc681-9bec-42b7-bedf-12ce8e9611de
switch e5bcc681-9bec-42b7-bedf-12ce8e9611de (neutron-2561f8db-e1c8-4185-9056-0883686a8a53) (aka infra_external)
    port 9075cf11-d5e4-4e60-84f8-5dd38ff72833
        type: localport
        addresses: ["fa:16:3e:91:da:cc 172.20.10.70"]
    port e696d78b-13c4-4781-8bd5-f6a7db16daee
        type: router
        router-port: lrp-e696d78b-13c4-4781-8bd5-f6a7db16daee
    port provnet-2561f8db-e1c8-4185-9056-0883686a8a53
        type: localnet
        tag: 4
        addresses: ["unknown"]
    port 75a72825-0c32-4a86-8896-72b9cbfb6995
        addresses: ["fa:16:3e:47:ee:dd 172.20.10.201"]
()[root@overcloud-controller-0 /]# ovn-nbctl show 0304d31c-f512-43bc-949e-4d45f754082c
switch 0304d31c-f512-43bc-949e-4d45f754082c (neutron-9d4c5e96-bba6-4716-adb2-3d6c2ddd3903) (aka infra_internal)
    port b975d1ca-3b33-4177-bbcb-d07439f1638e
        type: localport
        addresses: ["fa:16:3e:be:97:a0 192.168.10.10"]
    port 65a28088-761c-461c-912c-7d0a3781ab6b
        type: router
        router-port: lrp-65a28088-761c-461c-912c-7d0a3781ab6b
    port 12427559-a937-4b50-a64c-aef54a3284d8
        addresses: ["fa:16:3e:7c:36:ff 192.168.10.102"]

()[root@overcloud-controller-0 /]# ovn-trace infra_internal 'inport == "12427559-a937-4b50-a64c-aef54a3284d8" && eth.src == fa:16:3e:7c:36:ff && ip4.src == 192.168.10.102 && eth.dst == fa:16:3e:be:97:a0 && ip4.dst == 1.1.1.1'
# ip,reg14=0x3,vlan_tci=0x0000,dl_src=fa:16:3e:7c:36:ff,dl_dst=fa:16:3e:be:97:a0,nw_src=192.168.10.102,nw_dst=1.1.1.1,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0

ingress(dp="infra_internal", inport="124275")
---------------------------------------------
 0. ls_in_port_sec_l2 (ovn-northd.c:4516): inport == "124275" && eth.src == {fa:16:3e:7c:36:ff}, priority 50, uuid f869e22a
    next;
 1. ls_in_port_sec_ip (ovn-northd.c:4188): inport == "124275" && eth.src == fa:16:3e:7c:36:ff && ip4.src == {192.168.10.102}, priority 90, uuid ec3f6e49
    next;
 3. ls_in_pre_acl (ovn-northd.c:4706): ip, priority 100, uuid 8ca99cd5
    reg0[0] = 1;
    next;
 5. ls_in_pre_stateful (ovn-northd.c:4895): reg0[0] == 1, priority 100, uuid dd15ba61
    ct_next;

ct_next(ct_state=est|trk /* default (use --ct to customize) */)
---------------------------------------------------------------
 6. ls_in_acl (ovn-northd.c:5086): (!ct.trk || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 0)) && (inport == @pg_63bc7fdf_3061_410f_9e82_80278b987928 && ip4), priority 2002, uuid 655e4046
    next;
19. ls_in_l2_lkup (ovn-northd.c:6757): eth.dst == fa:16:3e:be:97:a0, priority 50, uuid e74c5d8a
    outport = "b975d1";
    output;

egress(dp="infra_internal", inport="124275", outport="b975d1")
--------------------------------------------------------------
 1. ls_out_pre_acl (ovn-northd.c:4708): ip, priority 100, uuid 79c0a63a
    reg0[0] = 1;
    next;
 2. ls_out_pre_stateful (ovn-northd.c...


Daniel Alvarez (dalvarezs) wrote :

This should be fixed in OVN master by this patch, which merged yesterday:

https://patchwork.ozlabs.org/project/openvswitch/patch/92f6a2f668708c677a8b10b0ac861bfd712f6a20<email address hidden>/

For TripleO jobs we need a new build. Typically we get a Fedora package and rebuild in CBS from it.

Brendan Shephard (bshephar) wrote :

Thanks heaps Daniel.

I'll check out the patch and see if I can compile it myself to test it out.

Brendan Shephard (bshephar) wrote :

I believe there is some downstream work going on for the same issue. It appears the downstream tests worked for non-DVR. I have also tried this and can confirm that it does work if DVR is disabled, which makes sense given the comments in that patch. For anyone else who finds this bug, this is what my environment looks like after disabling DVR:

router 947f8372-e9bf-4a30-8f8f-7d0c73e30983 (neutron-15c1b81d-b833-4d34-b622-4c6a0bd6c0d7) (aka infra_r1)
    port lrp-65a28088-761c-461c-912c-7d0a3781ab6b
        mac: "fa:16:3e:69:fa:b0"
        networks: ["192.168.10.1/24"]
    port lrp-e696d78b-13c4-4781-8bd5-f6a7db16daee
        mac: "fa:16:3e:0a:c9:40"
        networks: ["172.20.10.118/16"]
        gateway chassis: [a3dfd16b-f743-4258-ab8e-513309166e4e]
    nat a3fa9ffa-7fd6-446a-afd7-5766d6ebc62c
        external ip: "172.20.10.118"
        logical ip: "192.168.10.0/24"
        type: "snat"
    nat d86c6d21-05f0-471a-8315-8c6b78b7bc3c
        external ip: "172.20.10.107"
        logical ip: "192.168.10.102"
        type: "dnat_and_snat"
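
The two NAT entries above overlap for 192.168.10.102. The sketch below shows which external IP each source address would be translated to, assuming the entry with the most specific logical IP wins. That assumption reflects my reading of how OVN prioritises SNAT rules and is not stated in this report; the function name `translated_source` is hypothetical.

```python
import ipaddress

# (type, logical_ip, external_ip) copied from the router's NAT entries above.
NAT_RULES = [
    ("snat", "192.168.10.0/24", "172.20.10.118"),
    ("dnat_and_snat", "192.168.10.102", "172.20.10.107"),
]


def translated_source(source, rules=NAT_RULES):
    """Return the external IP for a source address, preferring the rule
    with the longest matching prefix (assumption), or None if none match."""
    src = ipaddress.ip_address(source)
    candidates = [
        (ipaddress.ip_network(logical).prefixlen, external)
        for _nat_type, logical, external in rules
        if src in ipaddress.ip_network(logical)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate[0])[1]


print(translated_source("192.168.10.102"))  # the VM with the Floating IP
print(translated_source("192.168.10.50"))   # any other host on infra_internal
```

This matches the captures earlier in the report, where traffic from the Floating IP VM appears with source 172.20.10.107.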

The Gateway chassis specified:

Chassis "a3dfd16b-f743-4258-ab8e-513309166e4e"
    hostname: overcloud-controller-0.localdomain
    Encap geneve
        ip: "172.16.0.225"
        options: {csum="true"}
    Port_Binding cr-lrp-e696d78b-13c4-4781-8bd5-f6a7db16daee

[fedora@fedora-test ~]$ ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=55 time=15.6 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=55 time=9.82 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=55 time=8.70 ms
64 bytes from 1.1.1.1: icmp_seq=4 ttl=55 time=9.25 ms

So, to work around the issue with TripleO:

parameter_defaults:
  NeutronEnableDVR: false

Wojciech (suzumushi) wrote :

To get DVR working right now, rather than waiting for updated packages, we had to modify these container images:
     - ovn-controller
     - ovn-northd
     - ovn-nb-db-server
     - ovn-sb-db-server
We used the tripleo-modify-image role with OVN packages from the f32-updates repos, updating the packages below along with their dependencies:

openvswitch-2.13.0-1.fc32.x86_64
ovn-20.06.2-1.fc32.x86_64
ovn-host-20.06.2-1.fc32.x86_64
python3-openvswitch-2.13.0-1.fc32.x86_64

The images were updated from the 1d15ec797db16e5cb22c7a21444b6cbc container tag.

regards

w

Junien Fridrick (axino)
tags: added: pc1