neutron

[ovn] metadata route missing on the guest

Bug #1959098 reported by Przemyslaw Lal on 2022-01-26

This bug affects 2 people

Affects            Status      Importance   Assigned to   Milestone
neutron            New         Undecided    Unassigned
neutron (Ubuntu)   Confirmed   Undecided    Unassigned

Bug Description

* High level description

The metadata server (169.254.169.254) is unreachable from VMs attached to one particular network (the "affected" network); every other network in the cluster works. DHCP is enabled on that subnet and VMs get their IP addresses on boot, but the route to the metadata IP is missing from the guest's routing table:
$ ip r
default via 10.134.253.1 dev eth0
10.134.253.0/24 dev eth0 scope link src 10.134.253.181

Because of that, cloud-init's metadata requests are sent to the router rather than to the ovnmeta network namespace.

On guests running in an unaffected network, the routing table after booting or after sending a DHCP request looks like this, and the metadata endpoint is reachable:
$ ip r
default via 172.16.2.1 dev eth0
169.254.169.254 via 172.16.2.10 dev eth0
172.16.2.0/24 dev eth0 scope link src 172.16.2.248
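
The comparison between the two routing tables can be automated. The following is a minimal sketch, assuming the text of `ip r` has already been captured from the guest; `find_metadata_route` is a hypothetical helper name, not part of neutron or cloud-init.

```python
import ipaddress

METADATA_IP = ipaddress.ip_address("169.254.169.254")


def find_metadata_route(ip_route_output):
    """Return (destination, next_hop) for the most specific non-default
    route that covers the metadata IP, or None if only the default route
    would be used."""
    best = None
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "default":
            continue
        try:
            network = ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            continue  # skip route types such as "unreachable ..."
        if METADATA_IP not in network:
            continue
        # Directly connected routes have no "via" keyword.
        next_hop = fields[fields.index("via") + 1] if "via" in fields else None
        if best is None or network.prefixlen > best[0].prefixlen:
            best = (network, next_hop)
    if best is None:
        return None
    return str(best[0]), best[1]
```

Fed the affected guest's table, the helper returns None; fed the unaffected guest's table, it reports the host route via 172.16.2.10.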

I managed to work around this by manually adding a route to the metadata IP via the DHCP port on the router attached to that network. However, I believe this should not be needed, and no such configuration is present on any of the "good" networks in this cluster.

Please let me know what logs and other information would be useful here.

* Step-by-step reproduction steps

1) Create a VM attached to the affected network.
2) The metadata server is unreachable and cloud-init fails, because the DHCP server does not provide the metadata route.

* Expected output

I'd expect the metadata route to be present on the guest:

$ ip r
default via 10.134.253.1 dev eth0
169.254.169.254 via 10.134.253.2 dev eth0
10.134.253.0/24 dev eth0 scope link src 10.134.253.181
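
A host route like the one expected here is normally handed to DHCP clients through the classless static route option (option 121, RFC 3442). As a rough illustration of what that option carries on the wire, here is a minimal sketch; `encode_classless_routes` is a hypothetical helper written for this example, and the addresses are the ones from the expected output above.

```python
import ipaddress


def encode_classless_routes(routes):
    """Encode (destination CIDR, router) pairs as a DHCP option 121
    payload, following the layout described in RFC 3442."""
    payload = bytearray()
    for destination, router in routes:
        network = ipaddress.IPv4Network(destination)
        # Only the octets covered by the prefix are transmitted.
        significant_octets = (network.prefixlen + 7) // 8
        payload.append(network.prefixlen)
        payload += network.network_address.packed[:significant_octets]
        payload += ipaddress.IPv4Address(router).packed
    return bytes(payload)


# The metadata route plus the default route from the expected output.
option_121 = encode_classless_routes([
    ("169.254.169.254/32", "10.134.253.2"),
    ("0.0.0.0/0", "10.134.253.1"),
])
print(option_121.hex())
```

RFC 3442 says a client that receives option 121 ignores the plain router option, which is why the default route is encoded alongside the metadata route in this sketch.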

* Actual output

$ ip r
default via 10.134.253.1 dev eth0
10.134.253.0/24 dev eth0 scope link src 10.134.253.181

* Versions
neutron-common 2:16.4.1-0ubuntu2
neutron-ovn-metadata-agent 2:16.4.1-0ubuntu2
python3-neutron 2:16.4.1-0ubuntu2
python3-neutron-lib 2.3.0-0ubuntu1
python3-neutronclient 1:7.1.1-0ubuntu1
ovn-common 20.03.2-0ubuntu0.20.04.1
ovn-host 20.03.2-0ubuntu0.20.04.1
openvswitch-common 2.13.3-0ubuntu0.20.04.2
openvswitch-switch 2.13.3-0ubuntu0.20.04.2
python3-openvswitch 2.13.3-0ubuntu0.20.04.2
python3-ovsdbapp 1.1.0-0ubuntu2

Host OS: Ubuntu 20.04.3 LTS
Kernel: 5.8.0-48-generic #54~20.04.1-Ubuntu
Deployment: Juju charms

Guest OS: cirros 0.5.2 and Ubuntu 20.04, so most likely all distros are affected

* Environment

42 compute nodes, nova-compute 21.2.2-0ubuntu1 + libvirt 6.0.0-0ubuntu8.14 + KVM.
Deployed with Juju charms.

* Perceived severity

Not a blocker since there is a workaround.

yatin (yatinkarel) wrote on 2022-01-31:

Hi Przemysław Lal,

What do you mean by the "affected" network? Do you mean there are multiple networks in your setup, and only one of them is misbehaving in terms of routes?

If so, has the "affected" network been misbehaving since it was created, or did it work earlier and stop working later? What other differences are there between the affected and unaffected networks/subnets?

The following information would be good to collect for the affected network, after removing the workarounds:
- openstack network show <network id>
- openstack subnet show <subnet id>
- openstack port list --device-owner network:distributed --network <affected network>
- ovn-nbctl find DHCP_Options external_ids:subnet_id=<subnet-id>

Please also collect the same information from an unaffected network.
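
To gather identical output for both networks, the four commands listed above can be generated from the IDs. This is a minimal sketch that only builds and prints the command lines; `diagnostic_commands` and the placeholder IDs are hypothetical, and the commands themselves are the ones requested in this comment.

```python
import shlex


def diagnostic_commands(network_id, subnet_id):
    """Build the argv lists for the diagnostics requested in this thread."""
    return [
        ["openstack", "network", "show", network_id],
        ["openstack", "subnet", "show", subnet_id],
        ["openstack", "port", "list",
         "--device-owner", "network:distributed",
         "--network", network_id],
        ["ovn-nbctl", "find", "DHCP_Options",
         f"external_ids:subnet_id={subnet_id}"],
    ]


def render(commands):
    """Render argv lists as shell-quoted lines, ready to paste."""
    return "\n".join(shlex.join(argv) for argv in commands)


# Placeholder IDs; substitute the real affected and unaffected IDs.
for label, net, sub in [("affected", "net-a", "subnet-a"),
                        ("unaffected", "net-b", "subnet-b")]:
    print(f"# {label} network")
    print(render(diagnostic_commands(net, sub)))
```

Each argv list can also be passed directly to `subprocess.run` on a host that has the OpenStack and OVN clients configured.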

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2022-02-01:

Hi Przemysław:

Apart from what Yatin requested, which is necessary to debug this issue, can you confirm that "ovn_metadata_enabled" is True? I guess it is, because other subnets have this port, but it doesn't hurt to double-check.

Just in case, please check whether you have [1] and [2] in your code. If you changed the subnet parameters ("enable_dhcp") without those patches, you could be in an undefined state now.

Can you also check in the server logs whether the metadata port was created during network creation? [3]

Regards.

[1] https://review.opendev.org/q/I05394e49077a72199bbc80c8cb622ec2b17f2fa7
[2] https://review.opendev.org/q/I09cc14dff6933aae63cbd43a29f9221f405ecede
[3] https://github.com/openstack/neutron/blob/e7b70521d0e230143a80974e7e4795a2acafcc9b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L1753

Launchpad Janitor (janitor) wrote on 2022-05-12:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status:	New → Confirmed

Jose Guedez (jfguedez) wrote on 2022-05-12:

We hit this issue today as well. Same symptoms:

* Failed to get metadata during VM launch - consistently and only on the "affected" network. Other networks like "unaffected" are OK.
* Missing metadata route inside VM
* After adding the route manually to the .2 IP we can ping/curl the metadata endpoint with no issues, so it seems the route is the only thing missing.
* The workaround of adding the metadata route explicitly to the relevant router allows new VMs in the affected network to get metadata without problems.

These are the current packages:

ii neutron-common 2:16.4.2-0ubuntu1
ii neutron-ovn-metadata-agent 2:16.4.2-0ubuntu1
ii python3-neutron 2:16.4.2-0ubuntu1
ii python3-neutron-lib 2.3.0-0ubuntu1
ii python3-neutronclient 1:7.1.1-0ubuntu1

I am attaching the information requested above for an "affected" and "unaffected" network. The main difference I see is that the "unaffected" subnet has the following option in the ovn-nb that is missing from the "affected" subnet:

classless_static_route="{169.254.169.254/32,10.131.83.2, 0.0.0.0/0,10.131.83.1}"
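
Checking many subnets for this option by eye is error-prone, so here is a minimal sketch that parses the value and reports the next hop for the metadata IP. It assumes the option string has the shape shown above; both function names are hypothetical.

```python
def parse_classless_static_route(value):
    """Split an OVN classless_static_route value into
    (destination, next_hop) pairs."""
    inner = value.strip().strip('"').strip("{}")
    items = [item.strip() for item in inner.split(",") if item.strip()]
    if len(items) % 2:
        raise ValueError(f"unpaired entry in route list: {value!r}")
    return list(zip(items[0::2], items[1::2]))


def metadata_next_hop(value, metadata_cidr="169.254.169.254/32"):
    """Return the next hop for the metadata route, or None if absent."""
    for destination, next_hop in parse_classless_static_route(value):
        if destination == metadata_cidr:
            return next_hop
    return None


unaffected = "{169.254.169.254/32,10.131.83.2, 0.0.0.0/0,10.131.83.1}"
print(metadata_next_hop(unaffected))
```

A subnet whose DHCP_Options row has no classless_static_route at all, as described for the affected subnet, would need to be handled by the caller before parsing.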

The two patches you mention are indeed included in python3-neutron 2:16.4.2-0ubuntu1. I additionally confirmed by checking /usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py

Regarding "ovn_metadata_enabled", I didn't find it set to "true" in any config under /etc/neutron. I can only see the default commented out and no mention in neutron_ovn_metadata_agent.ini, which has the ovs/ovn config in it (but I am no expert)

/etc/neutron# grep -r ovn_metadata
ovn.ini:#ovn_metadata_enabled = false

The creation logs are no longer available. The ports for the .2 IPs are created in the subnet, and they do have a device_id of ovnmeta-<networkid>, but the device_owner is network:dhcp and not network:distributed as you seem to be expecting. I added the output of `port show` for them as well. Note that other networks on the same compute nodes have no issues providing metadata, including the "unaffected" network (data attached).
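
Since the device_owner differs from what the earlier query filtered on, a check for the metadata port may need to accept both values. The sketch below is based only on what is described in this thread: a device_id of ovnmeta-<networkid> and a device_owner of either network:dhcp or network:distributed. The function name and sample IDs are hypothetical, and the port dict is assumed to come from `openstack port show -f json`.

```python
METADATA_DEVICE_ID_PREFIX = "ovnmeta-"
# Both owners mentioned in this thread; which one applies is assumed
# to depend on the neutron version in use.
CANDIDATE_OWNERS = {"network:dhcp", "network:distributed"}


def looks_like_ovn_metadata_port(port, network_id):
    """Heuristic check that a port dict is the OVN metadata port
    of the given network."""
    return (
        port.get("device_id") == METADATA_DEVICE_ID_PREFIX + network_id
        and port.get("device_owner") in CANDIDATE_OWNERS
    )


# Hypothetical sample resembling the ports described above.
sample = {"device_id": "ovnmeta-1234", "device_owner": "network:dhcp"}
print(looks_like_ovn_metadata_port(sample, "1234"))
```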

Jose Guedez (jfguedez) wrote on 2022-05-12:

LP1959098-unaffected-network.txt (9.9 KiB, text/plain)

unaffected network/subnet information

Jose Guedez (jfguedez) wrote on 2022-05-12:

LP1959098-affected-network.txt (9.8 KiB, text/plain)

affected network/subnet information

Jose Guedez (jfguedez) wrote on 2022-05-12:

Actually, I can confirm that ovn_metadata_enabled is set to "True". I was looking in the wrong place (the compute/metadata agent server); it seems to be set on the API server node:

/etc/neutron# grep -r ovn_metadata
plugins/ml2/ml2_conf.ini:ovn_metadata_enabled = True
ovn.ini:#ovn_metadata_enabled = false
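
A plain grep matches commented-out defaults as well, which is what made the first check misleading. As a minimal sketch, the helper below reports only uncommented assignments. It works on file contents passed in as strings; the `[ovn]` section header in the sample text is an assumption, and `find_setting` is a hypothetical name.

```python
def find_setting(files, option="ovn_metadata_enabled"):
    """Return (filename, value) for every uncommented assignment
    of `option` in the given {filename: text} mapping."""
    hits = []
    for name, text in files.items():
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith(("#", ";")):
                continue  # commented-out default, not an active setting
            key, sep, value = stripped.partition("=")
            if sep and key.strip() == option:
                hits.append((name, value.strip()))
    return hits


# Sample contents modelled on the grep output above (section name assumed).
configs = {
    "plugins/ml2/ml2_conf.ini": "[ovn]\novn_metadata_enabled = True\n",
    "ovn.ini": "[ovn]\n#ovn_metadata_enabled = false\n",
}
print(find_setting(configs))
```

Run against the real files on each node, an empty result on the compute nodes and a single hit on the API server node would match what is reported in this comment.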