DHCP reserved ports that were unscheduled are advertised as DNS servers

Bug #1852504 reported by Arjun Baindur
This bug affects 5 people
Affects: neutron
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

We have 2 DHCP servers per network. After network outages, when hosts come back online, the number of ACTIVE DHCP servers grows. This happened again after more outages, with some networks ending up with 9-10+ DHCP ports, many in ACTIVE state, despite neutron-server's neutron.conf setting dhcp_agents_per_network = 2.
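
For reference, that option lives in the [DEFAULT] section of neutron.conf on the neutron-server node:

[DEFAULT]
dhcp_agents_per_network = 2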

It turns out these extra ports have their device_id set to "reserved_dhcp_port".
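
You can confirm this per port, for example with one of the port IDs from the listing further below:

openstack port show 02ff0f4c-f39d-4207-90b4-2a69585f4c8a -c device_id -c status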

As you can see here: https://github.com/openstack/neutron/blob/master/neutron/db/agentschedulers_db.py#L399

When a network is rescheduled to a new DHCP agent, the old port is neither deleted nor marked as DOWN. All that happens is that its device_id is rewritten to mark it as reserved and the port is updated.
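
The logic at that line looks roughly like this (paraphrased from the linked file; exact code may differ across branches):

# Paraphrased from remove_network_from_dhcp_agent() in agentschedulers_db.py.
# The old agent's port is parked, not deleted, so its fixed IP can be reused
# on a later add. Note that the port status is never touched.
device_id = utils.get_dhcp_agent_device_id(network_id, agent['host'])
ports = self.plugin.get_ports(context, filters={'device_id': [device_id]})
for port in ports:
    port['device_id'] = constants.DEVICE_ID_RESERVED_DHCP_PORT
    self.plugin.update_port(context, port['id'], {'port': port})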

However, VMs on the network are now advertised every DHCP port on the network as an internal DNS server, which in our case left several stale entries in /etc/resolv.conf. The problem is that some of these DHCP agents have been unscheduled, so those DNS servers no longer exist. On top of that, the resolver in the VMs never queries more than 3 entries.

Here is resolv.conf on a VM:

[root@arjunpmk-master ~]# cat /etc/resolv.conf

# Generated by NetworkManager
search mpt1.pf9.io
nameserver 10.128.144.16
nameserver 10.128.144.23
nameserver 10.128.144.15
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.128.144.7
nameserver 10.128.144.4
nameserver 10.128.144.8
nameserver 10.128.144.9
nameserver 10.128.144.17
nameserver 10.128.144.12
nameserver 10.128.144.45
nameserver 10.128.144.46
nameserver 10.128.144.51

Here are all the DHCP ports on this VM's network:

[root@df-us-mpt1-kvm arjun(admin)]# openstack port list --network ead88ed3-f1e0-4498-8c1e-6d091083ae33 --device-owner network:dhcp
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
| 02ff0f4c-f39d-4207-90b4-2a69585f4c8a | | fa:16:3e:a9:36:82 | ip_address='10.128.144.16', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| 0b612f86-ad06-4bce-a333-bc18f3e9e7b1 | | fa:16:3e:bb:d8:3d | ip_address='10.128.144.23', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN |
| 402338ac-2ca6-4312-a2df-a306fc589f10 | | fa:16:3e:a3:a8:57 | ip_address='10.128.144.15', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| 5d2edc73-4eff-44c0-8993-125636973384 | | fa:16:3e:6c:cd:2b | ip_address='10.128.144.7', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| 78241da3-9674-479a-8b45-a580c7f8b117 | | fa:16:3e:d0:9d:ef | ip_address='10.128.144.4', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| 7b41bf47-d4d4-434a-b704-4c67182ffcaa | | fa:16:3e:4c:cf:54 | ip_address='10.128.144.8', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| 96897190-1aa8-4c17-a7d1-c3744f1bf962 | | fa:16:3e:e8:55:29 | ip_address='10.128.144.45', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| af87dde6-fb46-4516-9569-e46496398b64 | | fa:16:3e:0e:61:14 | ip_address='10.128.144.9', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| c2a2112d-c6ef-4411-a415-1a453d74a838 | | fa:16:3e:d0:39:67 | ip_address='10.128.144.46', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN |
| c8298fbd-06e7-4488-a3e1-874e9341d4cf | | fa:16:3e:d6:3c:ac | ip_address='10.128.144.51', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN |
| d6f0206f-ae3c-4ebf-95cb-104dad786724 | | fa:16:3e:ab:ab:22 | ip_address='10.128.144.17', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| e2be0f98-3333-4645-b58a-435e5513a4d3 | | fa:16:3e:b4:ba:c0 | ip_address='10.128.144.12', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN |
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+

If I look at the first DNS server in the VM's resolv.conf (10.128.144.16), its status is ACTIVE but it is actually a reserved port. The same is true of the 2nd nameserver entry. Luckily the 3rd entry is valid, but because the first two fail, every DNS lookup times out against them first and takes 10 seconds. VMs on other networks aren't so lucky: there, all 3 queried nameservers are reserved ports.
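
As a client-side stopgap only (not a fix), the glibc resolver timeouts can be shortened so dead entries hurt less, e.g. by adding a line like this to resolv.conf:

options timeout:1 attempts:1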

Expectation: only DHCP ports that are actually scheduled (not reserved) should be advertised as DNS nameservers. I don't know whether that means marking the port DOWN or deleting it when it is unscheduled.

Maybe the status also needs to be updated here? https://github.com/openstack/neutron/blob/master/neutron/db/agentschedulers_db.py#L417
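
For illustration, a hypothetical sketch of that change (untested; names follow the linked file):

from neutron_lib import constants

# Hypothetical: when parking the old DHCP port as reserved, also mark it
# DOWN so that consumers of the port list can filter it out.
for port in ports:
    port['device_id'] = constants.DEVICE_ID_RESERVED_DHCP_PORT
    port['status'] = constants.PORT_STATUS_DOWN  # the proposed addition
    self.plugin.update_port(context, port['id'], {'port': port})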

Revision history for this message
Arjun Baindur (abaindur) wrote :

Actually, I am not sure the port status (ACTIVE/DOWN) even matters. In my case the VM has nameserver 10.128.144.23 as its 2nd entry, and that port is in status DOWN.

I think the problem is on the agent side. It appends every port's IP to the dns-server DHCP option it advertises, based only on whether the device_owner field is "network:dhcp"; it takes neither the reserved device_id nor the port status into account:

https://github.com/openstack/neutron/blob/stable/rocky/neutron/agent/linux/dhcp.py#L1089
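
Something like this (paraphrased and simplified; the real code keeps per-subnet index bookkeeping, and the device_id guard is the hypothetical fix):

from neutron_lib import constants

# Sketch of the loop in the Dnsmasq driver that collects DHCP port IPs to
# advertise via the dns-server option. The device_id check is the
# hypothetical addition: skip ports parked as reserved_dhcp_port.
for port in self.network.ports:
    if port.device_owner != constants.DEVICE_OWNER_DHCP:
        continue
    if port.device_id == constants.DEVICE_ID_RESERVED_DHCP_PORT:
        continue  # hypothetical: do not advertise unscheduled ports
    for ip in port.fixed_ips:
        dhcp_ips[ip.subnet_id].append(ip.ip_address)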

Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: nobody → Mithil Arun (arun-mithil)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/694859

Changed in neutron:
status: Confirmed → In Progress
tags: added: l3-ipam-dhcp
Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Mithil Arun (arun-mithil) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/694859
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
status: New → Confirmed