DHCP reserved ports that were unscheduled are advertised as DNS servers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Confirmed | Medium | Unassigned |
Bug Description
We run 2 DHCP servers per network. After network outages, when hosts come back online, the number of ACTIVE DHCP servers grows. This happened again after more outages, with some networks having 9-10 or more DHCP ports, many in ACTIVE state, despite neutron-server's neutron.conf only having dhcp_agents_
It turns out these are "reserved_
As you can see here: https:/
When a network is rescheduled to a new DHCP agent, the old port is not deleted, nor is its status marked as DOWN. The port is only marked as reserved and updated.
However, VMs on the network are now advertised every DHCP port on the network as an internal DNS server, which in our case leaves several stale entries in /etc/resolv.conf. The problem is that some of these DHCP agents have been unscheduled, so those DNS servers don't actually exist. Also, inside the VMs, entries beyond the first 3 are not queried.
Here is resolv.conf on a VM:
[root@arjunpmk-
# Generated by NetworkManager
search mpt1.pf9.io
nameserver 10.128.144.16
nameserver 10.128.144.23
nameserver 10.128.144.15
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.128.144.7
nameserver 10.128.144.4
nameserver 10.128.144.8
nameserver 10.128.144.9
nameserver 10.128.144.17
nameserver 10.128.144.12
nameserver 10.128.144.45
nameserver 10.128.144.46
nameserver 10.128.144.51
Here you can see all the DHCP ports for the network of this VM:
[root@df-
+------
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+------
| 02ff0f4c-
| 0b612f86-
| 402338ac-
| 5d2edc73-
| 78241da3-
| 7b41bf47-
| 96897190-
| af87dde6-
| c2a2112d-
| c8298fbd-
| d6f0206f-
| e2be0f98-
+------
If I look up the first DNS server in the VM's resolv.conf (10.128.144.16), its status is ACTIVE but it's actually a reserved port. The same is true of the 2nd nameserver entry. Luckily the 3rd entry is valid, but every DNS lookup still takes 10 seconds because the first two must time out first. VMs on other networks aren't so lucky: there, all 3 nameservers are reserved ports.
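The 10-second figure can be reproduced with a rough model of the resolver: it walks the nameserver list in order, waits out a timeout on each dead entry, and ignores anything past the third. This is only a sketch. The 5-second per-server timeout and the 3-server limit are assumed to be the libc resolver defaults in the affected VMs, and the function name is made up for illustration.

```python
def first_answer_delay(nameservers, reachable, timeout=5, max_ns=3):
    """Seconds spent waiting on dead servers before a live one is reached.

    Assumptions (not taken from the report): a 5-second timeout per server
    and a limit of 3 consulted nameservers. Retries are not modelled.
    Returns None when none of the consulted servers is reachable.
    """
    waited = 0
    for ns in nameservers[:max_ns]:
        if ns in reachable:
            return waited
        waited += timeout
    return None


# First four entries from the resolv.conf above; only the 3rd one answers.
nameservers = ["10.128.144.16", "10.128.144.23", "10.128.144.15", "10.128.144.7"]
print(first_answer_delay(nameservers, reachable={"10.128.144.15"}))  # 10
```

If the only live server sits in position 4 or later, the model returns None, which corresponds to the networks where all 3 consulted nameservers are reserved ports.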
Expectation: Only DHCP ports that are actually scheduled (not reserved) should be advertised as DNS nameservers. I don't know if this means marking the port as DOWN, or deleting the port when unscheduled.
Maybe the status also needs to be updated here? https:/
Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: nobody → Mithil Arun (arun-mithil)
tags: added: l3-ipam-dhcp
I'm actually not sure the port status (ACTIVE/DOWN) even matters. In my case the VM has nameserver 10.128.144.23 as its 2nd entry, and that port is in status DOWN.
I think the problem is on the agent side. It appends every port to the dns-server DHCP option it advertises, based only on whether the device_owner field is "network:dhcp". It doesn't take the reserved marker or the port status into account:
https://github.com/openstack/neutron/blob/stable/rocky/neutron/agent/linux/dhcp.py#L1089
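To make the suggestion concrete, here is a minimal sketch of the kind of filter I have in mind. It is not the code in dhcp.py. Ports are plain dicts shaped like Neutron port records, the function name is hypothetical, and the check for a device_id starting with "reserved_" is an assumption based on the marker mentioned in the description.

```python
DHCP_OWNER = "network:dhcp"
# Assumption: reserved ports can be recognised by a device_id that starts
# with this prefix. The exact marker is not confirmed here.
RESERVED_PREFIX = "reserved_"


def dns_servers_to_advertise(ports):
    """Collect IPs of DHCP ports that are still scheduled to an agent."""
    servers = []
    for port in ports:
        if port.get("device_owner") != DHCP_OWNER:
            continue  # not a DHCP port at all
        if (port.get("device_id") or "").startswith(RESERVED_PREFIX):
            continue  # unscheduled port, nothing is listening on it
        servers.extend(ip["ip_address"] for ip in port.get("fixed_ips", []))
    return servers


# Hypothetical port records; the device_id values are placeholders.
ports = [
    {"device_owner": "network:dhcp", "device_id": "reserved_placeholder",
     "status": "ACTIVE", "fixed_ips": [{"ip_address": "10.128.144.16"}]},
    {"device_owner": "network:dhcp", "device_id": "dhcp-agent-a",
     "status": "ACTIVE", "fixed_ips": [{"ip_address": "10.128.144.15"}]},
    {"device_owner": "compute:nova", "device_id": "vm-1",
     "status": "ACTIVE", "fixed_ips": [{"ip_address": "10.128.144.30"}]},
]
print(dns_servers_to_advertise(ports))  # ['10.128.144.15']
```

The filter deliberately ignores the status field, since a reserved port can show up as either ACTIVE or DOWN.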