neutron

DHCP reserved ports that were unscheduled are advertised as DNS servers

Bug #1852504 reported by Arjun Baindur on 2019-11-13

This bug affects 5 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Confirmed	Medium	Unassigned

Bug Description

We have 2 DHCP servers per network. After network outages, when hosts come back online, the number of ACTIVE DHCP servers grows. This happened again after further outages, with some networks ending up with 9-10+ DHCP ports, many in ACTIVE state, even though neutron-server's neutron.conf only has dhcp_agents_per_network = 2.

It turns out these are reserved ports: their device_id is "reserved_dhcp_port".

As you can see here: https://github.com/openstack/neutron/blob/master/neutron/db/agentschedulers_db.py#L399

When a network is rescheduled to a new DHCP agent, the old port is not deleted, nor is its status set to DOWN. The port is only marked as reserved and then updated.
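To make the effect concrete, here is a minimal sketch of that behaviour using plain dicts. It is not the neutron code itself: reserve_dhcp_port and the sample device_id are hypothetical, and only the "reserved_dhcp_port" value and the "network:dhcp" owner come from this report.

```python
# Value seen in the device_id of the stale ports in this report.
RESERVED_DHCP_DEVICE_ID = "reserved_dhcp_port"


def reserve_dhcp_port(port):
    """Hypothetical helper: return a copy of the port marked as reserved.

    Only device_id changes. status and device_owner are left as they
    were, which is the behaviour described in this report.
    """
    updated = dict(port)
    updated["device_id"] = RESERVED_DHCP_DEVICE_ID
    return updated


# Illustrative port; the original device_id is a made-up placeholder.
port = {
    "device_owner": "network:dhcp",
    "device_id": "dhcp-agent-host-a",
    "status": "ACTIVE",
}
reserved = reserve_dhcp_port(port)
```

After the call, the port still has status ACTIVE and device_owner "network:dhcp", so anything that selects DHCP ports by owner alone cannot tell it apart from a scheduled one.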

However, VMs on the network are now advertised every DHCP port on the network as an internal DNS server, which in our case puts several stale entries in /etc/resolv.conf. The problem is that some of these DHCP agents have been unscheduled, so those DNS servers don't actually exist. In addition, the VMs do not query anything beyond the first 3 entries.

Here is resolv.conf on a VM:

[root@arjunpmk-master ~]# vim /etc/resolv.conf

# Generated by NetworkManager
search mpt1.pf9.io
nameserver 10.128.144.16
nameserver 10.128.144.23
nameserver 10.128.144.15
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.128.144.7
nameserver 10.128.144.4
nameserver 10.128.144.8
nameserver 10.128.144.9
nameserver 10.128.144.17
nameserver 10.128.144.12
nameserver 10.128.144.45
nameserver 10.128.144.46
nameserver 10.128.144.51
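As a rough illustration of the 3-nameserver limit mentioned in the file's own comment, the sketch below splits the entries into those the resolver would use and those it would ignore. The limit of 3 is taken from that comment; split_nameservers is a hypothetical helper, not part of any library.

```python
MAXNS = 3  # limit referred to by the NetworkManager note above


def split_nameservers(resolv_conf_text, limit=MAXNS):
    """Return (used, ignored) nameserver lists in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with '#', so they never match here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers[:limit], servers[limit:]


RESOLV_CONF = """\
# Generated by NetworkManager
search mpt1.pf9.io
nameserver 10.128.144.16
nameserver 10.128.144.23
nameserver 10.128.144.15
nameserver 10.128.144.7
nameserver 10.128.144.4
nameserver 10.128.144.8
nameserver 10.128.144.9
nameserver 10.128.144.17
nameserver 10.128.144.12
nameserver 10.128.144.45
nameserver 10.128.144.46
nameserver 10.128.144.51
"""

used, ignored = split_nameservers(RESOLV_CONF)
```

With the file above, used holds 10.128.144.16, 10.128.144.23 and 10.128.144.15, and the remaining 9 addresses end up in ignored.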

Here you can see all the DHCP ports for the network of this VM:

Looking up the first DNS server from the VM's resolv.conf (10.128.144.16) shows that its status is ACTIVE, but it's actually a reserved port. The same applies to the 2nd nameserver entry. Luckily the 3rd entry is valid, but every DNS lookup still takes 10 seconds because the first two queries have to time out first. VMs on other networks aren't so lucky: there, all 3 nameservers are reserved ports.
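The 10 second figure is consistent with a simple model in which servers are tried in file order and each dead server costs one full timeout. The sketch below assumes a 5 second per-server timeout (the glibc default; the report itself does not state the value) and no rotation of servers. lookup_delay is a hypothetical helper.

```python
def lookup_delay(servers, reachable, timeout=5):
    """Seconds spent waiting before the first reachable server is tried.

    Assumes servers are tried in order and every unreachable server
    consumes the full timeout. Returns None if no server in the list
    is reachable within one pass.
    """
    delay = 0
    for server in servers:
        if server in reachable:
            return delay
        delay += timeout
    return None


servers = ["10.128.144.16", "10.128.144.23", "10.128.144.15"]

# This VM: only the third entry is a live DHCP agent.
delay_this_vm = lookup_delay(servers, reachable={"10.128.144.15"})

# A VM whose three entries are all reserved ports.
delay_unlucky_vm = lookup_delay(servers, reachable=set())
```

Under these assumptions delay_this_vm is 10, matching the observed lookup time, and delay_unlucky_vm is None because no listed server ever answers.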

Expectation: Only DHCP ports that are actually scheduled (not reserved) should be advertised as DNS nameservers. I don't know whether that means marking the port as DOWN or deleting the port when it is unscheduled.

Maybe the status also needs to be updated here? https://github.com/openstack/neutron/blob/master/neutron/db/agentschedulers_db.py#L417

Tags: l3-ipam-dhcp timeout-abandon

Arjun Baindur (abaindur) wrote on 2019-11-13:

I'm actually not sure whether the port status (ACTIVE/DOWN) even matters. In my case the VM has nameserver 10.128.144.23 as its 2nd entry, and that port is in status DOWN.

I think the problem is on the agent side. It appends every port to the list of servers advertised in the dns-server DHCP option, based only on whether the device_owner field is "network:dhcp". It doesn't take the reserved device_id or the port status into account:

https://github.com/openstack/neutron/blob/stable/rocky/neutron/agent/linux/dhcp.py#L1089
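One possible shape for an agent-side filter is sketched below. It is not the code from dhcp.py and not a proposed patch: dns_servers_from_ports, the skip_reserved flag and the non-reserved device_id are hypothetical, and ports are plain dicts. The addresses and states of the first three ports follow what was described above.

```python
DEVICE_OWNER_DHCP = "network:dhcp"
RESERVED_DHCP_DEVICE_ID = "reserved_dhcp_port"


def dns_servers_from_ports(ports, skip_reserved=True):
    """Collect the IPs of DHCP ports to advertise as DNS servers.

    With skip_reserved=False this mimics the behaviour described in
    this report: every port owned by network:dhcp is advertised.
    """
    servers = []
    for port in ports:
        if port.get("device_owner") != DEVICE_OWNER_DHCP:
            continue
        if skip_reserved and port.get("device_id") == RESERVED_DHCP_DEVICE_ID:
            continue
        for fixed_ip in port.get("fixed_ips", []):
            servers.append(fixed_ip["ip_address"])
    return servers


ports = [
    {"device_owner": "network:dhcp", "device_id": "reserved_dhcp_port",
     "status": "ACTIVE", "fixed_ips": [{"ip_address": "10.128.144.16"}]},
    {"device_owner": "network:dhcp", "device_id": "reserved_dhcp_port",
     "status": "DOWN", "fixed_ips": [{"ip_address": "10.128.144.23"}]},
    # Placeholder device_id standing in for a port owned by a live agent.
    {"device_owner": "network:dhcp", "device_id": "dhcp-agent-host-a",
     "status": "ACTIVE", "fixed_ips": [{"ip_address": "10.128.144.15"}]},
    # Illustrative non-DHCP port; never advertised in either mode.
    {"device_owner": "compute:nova", "device_id": "vm-1",
     "status": "ACTIVE", "fixed_ips": [{"ip_address": "10.128.144.30"}]},
]

advertised_now = dns_servers_from_ports(ports, skip_reserved=False)
advertised_filtered = dns_servers_from_ports(ports)
```

Filtering on device_id rather than status is deliberate in this sketch, since the first reserved port is ACTIVE and the second is DOWN, so status alone does not identify stale ports.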

Dr. Jens Harbott (j-harbott) on 2019-11-14

Changed in neutron:
status:	New → Confirmed

Swaminathan Vasudevan (swaminathan-vasudevan) on 2019-11-15

Changed in neutron:
importance:	Undecided → Medium

Mithil Arun (arun-mithil) on 2019-11-18

Changed in neutron:
assignee:	nobody → Mithil Arun (arun-mithil)

OpenStack Infra (hudson-openstack) wrote on 2019-11-19: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/694859

Changed in neutron:
status:	Confirmed → In Progress

Slawek Kaplonski (slaweq) on 2019-11-19

tags:	added: l3-ipam-dhcp

Slawek Kaplonski (slaweq) wrote on 2020-09-25: auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee:	Mithil Arun (arun-mithil) → nobody
status:	In Progress → New
tags:	added: timeout-abandon

OpenStack Infra (hudson-openstack) wrote on 2020-09-25: Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/694859
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Dr. Jens Harbott (j-harbott) on 2020-09-28

Changed in neutron:
status:	New → Confirmed
