[segments] dnsmasq can't delete lease for instance due to mismatch between client ip and local addr

Bug #1906406 reported by James Denton
Affects | Status    | Importance | Assigned to | Milestone
neutron | Confirmed | Medium     | Unassigned  |

Bug Description

Issue:

The Neutron DHCP agent bootstraps the DHCP leases file for a network using all associated subnets[1]. In a multi-segment environment, however, a DHCP agent can only service a single segment/subnet of a given network.

The DHCP namespace, then, is configured with an interface containing a single IP address for the respective segment/subnet it's servicing. When a VM from the same network but different segment/subnet is deleted, the DHCP release packet that would be issued by dhcp_release isn't sent due to a mismatch between client IP and local addr.
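
To illustrate the mismatch described above, here is a minimal sketch in Python. It is not the code of dhcp_release or of Neutron; the function name `has_matching_local_addr` is hypothetical, and the addresses are taken from the lab-compute02 namespace output shown later in this report. It only checks whether a client IP falls inside any subnet configured on the local interface, which is the condition the report says is not met for a VM on another segment.

```python
import ipaddress


def has_matching_local_addr(client_ip, local_ifaces):
    """Return True if client_ip falls inside a subnet configured locally.

    local_ifaces is a list of "address/prefix" strings, as printed by `ip addr`.
    """
    client = ipaddress.ip_address(client_ip)
    return any(
        client in ipaddress.ip_interface(iface).network
        for iface in local_ifaces
    )


# Addresses configured in the qdhcp namespace on lab-compute02 (segment 2).
compute02_addrs = ["10.206.0.2/24", "169.254.169.254/16"]

print(has_matching_local_addr("10.206.0.53", compute02_addrs))  # True
print(has_matching_local_addr("10.106.0.98", compute02_addrs))  # False
```

The second call corresponds to vm-seg1: its address belongs to segment 1, so the namespace on lab-compute02 has no local address in the same subnet.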

Brian Haley patched dhcp_release.c recently to fix a similar issue here:

http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=d9f882bea2806799bf3d1f73937f5e72d0bfc650;hp=fef2f1c75eba56b7355cbe729e4362474d558aa4;ds=sidebyside

We can probably update dnsmasq-utils in the short term, but maybe making the DHCP agent segment aware is a better long-term solution?

Here are the steps to reproduce:

-=-=-=-=-

Network: rpn_multisegment

Segment 1:
VLAN 106 10.106.0.0/24
Provider Mapping: physnet1:bond1

Segment 2:
VLAN 206 10.206.0.0/24
Provider Mapping: physnet2:bond1

Two VMs:

🌕OpenStack Lab % openstack server list
+--------------------------------------+---------------------+---------+-----------------------------------------------+------------------------------+--------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------------------+---------+-----------------------------------------------+------------------------------+--------------------+
| 40f94b68-7e38-45b6-855d-792399c2a9ff | vm-seg2 | ACTIVE | rpn_multisegment=10.206.0.53 | bionic-osa-master | osa-dev-8-8-60 |
| 34f8ff53-e505-4267-a13a-b881dfcec240 | vm-seg1 | ACTIVE | rpn_multisegment=10.106.0.98 | bionic-osa-master | osa-dev-8-8-60 |
+--------------------------------------+---------------------+---------+-----------------------------------------------+------------------------------+--------------------+

On compute01, we can see the host file populated with entries for each subnet associated with the network:

root@lab-compute01:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/host
fa:16:3e:07:f7:af,host-10-206-0-2.openstacklocal,10.206.0.2
fa:16:3e:2c:da:6d,host-10-106-0-2.openstacklocal,10.106.0.2
fa:16:3e:46:7b:d1,host-10-106-0-98.openstacklocal,10.106.0.98
fa:16:3e:ce:b1:b5,host-10-206-0-53.openstacklocal,10.206.0.53

Same on compute02:

root@lab-compute02:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/host
fa:16:3e:07:f7:af,host-10-206-0-2.openstacklocal,10.206.0.2
fa:16:3e:2c:da:6d,host-10-106-0-2.openstacklocal,10.106.0.2
fa:16:3e:46:7b:d1,host-10-106-0-98.openstacklocal,10.106.0.98
fa:16:3e:ce:b1:b5,host-10-206-0-53.openstacklocal,10.206.0.53

The leases file, however, contains only those hosts that have obtained leases (expected):

root@lab-compute01:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606916842 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 ff:b5:5e:67:ff:00:02:00:00:ab:11:9e:a5:86:fd:ae:2f:49:ad
1606916738 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606916738 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

root@lab-compute02:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606916917 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 ff:b5:5e:67:ff:00:02:00:00:ab:11:9e:a5:86:fd:ae:2f:49:ad
1606916626 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

Everything looks OK so far.

When restarting the neutron-dhcp-agent, however, the leases file is bootstrapped and contains entries for all subnets associated with the network:

root@lab-compute01:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606917246 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 *
1606917246 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606917246 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 *
1606917246 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

root@lab-compute02:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606917254 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 *
1606917254 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606917254 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 *
1606917254 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

This configuration becomes a problem when a VM is deleted and dhcp_release is executed, as the namespaces on each host only have an IP from their respective segment and will not be able to delete a lease for what is essentially a non-connected subnet:

root@lab-compute01:~# ip netns exec qdhcp-0e4fa560-1483-4ac5-be44-0542503f1e5a ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ns-5ccc6426-59@if102: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:2c:da:6d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.169.254/16 brd 169.254.255.255 scope global ns-5ccc6426-59
       valid_lft forever preferred_lft forever
    inet 10.106.0.2/24 brd 10.106.0.255 scope global ns-5ccc6426-59
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe2c:da6d/64 scope link
       valid_lft forever preferred_lft forever

root@lab-compute02:~# ip netns exec qdhcp-0e4fa560-1483-4ac5-be44-0542503f1e5a ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ns-0c51acd3-60@if85: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:07:f7:af brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.206.0.2/24 brd 10.206.0.255 scope global ns-0c51acd3-60
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global ns-0c51acd3-60
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe07:f7af/64 scope link
       valid_lft forever preferred_lft forever

Example:

🌕OpenStack Lab % openstack server delete vm-seg1

lab-compute01:

Dec 01 13:58:12 lab-compute01 dnsmasq-dhcp[56028]: DHCPRELEASE(ns-5ccc6426-59) 10.106.0.98 fa:16:3e:46:7b:d1
Dec 01 13:58:13 lab-compute01 dnsmasq[56028]: read /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/addn_hosts - 3 addresses
Dec 01 13:58:13 lab-compute01 dnsmasq-dhcp[56028]: read /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/host
Dec 01 13:58:13 lab-compute01 dnsmasq-dhcp[56028]: read /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/opts

root@lab-compute01:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606917246 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606917246 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 *
1606917246 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

lab-compute02:

Dec 01 13:58:13 lab-compute02 neutron-dhcp-agent[48564]: 2020-12-01 13:58:13.946 48564 WARNING neutron.agent.linux.dhcp [-] Could not release DHCP leases for these IP addresses after 3 tries: 10.106.0.98
Dec 01 13:58:14 lab-compute02 dnsmasq[589]: read /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/addn_hosts - 3 addresses
Dec 01 13:58:14 lab-compute02 dnsmasq-dhcp[589]: read /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/host
Dec 01 13:58:14 lab-compute02 dnsmasq-dhcp[589]: read /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/opts

root@lab-compute02:~# cat /var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606917254 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 *
1606917254 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606917254 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 *
1606917254 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

As you can see, the lease for 10.106.0.98 was not deleted on compute02, as that segment/subnet is not configured on ns-0c51acd3-60 in the DHCP namespace like it would be in an ordinary provider network.
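
A rough way to spot such entries is to compare each lease against the subnet actually configured in the namespace. The sketch below is only an illustration: the helper name `off_segment_leases` is hypothetical, the lease lines are copied from the lab-compute02 output above, and the field layout (expiry, MAC, IP, hostname, client ID) is assumed from those lines.

```python
import ipaddress

# Leases file contents from lab-compute02 after the agent restart.
LEASES = """\
1606917254 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 *
1606917254 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606917254 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 *
1606917254 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *
"""


def off_segment_leases(leases_text, local_cidr):
    """Return lease IPs that lie outside the locally served subnet."""
    net = ipaddress.ip_network(local_cidr)
    outside = []
    for line in leases_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ip = ipaddress.ip_address(fields[2])
        if ip not in net:
            outside.append(str(ip))
    return outside


print(off_segment_leases(LEASES, "10.206.0.0/24"))
# ['10.106.0.98', '10.106.0.2']
```

Both addresses reported belong to segment 1, which the namespace on lab-compute02 is not connected to.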

[1] https://github.com/openstack/neutron/blob/5529b2f5cc6b451c771bc5134018e9dbd2cb6598/neutron/agent/linux/dhcp.py#L758

Bernard Cafarelli (bcafarel) wrote :

That sounds interesting indeed, maybe RFE-level (as this would be fixed with making the DHCP agent segments-aware)

tags: added: rfe
Slawek Kaplonski (slaweq) wrote :

So is my understanding correct: you propose that a DHCP agent hosting a network should configure in dnsmasq only the IPs from the subnet that belongs to the segment to which its host is connected?

James Denton (james-denton) wrote :

Yes, I think in the case of segments, if a DHCP agent can only service one segment of a network (which could have many segments) it doesn't make sense to populate the dnsmasq host or leases file with the entire network since the other segments represent a different broadcast domain and will be serviced by their respective agent. But the real crux here is the pre-populated leases file.

FWIW, we compiled dhcp_release from dnsmasq 2.82 source and were able to work around this behavior.

tags: removed: rfe
Bernard Cafarelli (bcafarel) wrote :

OK re-reading it all it could then be fixed by only adding segment-relevant leases when agent starts?

Also, to confirm, no other visible effect than log warnings when trying to release addresses from other segments? (and larger leases file than needed)

Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
James Denton (james-denton) wrote :

> OK re-reading it all it could then be fixed by only adding segment-relevant leases when agent starts?

That's what I'm thinking.

> Also, to confirm, no other visible effect than log warnings when trying to release addresses from other segments? (and larger leases file than needed)

Actually, I meant to add that the existing behavior does cause issues (which led us down this path to begin with). In the environment in question, thousands of instances are spun up and down for testing on a regular basis. The list of addresses that can't be released grows very large over time, and will eventually cause delays in the population of the host file when additional instances are created. The workaround until recently was to restart the DHCP agent a few times a day to reset things, but the problem returns after a few deployments.
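
The idea discussed in this thread, only adding segment-relevant leases when the agent starts, could look roughly like the following. This is a sketch under stated assumptions, not Neutron's implementation: the function name `segment_relevant` is hypothetical, the entries are copied from the host file shown in the description, and how the agent would learn which subnet it serves is left open.

```python
import ipaddress

# Entries from the dnsmasq host file in this report (MAC,hostname,IP).
HOST_ENTRIES = [
    "fa:16:3e:07:f7:af,host-10-206-0-2.openstacklocal,10.206.0.2",
    "fa:16:3e:2c:da:6d,host-10-106-0-2.openstacklocal,10.106.0.2",
    "fa:16:3e:46:7b:d1,host-10-106-0-98.openstacklocal,10.106.0.98",
    "fa:16:3e:ce:b1:b5,host-10-206-0-53.openstacklocal,10.206.0.53",
]


def segment_relevant(host_entries, served_cidrs):
    """Keep only entries whose IP is inside a subnet this agent serves."""
    nets = [ipaddress.ip_network(cidr) for cidr in served_cidrs]
    kept = []
    for entry in host_entries:
        _mac, _name, ip = entry.split(",")
        if any(ipaddress.ip_address(ip) in net for net in nets):
            kept.append(entry)
    return kept


# lab-compute01 serves segment 1 (10.106.0.0/24) in the reproduction above.
for entry in segment_relevant(HOST_ENTRIES, ["10.106.0.0/24"]):
    print(entry)
```

With the filter applied on lab-compute01, only the two 10.106.0.x entries would be used to bootstrap the leases file, so no lease would exist there for an address the namespace cannot release.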
