nova-network: instance doesn't get an IP address via DHCP

Bug #1457404 reported by Olesia Tsvigun
This bug affects 4 people
Affects              Status     Importance  Assigned to  Milestone
Mirantis OpenStack   Won't Fix  Medium      MOS Nova
  6.0.x              Won't Fix  Medium      MOS Nova
  6.1.x              Won't Fix  Medium      MOS Nova
  7.0.x              Won't Fix  Medium      MOS Nova
  8.0.x              Won't Fix  Medium      MOS Nova

Bug Description

This bug is related to https://bugs.launchpad.net/mos/6.1.x/+bug/1391010

The OSTF test failed because the instance doesn't get an address via DHCP:

<28>May 22 15:25:10 node-1 dnsmasq-dhcp[22709]: not using configured address 10.0.0.2 because it is leased to fa:16:3e:81:f2:f4
<28>May 22 15:26:10 node-1 dnsmasq-dhcp[22709]: not using configured address 10.0.0.2 because it is leased to fa:16:3e:81:f2:f4
<28>May 22 15:27:10 node-1 dnsmasq-dhcp[22709]: not using configured address 10.0.0.2 because it is leased to fa:16:3e:81:f2:f4

nova-network runs with multi_host=False by default in vCenter environments.

Fuel ISO version: 443

Looks like there is a race between sending SIGHUP to dnsmasq (so that it reloads the hosts file) and sending DHCPRELEASE (so that dnsmasq actually updates its in-memory lease DB), see http://paste.openstack.org/show/234337/: the config was reloaded successfully upon receiving SIGHUP, but the DHCPRELEASE for fa:16:3e:58:cf:7f was never received, so dnsmasq won't offer 10.0.0.2 to anyone else (regardless of what is stated in the hosts file, as this lease has already been issued and hasn't expired yet).

The dhcp_release util sends a UDP packet to dnsmasq to inform it that a specific lease has expired (i.e. it pretends to be the client). It seems that the dnsmasq daemon might be in some transitional state (while handling a SIGHUP) in which UDP packets sent by dhcp_release are ignored.
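
For illustration, here is a minimal Python sketch of the two operations that appear to race. Function names and arguments are simplified assumptions for this write-up, not the actual nova-network code (see nova/network/linux_net.py for the real implementation):

    import os
    import signal
    import subprocess

    def update_dhcp(pid, hosts_path, entries):
        # Rewrite the dnsmasq hosts file with the current fixed-IP allocations
        with open(hosts_path, 'w') as f:
            f.write('\n'.join(entries) + '\n')
        # Ask dnsmasq to re-read the hosts file
        os.kill(pid, signal.SIGHUP)

    def release_lease(dev, ip, mac):
        # dhcp_release ships with dnsmasq; it forges a DHCPRELEASE UDP packet
        # for the given IP/MAC, pretending to be the client
        subprocess.check_call(['dhcp_release', dev, ip, mac])

    # Suspected race: if release_lease() fires while dnsmasq is still handling
    # the SIGHUP from update_dhcp(), the forged DHCPRELEASE may be dropped and
    # the stale lease keeps the IP unusable until it expires.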

==========================================================================

#1 case
        Scenario:
            1. Create cluster with vCenter support
            2. Add 3 nodes with controller roles
            3. Add 2 nodes with compute roles
            4. Configure vCenter as backend for Glance
            5. Deploy the cluster
            6. Run network verification
            7. Run OSTF

Actual result:
Check network connectivity from instance via floating IP (failure):
Instance is not reachable by IP. Please refer to OpenStack logs for more details.
vCenter: Check network connectivity from instance without floating IP (failure):
Instance is not reachable by IP. Please refer to OpenStack logs for more details.

Expected result:
All OSTF test cases should pass.

#2 case
        Scenario:
            1. Create cluster with vCenter support
            2. Add 3 nodes with controller role
            3. Add a node with compute role
            4. Add a node with Cinder VMDK role
            5. Set Nova-Network VlanManager as a network backend
            6. Configure vCenter datastore as backend for glance
            7. Deploy the cluster
            8. Run OSTF

Actual result:
Check network connectivity from instance via floating IP (failure):
Instance is not reachable by IP. Please refer to OpenStack logs for more details.
vCenter: Check network connectivity from instance without floating IP (failure):
Instance is not reachable by IP. Please refer to OpenStack logs for more details.

Expected result:
All OSTF test cases should pass.

#3 case
        Scenario:
            1. Create cluster with vCenter support
            2. Add 3 nodes with Controller roles
            3. Add 2 nodes with compute role
            4. Deploy the cluster
            5. Run network verification
            6. Run OSTF

Actual result:
Check network connectivity from instance via floating IP (failure):
Instance is not reachable by IP. Please refer to OpenStack logs for more details.
vCenter: Check network connectivity from instance without floating IP (failure):
Instance is not reachable by IP. Please refer to OpenStack logs for more details.

Expected result:
All OSTF test cases should pass.

Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Partner Integration Team (fuel-partner)
description: updated
Changed in fuel:
status: New → Triaged
Andrian Noga (anoga)
Changed in fuel:
milestone: none → 6.1
Revision history for this message
Olesia Tsvigun (otsvigun) wrote :

Logs are added below.

Changed in fuel:
status: Triaged → In Progress
assignee: Fuel Partner Integration Team (fuel-partner) → Alexander Arzhanov (aarzhanov)
Revision history for this message
Alexander Arzhanov (aarzhanov) wrote :

Some details:
vip__public (ocf::fuel:ns_IPaddr2): Stopped

If I try to start vip__public with debug:
[root@node-1 ~]# pcs resource debug-start vip__public

Operation start for vip__public (ocf:fuel:ns_IPaddr2) returned 0
 > stderr: INFO: net.ipv4.ip_forward = 1
 > stderr: /usr/lib/ocf/resource.d/fuel/ns_IPaddr2: line 444: ovs-vsctl: command not found
 > stderr: /usr/lib/ocf/resource.d/fuel/ns_IPaddr2: line 506: ovs-vsctl: command not found

Open vSwitch is not installed...

summary: - [OSTF] Tests of check network connectivity failed with error 'Instance
- is not reachable by IP'
+ [OSTF] Check network connectivity from instance via floating IP(failed
+ step 5) and vCenter: Check network connectivity from instance without
+ floating IP(failed step 3)
Revision history for this message
Alexander Arzhanov (aarzhanov) wrote : Re: [OSTF] Check network connectivity from instance via floating IP(failed step 5) and vCenter: Check network connectivity from instance without floating IP(failed step 3)
  • case 1 (62.3 MiB, application/octet-stream)
description: updated
Revision history for this message
Alexander Arzhanov (aarzhanov) wrote :
  • case 2 (32.5 MiB, application/octet-stream)
Revision history for this message
Alexander Arzhanov (aarzhanov) wrote :
  • case 3 (52.2 MiB, application/octet-stream)
Revision history for this message
Alexander Arzhanov (aarzhanov) wrote :

Check network connectivity from instance via floating IP (OSTF logs):
failed to ping the floating IP
http://pastebin.com/YNfsrQR2

vCenter: Check network connectivity from instance without floating IP (OSTF logs):
failed to connect to the first online controller (admin PXE network)
http://pastebin.com/NJbhQsB4

Andrey Maximov (maximov)
tags: added: module-ostf
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Guys, what exactly do you want to fix here? I looked at the logs and see that OSTF tries to connect to the floating IP, but there actually was no connectivity:
fuel_health.nmanager: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 396, in retry_command
    result = method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/fuel_health/common/ssh.py", line 166, in exec_command
    strerror=''.join(err_data).join(out_data))
SSHExecCommandFailed: Command 'ping -q -c1 -w10 10.109.6.128', exit status: 1, Error:
PING 10.109.6.128 (10.109.6.128) 56(84) bytes of data.

--- 10.109.6.128 ping statistics ---

10 packets transmitted, 0 received, 100% packet loss, time 9014ms

So it seems you need to check the same case manually; I am pretty sure there are no problems with OSTF (at least the logs say the same).

tags: removed: module-ostf
Revision history for this message
Alexander Arzhanov (aarzhanov) wrote :

There are no problems with OSTF.

summary: - [OSTF] Check network connectivity from instance via floating IP(failed
- step 5) and vCenter: Check network connectivity from instance without
- floating IP(failed step 3)
+ Instance doesn't get an address via DHCP (nova-network) on dualhv
+ (vCenter)
description: updated
Changed in fuel:
assignee: Alexander Arzhanov (aarzhanov) → MOS Nova (mos-nova)
Changed in fuel:
status: In Progress → New
no longer affects: fuel
summary: - Instance doesn't get an address via DHCP (nova-network) on dualhv
- (vCenter)
+ nova-network: instance doesn't get an IP address via DHCP
description: updated
description: updated
description: updated
description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Analysed the logs and updated the bug description.

I honestly don't think it should be treated as High for 6.1, as nova-network is marked as deprecated, and multi_host=False is neither the default nor the recommended setting for a nova-network deployment.

The race condition, however unpleasant it is, can be worked around by putting a sleep between the SIGHUP and the DHCPRELEASE sent to the dnsmasq daemon. But we'll obviously need to find a cleaner solution for 7.0.
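
A rough illustration of that ordering workaround, reusing the hypothetical update_dhcp()/release_lease() helpers sketched in the bug description (not an actual patch):

    import time

    update_dhcp(pid, hosts_path, entries)  # SIGHUP: dnsmasq reloads the hosts file
    time.sleep(1)                          # pause so the SIGHUP handler can finish
    release_lease(dev, ip, mac)            # DHCPRELEASE is now less likely to be dropped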

tags: added: nova
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Roman, as I understand after a conversation with Igor Zinovik, this issue has High priority for the PI team.
Igor will provide more details.

Changed in mos:
importance: Medium → High
Revision history for this message
Igor Zinovik (izinovik) wrote :

This problem cannot be considered Medium.

Yes, nova-network operates in multi-host mode (multi_host=True) if you do not use vCenter with your OpenStack cluster: each compute node runs its own local nova-network service. But if vCenter was enabled in the cluster creation wizard, nova-network operates in single-host mode (multi_host=False) and is backed by Pacemaker.

It is not possible to run nova-network on an ESXi host; that is why we use multi_host=False with vCenter.

I'm raising the severity to High, because this way an OpenStack environment with vCenter enabled becomes inoperable.

Revision history for this message
Andrian Noga (anoga) wrote :

Roman,
this issue has High priority for the PI team. Igor Zinovik and Alexander Arzhanov will provide more details.

Revision history for this message
Alexander Arzhanov (aarzhanov) wrote :

Folks, clusters with vCenter support (6.1) use ONLY nova-network. In version 7.0, we plan to use Neutron.
For clusters with vCenter support this bug is High.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Guys, I'm *not* saying this isn't important. All I'm saying is that the affected configuration is nova-network (which is deprecated in 6.1), multi_host=False (not a recommended setting), and vCenter (a non-default hypervisor with 10x fewer installations than qemu/kvm). And it's all about priorities: if we treat all such bugs as High, we'll never reach HCF.

Having said that, I hope this can be worked around easily. We are going to give it a try today.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Today's update: we haven't managed to reproduce this locally yet. Just looking at the logs, it doesn't seem to be VMware-specific, but it *may* have something to do with FlatDHCPManager vs VlanManager differences. It's easy to trigger this synthetically, though, by `forgetting` to fire DHCPRELEASE. But it's still unclear how dnsmasq misses it on a real environment (sending SIGHUPs to dnsmasq in a loop didn't make it miss a single DHCPRELEASE on a test env). The workaround would be to send a few DHCPRELEASE packets or to send one after a short sleep. Although I'd like us to get a stable repro before we actually try to do that.

Overall, this condition seems to be rather hard to trigger. IMO, it shouldn't be a blocker for us in 6.1. Still, on request from PI team we continue to track this as a high priority bug.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Today's update: we gave it another try on 3 different environments with nova-network/FlatDHCPManager and haven't managed to reproduce this issue.

We asked Alexander A. to try to reproduce it on the original environment. The *only* repro was right after Alexander reverted the snapshot. After that, we were unable to reproduce this issue again by booting/deleting multiple instances concurrently.

Right now, my understanding is that this is some kind of edge case that can be seen only *right after* a snapshot revert, and thus the PI team sees Jenkins failures periodically.

Whatever it actually is, the issue only affects particular fixed IPs until their leases expire, so its influence on the environment is *very* limited (not even mentioning that we can't reproduce it without snapshotting/reverting an environment).

A workaround on the nova-network side would be ugly and error-prone. I suggest the PI team just decrease the lease expiration time if they see this in their Jenkins jobs again.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We added additional introspection while reproducing the problem on a clean env.

Right after deployment is complete, when running an OSTF test, the very first instance is created. nova-network lazily creates the bridge (br100) and spawns the dnsmasq daemon.

In dnsmasq logs we can see the instance gets an IP address correctly: http://paste.openstack.org/show/239858/
tcpdump logs: http://paste.openstack.org/show/239857/

The problem we can see in the logs is that the DHCPRELEASE packet sent by nova-network on behalf of the instance is missing, which means dnsmasq still thinks the lease is in use when it should have expired. The next booted instance won't get an IP address if nova-network allocates it the same IP address.

The curious thing is that, according to the tcpdump logs, the DHCPRELEASE has actually been sent correctly (the last one sent from 10.0.0.1, with a length of 548 bytes). And tracing of dnsmasq system calls shows that dnsmasq has seen the message but for some reason ignored it - http://paste.openstack.org/show/239864/

The subsequent DHCPRELEASE packets are handled correctly. It looks like we ran into some edge case with a newly spawned dnsmasq daemon. We can't reproduce the problem on an existing environment after that.
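
For reference, introspection of this kind can be done with standard tools; the invocations below are illustrative assumptions, not the exact commands used on this env:

    tcpdump -i br100 -n port 67 or port 68          # capture DHCP traffic on the bridge
    strace -f -p $(pidof dnsmasq) -e trace=network  # watch dnsmasq receive (and ignore) the packet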

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Summary:

1. This seems to be a dnsmasq issue, as nova-network correctly sends DHCPRELEASE and it's delivered to the dnsmasq daemon, but for some reason it is ignored.

2. It's very hard to reproduce and can only be seen on a `fresh` env (or right after a revert of a snapshot of a `fresh` env).

3. It affects only particular IP addresses - all other fixed IPs work correctly.

4. Affected IPs will be available again once DHCP leases expire.

5. We (mos-nova) or mos-linux can continue to investigate the dnsmasq issue further, but it shouldn't be treated as High, IMO.

6. As a workaround, the PI team can decrease the DHCP lease expiration timeout to something like 5-10 minutes.

Revision history for this message
Ruslan Khozinov (rkhozinov) wrote :

Hi, Roman.

Please explain how to change the DHCP lease time.

Or where can I find the docs for changing this option?

Revision history for this message
Ruslan Khozinov (rkhozinov) wrote :

I've found a description of dnsmasq in the Mirantis Fuel docs.

In the doc the author points to the file /etc/cobbler/dnsmasq.template

As I understood, I need to change [<leasetime>] to 5-10 min and redeploy the environment. Am I right?

dhcp-range=<name>,<start-IP-addr>,<end-IP-addr>,<netmask>,[<leasetime>]
dhcp-option=net:<name>,option:router,<IP-addr-of-gateway>
dhcp-boot=net:<name>,pxelinux.0,boothost,<Fuel-Master-IP-addr>

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

@Ruslan, I meant the dhcp_lease_time option of nova-network (https://github.com/openstack/nova/blob/master/nova/network/linux_net.py#L63-L65), which must be set in nova.conf on all nodes running nova-network
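
For example, a minimal nova.conf override (600 seconds shown here; the default is 86400, i.e. 24 hours):

    [DEFAULT]
    dhcp_lease_time = 600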

Changed in mos:
status: In Progress → Confirmed
Revision history for this message
Ruslan Khozinov (rkhozinov) wrote :

@Roman, I've tried to set 5 min manually on the controller nodes (in /etc/nova.conf) and by changing the puppet manifests on the master node (...modules/nova/quotas.pp).

But I did this on the 477 build and instances could not get IP addresses. Then I deployed an env with the default (86400) value and got the same result.

I'm planning to apply your workaround on the 479 build and will provide results.

Revision history for this message
Igor Marnat (imarnat) wrote :

@Ruslan Did you try it? Any update? Does the workaround provided by Roman help?

Revision history for this message
Olesia Tsvigun (otsvigun) wrote :

We have tested the workaround with an expiration timeout of 10 minutes and the issue reproduced again. It affects more tests.

I've attached logs below.

Revision history for this message
Olesia Tsvigun (otsvigun) wrote :
Revision history for this message
Olesia Tsvigun (otsvigun) wrote :
Revision history for this message
Igor Marnat (imarnat) wrote :

@Olesia @Ruslan

Folks, what Roman suggested was:
1. Decrease lease time to 10 mins
2. Wait more than 10 mins, then run the test

Everything should work fine. For CI purposes you can even set it to less than 10 mins, just to make it faster.

Did you try to wait a bit longer than the lease expiration time before the test?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

The workaround we proposed is obviously not a silver bullet: it only returns affected IPs to normal use after $lease seconds, so by decreasing the lease time we decrease the time those IP addresses can't be used for new instances.

I understand your frustration, as this affects your CI. For the purpose of testing we could possibly decrease the lease timeout even further, to something like 30-60 s. For production, 10 min or the default 24 h should be OK.

My point is still the same: from what I see, this looks very much like a nasty race condition in dnsmasq which can only be reproduced once per deployment, right after it completes (we can't reproduce the issue on the same env after that, nor can we do so on any other deployed env). For some reason dnsmasq ignores the first DHCPRELEASE packet it receives (strace'ing the dnsmasq daemon showed it actually received the UDP packet but did nothing). So this has little impact on production envs.

I'm now wondering if this has something to do with virtio, as we've already seen cases where dnsmasq ignored packets with bad checksums (that one can only be reproduced with virtio; e1000 works just fine).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We (mos-nova) or mos-linux can continue to work on this and try to debug dnsmasq itself by means of gdb, but it will take some time, as this is hard to reproduce (basically, it requires a redeploy each time: reverting a snapshot after failed tests is not enough - we need to debug the root cause, dnsmasq ignoring the DHCPRELEASE packet, not the implication, dnsmasq not giving leases to new instances).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Could you guys try to restart the OSTF after $lease seconds? It should pass.

Revision history for this message
Olesia Tsvigun (otsvigun) wrote :

@Roman, I will retest it again with a restart of the OSTF and add results ASAP.

Revision history for this message
Olesia Tsvigun (otsvigun) wrote :

I retested with a timeout of 600 sec after the first OSTF run. The issue was reproduced again. Reproducibility of the issue is 20%. Logs are attached below.

Revision history for this message
Olesia Tsvigun (otsvigun) wrote :
Revision history for this message
Olesia Tsvigun (otsvigun) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Thanks, Olesia!

Unfortunately, something is wrong with your deployment scripts: I've checked the logs and the DHCP lease time is still 86400 seconds, not 600.

Revision history for this message
Olesia Tsvigun (otsvigun) wrote :

We have tested the workaround with dhcp_lease_time=600 on Fuel ISO #521.
In the first OSTF run the 'Check network connectivity from instance*' tests failed. After $lease (600) seconds all OSTF test cases passed.
Decreasing Importance to Medium.

tags: added: release-notes
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Roman, Olesya, how will it affect users?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

@Nastya,

1) we can't reproduce this on non-PI environments (neither locally nor on BVT/SWARM)
2) PI CI can only reproduce this for 1-2 IP addresses on new deployments - they can't get new IP addresses into this weird state after that
3) those IP addresses will only be affected for the duration of a DHCP lease timeout (by default, 86400 s) or until dnsmasq/nova-network is restarted - whichever happens earlier

I'd say, the impact is fairly small.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Roman,

I saw a report in a Neutron bug of a similar issue: 'It seems that dnsmasq just isn't getting some DHCPRELEASE packets that dhcp_release is putting on the "wire"'
https://bugs.launchpad.net/neutron/+bug/1271344

And one of the workarounds there was bumping up to dnsmasq 2.66 - there's a pointer to a patch for dnsmasq as well. Not sure if this helps us. FYI.

Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

I tried to launch/delete 100 instances in the nova availability zone and didn't observe the bug.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Thanks, Davanum!

Looks interesting, although we use dnsmasq 2.68 (the latest from Trusty) :(

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Waiting for feedback from Ilya on whether he can reproduce this again.

Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

As I said, I didn't observe the issue while adding/deleting a large number of instances. So, taking into account the reproducibility of the issue, it is necessary to add appropriate notes to the documentation (https://review.openstack.org/#/c/182996/) and a workaround for automation (https://bugs.launchpad.net/fuel/+bug/1462304).

Changed in mos:
status: Confirmed → Won't Fix
Changed in mos:
importance: High → Medium
assignee: Ilya Bumarskov (ibumarskov) → MOS Nova (mos-nova)
tags: added: release-notes-done
removed: release-notes
tags: added: customer-found
tags: removed: release-notes-done
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 7.0, moved to 8.0

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Per https://bugs.launchpad.net/fuel/+bug/1528613 and the fact that we've never seen it reproduced with Neutron, I suggest we leave it as Won't Fix for 8.0.

tags: added: wontfix-workaround