Two identical IPs on different nodes during discovery

Bug #1418921 reported by Oleksandr Liemieshko
This bug affects 12 people
Affects            | Status        | Importance | Assigned to              | Milestone
-------------------|---------------|------------|--------------------------|----------
Fuel for OpenStack | Fix Committed | High       | Ivan Kliuk               |
5.1.x              | Fix Committed | Medium     | Fuel Python (Deprecated) |
6.0.x              | Won't Fix     | High       | MOS Maintenance          |

Bug Description

Two identical IPs on different nodes during discovery

[root@fuel ~]# fuel nodes
id | status | name | cluster | ip | mac | roles | pending_roles | online
---|----------|------------------|---------|-------------|-------------------|-------|---------------|-------
13 | discover | Untitled (c4:93) | None | 10.20.0.128 | 08:00:27:bb:c4:93 | | | True
12 | discover | Untitled (ec:d5) | None | 10.20.0.129 | 08:00:27:3c:ec:d5 | | | False
10 | discover | Untitled (23:c5) | None | 10.20.0.7 | 12:36:56:17:46:49 | | | True
11 | discover | Untitled (9d:18) | None | 10.20.0.128 | 08:00:27:3b:9d:18 | | | False
9 | discover | Untitled (6b:48) | None | 10.20.0.6 | 0e:b5:73:14:5c:43 | | | True

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "48"
  build_id: "2014-12-03_01-07-36"
  astute_sha: "ef8aa0fd0e3ce20709612906f1f0551b5682a6ce"
  fuellib_sha: "a3043477337b4a0a8fd166dc83d6cd5d504f5da8"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "500e36d08a45dbb389bf2bd97673d9bff48ee84d"
  fuelmain_sha: "7626c5aeedcde77ad22fc081c25768944697d404"

Diagnostic snapshot
https://drive.google.com/a/mirantis.com/file/d/0B-l2g_sTQureT1UzcG5GOVhxdWs/view?usp=sharing

Revision history for this message
Dima Shulyak (dshulyak) wrote :

One of them is offline.

11 | discover | Untitled (9d:18) | None | 10.20.0.128 | 08:00:27:3b:9d:18 | | | False

Once it is back online, its IP will be updated.

Changed in fuel:
status: New → Invalid
Revision history for this message
Oleksandr Liemieshko (oliemieshko) wrote :

Here's what I got

[root@fuel nailgun]# fuel nodes
id | status | name | cluster | ip | mac | roles | pending_roles | online
---|--------------|------------------|---------|-------------|-------------------|-------------------|---------------|-------
9 | provisioned | Untitled (6b:48) | 4 | 10.20.0.129 | 08:00:27:cf:6b:48 | controller | | True
13 | provisioning | Untitled (c4:93) | 4 | 10.20.0.128 | 08:00:27:bb:c4:93 | ceph-osd, compute | | True
12 | provisioned | Untitled (ec:d5) | 4 | 10.20.0.6 | 08:00:27:3c:ec:d5 | ceph-osd, compute | | True
10 | provisioned | Untitled (23:c5) | 4 | 10.20.0.128 | 08:00:27:65:23:c5 | controller | | True
11 | provisioned | Untitled (9d:18) | 4 | 10.20.0.5 | 08:00:27:3b:9d:18 | controller | | True

Changed in fuel:
status: Invalid → New
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Miroslav Anashkin (manashkin) wrote :

We also have 2 similar reports from 5.0-5.1 customers.
Root cause is still unknown; the issue disappears after a Nailgun DB cleanup.

tags: added: customer-found
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please provide steps to reproduce.

Changed in fuel:
status: New → Incomplete
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Since the behavior changes after a Nailgun DB cleanup, I'm assigning this issue back to the Fuel Python team.

Revision history for this message
Oleksandr Liemieshko (oliemieshko) wrote :

There are no specific steps to reproduce it. It was a simple deployment, but with the result shown above.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 5.1.2 → 6.1
Revision history for this message
Evgeny Kozhemyakin (ekozhemyakin) wrote :

It may not be the same case: I got duplicate IPs during provisioning (not after bootstrap). But as the bug is difficult to reproduce, I'm attaching the snapshot here. Hope it helps.

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

It's not clear whether this is one issue or two. Let's check if we can find a root cause.

Changed in fuel:
importance: Undecided → Medium
status: Incomplete → Confirmed
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Guys, can you clarify: is this just an issue of outdated information in the Nailgun database, or is it a real IP address conflict?

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Let me provide the node state from the last snapshot:

id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|----------|------------------|---------|------------|-------------------|---------------------------|---------------|--------|---------
10 | discover | Untitled (bc:10) | None | 10.20.0.9 | 52:54:d0:15:bc:10 | | | True | None
8 | error | Untitled (1b:e9) | 7 | 10.20.0.6 | fe:23:39:ec:9a:4d | controller | | True | 7
9 | discover | Untitled (c7:5e) | None | 10.20.0.11 | 52:54:47:7f:c7:5e | | | True | None
7 | error | Untitled (f0:fb) | 7 | 10.20.0.9 | 5e:6a:b2:5c:03:46 | ceph-osd, cinder, compute | | True | 7

It is clear that Nailgun assigned a different IP for each created Cobbler system (.0.8 and .0.10); this is correct and expected behavior.
But it seems the nodes failed to provision and pick up these new IPs, which is why the DB state is stuck at .0.6 and .0.9.
Unfortunately there are no logs at all apart from the DB and Cobbler RPC info, so it is still unclear why provisioning failed.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Alexander Lemeshko, here is the database state from the first snapshot:

id | status | name | cluster | ip | mac | roles | pending_roles | online
---|--------|------------------|---------|-----------|-------------------|-------------------|---------------|-------
10 | ready | Untitled (23:c5) | 4 | 10.20.0.4 | c2:f8:31:57:47:4c | controller | | True
11 | ready | Untitled (9d:18) | 4 | 10.20.0.5 | 5e:df:a0:17:45:4a | controller | | True
13 | error | Untitled (c4:93) | 4 | 10.20.0.7 | f2:cc:ed:ad:34:43 | ceph-osd, compute | | True
12 | ready | Untitled (ec:d5) | 4 | 10.20.0.6 | c2:79:b0:a0:ef:47 | ceph-osd, compute | | True
9 | ready | Untitled (6b:48) | 4 | 10.20.0.3 | 0e:c7:9c:76:45:4a | controller | | True

All IPs are different and were updated after the nodes were provisioned.

Please copy /var/lib/dnsmasq/dnsmasq.leases from the Cobbler container if you suspect IP address conflicts.
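
For reference, a minimal way to pull that file on a containerized 5.x/6.x master, assuming the dockerctl helper is available (the exact container layout may differ between releases):

dockerctl shell cobbler        # open a shell inside the cobbler container on the Fuel master
cat /var/lib/dnsmasq/dnsmasq.leases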

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Serg Lystopad (slystopad) wrote :

I've also experienced this bug. Duplicate addresses were assigned on an empty cloud (discovery step).

[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
auth_required: true
build_id: 2015-02-13_14-40-09
build_number: '19'
feature_groups:
- experimental
fuellib_sha: 45658cb2bc473302fc4034afd510ea33c8e286a9
fuelmain_sha: 7b6086ae71fbd4355417243c61033ee9f1eecc3c
nailgun_sha: 6967b24adc4d74e36d69b59973ff79d6ab2389e5
ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
production: docker
release: 6.0.1
release_versions:
  2014.2-6.0.1:
    VERSION:
      api: '1.0'
      astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
      build_id: 2015-02-13_14-40-09
      build_number: '19'
      feature_groups:
      - experimental
      fuellib_sha: 45658cb2bc473302fc4034afd510ea33c8e286a9
      fuelmain_sha: 7b6086ae71fbd4355417243c61033ee9f1eecc3c
      nailgun_sha: 6967b24adc4d74e36d69b59973ff79d6ab2389e5
      ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
      production: docker
      release: 6.0.1

fuel node output
http://paste.openstack.org/show/189347/

cobbler container output
http://paste.openstack.org/show/189328/

Revision history for this message
Serg Lystopad (slystopad) wrote :

I logged in to the failed node via IPMI and did ifdown/ifup on the Admin PXE interface.
The node obtained a distinct IP:
Mar 5 17:08:31 dnsmasq-dhcp[15657]: DHCPDISCOVER(eth0) 38:ea:a7:11:2d:84
Mar 5 17:08:31 dnsmasq-dhcp[15657]: DHCPOFFER(eth0) 10.20.0.15 38:ea:a7:11:2d:84
Mar 5 17:08:31 dnsmasq-dhcp[15657]: DHCPREQUEST(eth0) 10.20.0.15 38:ea:a7:11:2d:84
Mar 5 17:08:31 dnsmasq-dhcp[15657]: DHCPACK(eth0) 10.20.0.15 38:ea:a7:11:2d:84

and was successfully discovered by the Fuel master.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Sergiy, thanks for the input. AFAIK there was such a problem, and

# For many simultaneous DHCPDISCOVER requests dnsmasq can offer
# the same IP for two different MAC addresses. This option prevents it
# by assigning IPs one by one instead of using a hash algorithm.
dhcp-sequential-ip

was used to prevent it, but apparently it wasn't enough.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Fuel Library Team (fuel-library)
status: Incomplete → Confirmed
Revision history for this message
Serg Lystopad (slystopad) wrote :

I collected tcpdump output during boot-up of the cluster nodes.

Changed in fuel:
importance: Medium → High
Revision history for this message
Aleksandr (avorobiov) wrote :

I have the same issue.

# fuel --fuel-version
api: '1.0'
astute_sha: 16b252d93be6aaa73030b8100cf8c5ca6a970a91
auth_required: true
build_id: 2015-03-26_02-23-45
build_number: '38'
feature_groups:
- experimental
fuellib_sha: a5d85f73a0bd9643234c7212f1a1658c99269b7d
fuelmain_sha: 81d38d6f2903b5a8b4bee79ca45a54b76c1361b8
nailgun_sha: 72241e9b66ba194221594a44526afc9e2da180f7
ostf_sha: a9afb68710d809570460c29d6c3293219d3624d4
production: docker
release: '6.0'
release_versions:
  2014.2-6.0:
    VERSION:
      api: '1.0'
      astute_sha: 16b252d93be6aaa73030b8100cf8c5ca6a970a91
      build_id: 2015-03-26_02-23-45
      build_number: '38'
      feature_groups:
      - experimental
      fuellib_sha: a5d85f73a0bd9643234c7212f1a1658c99269b7d
      fuelmain_sha: 81d38d6f2903b5a8b4bee79ca45a54b76c1361b8
      nailgun_sha: 72241e9b66ba194221594a44526afc9e2da180f7
      ostf_sha: a9afb68710d809570460c29d6c3293219d3624d4
      production: docker
      release: '6.0'

Here is part of the dnsmasq log:

Apr 2 01:57:03 dnsmasq-dhcp[22554]: DHCPOFFER(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 01:57:08 dnsmasq-dhcp[22554]: DHCPREQUEST(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 01:57:08 dnsmasq-dhcp[22554]: DHCPACK(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 01:57:08 dnsmasq-dhcp[22554]: PXE(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c pxelinux.0
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPOFFER(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPREQUEST(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPACK(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 02:05:45 dnsmasq-dhcp[24949]: not using configured address 10.200.8.40 because it is leased to ec:f4:bb:c7:ff:9c
Apr 2 02:05:51 dnsmasq-dhcp[24949]: not using configured address 10.200.8.40 because it is leased to ec:f4:bb:c7:ff:9c
Apr 2 02:06:43 dnsmasq-dhcp[24949]: DHCPOFFER(eth0) 10.200.8.40 ec:f4:bb:ce:89:ac
Apr 2 02:06:43 dnsmasq-dhcp[24949]: DHCPREQUEST(eth0) 10.200.8.40 ec:f4:bb:ce:89:ac
Apr 2 02:06:43 dnsmasq-dhcp[24949]: DHCPACK(eth0) 10.200.8.40 ec:f4:bb:ce:89:ac node-8
Apr 2 02:12:10 dnsmasq-dhcp[25836]: DHCPOFFER(eth0) 10.200.8.40 ec:f4:bb:ce:89:ac
Apr 2 02:12:18 dnsmasq-dhcp[25836]: DHCPREQUEST(eth0) 10.200.8.40 ec:f4:bb:ce:89:ac
Apr 2 02:12:18 dnsmasq-dhcp[25836]: DHCPACK(eth0) 10.200.8.40 ec:f4:bb:ce:89:ac node-8
Apr 2 02:12:18 dnsmasq-dhcp[25836]: PXE(eth0) 10.200.8.40 ec:f4:bb:ce:89:ac pxelinux.0

and part of dnsmasq.conf:

dhcp-host=net:x86_64,ec:f4:bb:c7:ff:9c,node-10.getty.local,10.200.8.43
dhcp-host=net:x86_64,ec:f4:bb:ce:89:ac,node-8.getty.local,10.200.8.40

Revision history for this message
Aleksandr (avorobiov) wrote :

It seems dnsmasq has some kind of priority issue.
First it assigns IP addresses sequentially, one by one, from the pool:

Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPDISCOVER(eth0) ec:f4:bb:ce:89:ac
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPOFFER(eth0) 10.200.8.39 ec:f4:bb:ce:89:ac
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPREQUEST(eth0) 10.200.8.39 ec:f4:bb:ce:89:ac
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPACK(eth0) 10.200.8.39 ec:f4:bb:ce:89:ac
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPDISCOVER(eth0) ec:f4:bb:c7:ff:9c
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPOFFER(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPREQUEST(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c
Apr 2 01:58:57 dnsmasq-dhcp[22554]: DHCPACK(eth0) 10.200.8.40 ec:f4:bb:c7:ff:9c

Then it tries to assign the IP address from the config:

Apr 2 02:05:45 dnsmasq-dhcp[24949]: not using configured address 10.200.8.40 because it is leased to ec:f4:bb:c7:ff:9c
Apr 2 02:05:48 dnsmasq-dhcp[24949]: DHCPDISCOVER(eth0) ec:f4:bb:ce:89:ac
Apr 2 02:05:48 dnsmasq-dhcp[24949]: DHCPOFFER(eth0) 10.200.8.75 ec:f4:bb:ce:89:ac

Revision history for this message
Alex Schultz (alex-schultz) wrote :

So I was looking at this bug, and while looking into the "dhcp-sequential-ip" option for dnsmasq, the documentation seems to indicate that you shouldn't use it because it will cause IPs to change if you let the lease expire. Specifically, the documentation [1] says:

--dhcp-sequential-ip
Dnsmasq is designed to choose IP addresses for DHCP clients using a hash of the client's MAC address. This normally allows a client's address to remain stable long-term, even if the client sometimes allows its DHCP lease to expire. In this default mode IP addresses are distributed pseudo-randomly over the entire available address range. There are sometimes circumstances (typically server deployment) where it is more convenient to have IP addresses allocated sequentially, starting from the lowest available address, and setting this flag enables this mode. Note that in the sequential mode, clients which allow a lease to expire are much more likely to move IP address; for this reason it should not be generally used.

My assumption is that a node fails to check in for a while and then the IP gets assigned to the next device checking in. Then when the first node comes back, it gets a different IP address. Were there any network/power interruptions on these nodes for ~1 hour (which I think is the default lease TTL)? It might be beneficial to remove the dhcp-sequential-ip configuration from dnsmasq, which would reduce the likelihood of this occurring. Alternatively, increasing the lease TTL might also reduce the likelihood of this occurring when a node is off/unavailable for an extended period of time. Can anyone confirm the case where one of these nodes might have been off/not checking in to provisioning for an extended amount of time?

[1] http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
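
For illustration only, here is roughly what such a tweak could look like in a dnsmasq configuration; the range and lease time below are made-up example values for the 10.20.0.0/24 admin network, not the actual Fuel/Cobbler template:

# Example dnsmasq fragment (hypothetical values). A longer lease time reduces
# the window in which an absent node's address can be re-allocated; leaving
# dhcp-sequential-ip out keeps the default hash-based, more stable allocation.
dhcp-range=10.20.0.3,10.20.0.254,255.255.255.0,12h
# dhcp-sequential-ip    # intentionally left disabled in this sketch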

Revision history for this message
Alex Schultz (alex-schultz) wrote :

I was able to replicate by doing the following:

1) start cluster, have all nodes check in.
2) stop two nodes (slave-2, slave-3)
3) remove the two nodes from /var/lib/dnsmasq/dnsmasq.leases within the cobbler container.
4) restart cobbler container
5) start the node who had the higher IP (slave-3)
6) the node will come up with the lower IP address (slave-2)

The output of the fuel node list will show the same IP address for node 3 as for node 2, but node 2 is shown as offline.
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|----------|------------------|---------|-----------|-------------------|-------|---------------|--------|---------
2 | discover | Untitled (f3:09) | None | 10.20.0.4 | 08:00:27:ed:f3:09 | | | False | None
1 | discover | Untitled (9a:74) | None | 10.20.0.3 | 08:00:27:17:9a:74 | | | True | None
3 | discover | Untitled (96:48) | None | 10.20.0.4 | 08:00:27:dd:96:48 | | | True | None

7) If you start node-2, it will come up, get a new IP address, and be reported as online.
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|----------|------------------|---------|-----------|-------------------|-------|---------------|--------|---------
2 | discover | Untitled (f3:09) | None | 10.20.0.5 | 08:00:27:ed:f3:09 | | | True | None
1 | discover | Untitled (9a:74) | None | 10.20.0.3 | 08:00:27:17:9a:74 | | | True | None
3 | discover | Untitled (96:48) | None | 10.20.0.4 | 08:00:27:dd:96:48 | | | True | None

In looking at the original report for this bug, it seems to describe this specific issue. For the nodes listed with the 10.20.0.128 address, one is online (online=True) and one is offline (online=False). From my testing this does not appear to be a problem: when the node that is offline comes back and gets a new IP, the addresses are updated in the system. I would propose that this be set to Invalid as there isn't any actual issue when this happens.
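
A rough shell equivalent of steps 2-4 above, assuming the 6.x master layout where dnsmasq runs inside the cobbler Docker container and the dockerctl helper is available (MACs taken from the node list above; adjust to your environment):

# run on the Fuel master after powering off slave-2 and slave-3
dockerctl shell cobbler
sed -i '/08:00:27:ed:f3:09/d;/08:00:27:dd:96:48/d' /var/lib/dnsmasq/dnsmasq.leases
exit
dockerctl restart cobbler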

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislaw Bogatkin (sbogatkin)
Revision history for this message
Łukasz Oleś (loles) wrote :

Guys, you can't just remove the dhcp-sequential-ip option. It was added for a reason: https://review.openstack.org/#/c/127946/

I see a few solutions now:
- increase the lease TTL to a very big value; it will not fix the problem but will decrease the chance of its occurrence.
- fix the dnsmasq hashing algorithm, so that the bug with the same hash for different MACs is fixed.
- write our own DHCP service which generates IP addresses in a better way.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

If we dig into the bug reporter's snapshot, we'll see:

Feb 3 08:07:16 dnsmasq-dhcp[917]: DHCPDISCOVER(eth0) 08:00:27:3b:9d:18
Feb 3 08:07:16 dnsmasq-dhcp[917]: DHCPOFFER(eth0) 10.20.0.128 08:00:27:3b:9d:18
Feb 3 08:07:16 dnsmasq-dhcp[917]: DHCPREQUEST(eth0) 10.20.0.128 08:00:27:3b:9d:18
Feb 3 08:07:16 dnsmasq-dhcp[917]: DHCPACK(eth0) 10.20.0.128 08:00:27:3b:9d:18

Feb 3 08:07:17 dnsmasq-dhcp[917]: DHCPDISCOVER(eth0) 08:00:27:bb:c4:93
Feb 3 08:07:17 dnsmasq-dhcp[917]: DHCPOFFER(eth0) 10.20.0.130 08:00:27:bb:c4:93
Feb 3 08:07:17 dnsmasq-dhcp[917]: DHCPREQUEST(eth0) 10.20.0.130 08:00:27:bb:c4:93
Feb 3 08:07:17 dnsmasq-dhcp[917]: DHCPACK(eth0) 10.20.0.130 08:00:27:bb:c4:93

and after that:
Feb 5 14:19:59 dnsmasq-dhcp[839]: DHCPDISCOVER(eth0) 08:00:27:bb:c4:93
Feb 5 14:19:59 dnsmasq-dhcp[839]: DHCPOFFER(eth0) 10.20.0.128 08:00:27:bb:c4:93
Feb 5 14:19:59 dnsmasq-dhcp[839]: DHCPREQUEST(eth0) 10.20.0.128 08:00:27:bb:c4:93
Feb 5 14:19:59 dnsmasq-dhcp[839]: DHCPACK(eth0) 10.20.0.128 08:00:27:bb:c4:93

Feb 5 14:20:37 dnsmasq-dhcp[839]: DHCPDISCOVER(eth0) 08:00:27:3b:9d:18
Feb 5 14:20:37 dnsmasq-dhcp[839]: DHCPOFFER(eth0) 10.20.0.130 08:00:27:3b:9d:18
Feb 5 14:20:40 dnsmasq-dhcp[839]: DHCPREQUEST(eth0) 10.20.0.130 08:00:27:3b:9d:18
Feb 5 14:20:40 dnsmasq-dhcp[839]: DHCPACK(eth0) 10.20.0.130 08:00:27:3b:9d:18

So, actually, the nodes got different IPs, but Nailgun doesn't update them properly. Seems like a medium Nailgun bug to me.

BTW, Aleksandr (avorobiov), could you please attach a diagnostic snapshot from your environment? It is hard to understand what happened in your env without at least the full dnsmasq log. And if it shows that there really are races in dnsmasq itself, please file a new bug about it, because this one is pretty surely related to Nailgun, not to dnsmasq.

Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → Fuel Python Team (fuel-python)
importance: High → Medium
no longer affects: fuel/7.0.x
Revision history for this message
Roman Alekseenkov (ralekseenkov) wrote :

Updating priority to High for customer-found issues, so they don't get moved to 7.0.
This one seems like a race condition that needs to be fixed

Nikolay Markov (nmarkov)
no longer affects: fuel/6.1.x
Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

We don't really have any evidence that this bug is confirmed in 6.1. We are mixing problems at the discovery and deployment stages here.
Additionally, there is no real impact: the node will get a new IP once it is back online. The DB just keeps the last known IP.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If there is no real impact, let's lower this to Medium and move it to 7.0.

Changed in fuel:
importance: High → Medium
milestone: 6.1 → 7.0
importance: Medium → High
milestone: 7.0 → 6.1
Revision history for this message
Łukasz Oleś (loles) wrote :

There may be different reasons for it on 5.1 versus 6.0 and higher.

@aliemieshko Can you check what the actual IPs on the nodes were?

Revision history for this message
Oleksandr Liemieshko (oliemieshko) wrote :

@Łukasz Oleś No, I don't have this env anymore. I hit this bug only once and have never seen it again since. Unfortunately, I don't have steps to reproduce it.

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Maybe the mos-linux team could help with the dnsmasq issue. Also, let's not close this bug as Invalid. It looks like this issue is rare, but it appears from time to time in users' deployments, so we should continue the investigation in 7.0.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Alexei Sheplyakov (asheplyakov)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> - fix the dnsmasq hashing algorithm, so that the bug with the same hash for different MACs is fixed

First of all, there is no clear evidence that dnsmasq assigns the same IP to different clients.
Secondly, a good 48-bit -> 8-bit hash function is hardly possible.

Revision history for this message
Łukasz Oleś (loles) wrote :

> First of all, there is no clear evidence that dnsmasq assigns the same IP to different clients.

Of course there is. This is why the dhcp-sequential-ip option was added to Fuel in 6.0.

Just use these MACs:
0c:c4:7a:1d:91:64
0c:c4:7a:1d:93:da

0c:c4:7a:1d:90:fe
0c:c4:7a:1d:92:76

and start all VMs simultaneously. dnsmasq will assign the same IP to both pairs. If you start the VMs one by one it will be OK, because each lease will be ACKed by dnsmasq and the IP will be reserved.
It was debugged in https://bugs.launchpad.net/fuel/+bug/1378000

Revision history for this message
Andrew Woodward (xarses) wrote :

Deployment failed on this env, and after attempting to reset/delete the cluster I found that it had duplicate IPs.

[root@fuel ~]# fuel node
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|--------------|------------------|---------|-----------|-------------------|--------------------|---------------|--------|---------
6 | discover | Untitled (36:5e) | 1 | 10.20.0.9 | 00:0c:29:8d:36:5e | cinder, controller | | True | 1
7 | discover | Untitled (ac:40) | 1 | 10.20.0.7 | 00:0c:29:c0:ac:40 | cinder, controller | | True | 1
4 | provisioning | Untitled (18:3e) | 1 | 10.20.0.8 | 00:0c:29:4d:18:48 | compute | | True | 1
1 | discover | Untitled (f9:fd) | 1 | 10.20.0.8 | 00:0c:29:5e:f9:fd | cinder, controller | | True | 1
5 | discover | Untitled (4b:fb) | 1 | 10.20.0.6 | 00:0c:29:1a:4b:fb | compute | | True | 1

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Andrew Woodward (xarses) wrote :

I spoke with Ryan at length about this; I don't see us having an easy time preventing this either way. However, we must not allow Fuel to do anything with duplicate nodes, but at the same time we must record information about them.

I propose that we continue to allow duplicate IPs on the node entity, and when a duplicate is seen on update, we set both node entities into the error state and generate a notification message. This way the operator will be able to see the two duplicate nodes and their information, and can address the issue. This will additionally prevent the nodes from participating in activities such as deployment, which would otherwise fail because the wrong node is responding to mcollective.
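
A minimal sketch of what that check could look like, not actual Nailgun code; the Node model, statuses and notification call are hypothetical stand-ins for whatever Nailgun uses:

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Node:
    uid: int
    mac: str
    ip: str
    status: str = "discover"

def handle_agent_update(nodes: Dict[int, Node], updated: Node,
                        notifications: List[str]) -> None:
    """Store the reported data, then flag every node sharing the same admin IP."""
    nodes[updated.uid] = updated
    clashing = [n for n in nodes.values()
                if n.ip == updated.ip and n.uid != updated.uid]
    if clashing:
        for node in clashing + [updated]:
            node.status = "error"   # keep the record, but exclude it from deployment
        notifications.append(
            "Duplicate admin IP %s reported by nodes %s; resolve the DHCP conflict"
            % (updated.ip, sorted(n.uid for n in clashing + [updated])))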

Andrew Woodward (xarses)
Changed in fuel:
assignee: Alexei Sheplyakov (asheplyakov) → Fuel Python Team (fuel-python)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Of course there is. This is why the dhcp-sequential-ip option was added to Fuel in 6.0.

I'm unable to reproduce the problem.
I've written a weird DHCP client [0] which asks for an IP (using a fake MAC address)
and releases it after a short (possibly zero) period of time. I've run the thing nightly
and haven't found any duplicate IPs.

[0] https://github.com/asheplyakov/dhcpflooder

Revision history for this message
Łukasz Oleś (loles) wrote :

Two clients need to ask for an IP simultaneously, before either of them has ACKed it. Both will get the same IP, and then there is a race: one will ACK it and the second will fail, or boot with the same IP, I don't remember :/

The problem is in dnsmasq. It doesn't mark an IP as used before it's ACKed, and it sends one IP to both servers.

Revision history for this message
Andrew Woodward (xarses) wrote :

loles, yes. I understand that we can run into this; however, as I noted, we should improve Fuel so that we don't do bad things when this occurs, and notify the operator about it and how they may be able to fix it.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Łukasz,

> Two clients need to ask for IP simultaneously.

The test [0] runs 4 clients (with the "magic" MAC addresses you've mentioned previously)
which repeatedly discover/request/release. Every client gets a unique IP.

> The problem is in dnsmasq.

There's no evidence of that.

[0] https://github.com/asheplyakov/dhcpflooder/blob/master/stress-test.sh

Ivan Kliuk (ivankliuk)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Ivan Kliuk (ivankliuk)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> fix the dnsmasq hashing algorithm, so that the bug with the same hash for different MACs is fixed.

Assigning IPv4 addresses in a stateless manner (i.e. based on the client MAC address only) is impossible.

Proof:
Assume the IP address allocated for a given client does not depend on IPs of other nodes.
Thus all 253 possible addresses (of a /24 subnet) are equally likely.
Then the probability of there not being any two clients having the same IP is

p_unique = (1 - 1/253)*(1 - 2/253)*...*(1 - (N-1)/253)

where N is a total number of clients.

Hence the probability of collision (that is, assigning the same IP to two or more clients) is

p_coll = 1 - p_unique

Note that p_coll = 54% for N = 20, and p_coll = 95% for N = 39.

Basically this means that the IP computed from the client MAC only most likely collides with the IP allocated for another client.
Therefore any correct algorithm is stateful, so the IP leased to a given client can not be stable (it depends on IPs of other clients, on the order in which clients ask for an IP, etc).
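
For completeness, the same figures can be checked with a few lines of Python (plain birthday-problem arithmetic, unrelated to any Fuel code):

# Probability that at least two of n clients end up with the same address
# when each draws independently from a pool of 253 equally likely IPs.
def p_coll(n, pool=253):
    p_unique = 1.0
    for k in range(1, n):
        p_unique *= 1 - k / pool
    return 1 - p_unique

print(round(p_coll(20), 2))  # -> 0.54
print(round(p_coll(39), 2))  # -> 0.95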

Revision history for this message
Łukasz Oleś (loles) wrote :

@Alexei

I'm a little lost. What exactly are you trying to prove here? The problem exists. It's described in the previous bug and on the dnsmasq mailing list: http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2010q2/003881.html This happened many times in a 100-node lab. We added 'dhcp-sequential-ip' to fix it, and it solved the problem. Duplicate IPs never happened again in the scale lab.

If dhcp-sequential-ip causes different problems, we should focus on solving them.

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Looks like dhcp-sequential-ip should fix this issue. It was merged into the 6.1 and 6.0 branches. We were not able to reproduce this bug, and we have no evidence that it is not fixed by this option. Moving to Fix Committed for 6.1 and 6.0.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Łukasz,

> What exactly are you trying to prove here?

1) dnsmasq does NOT lease duplicate IPs to clients
2) Nailgun should not assume the node IP is set in stone
3) As a temporary workaround (until Nailgun gets fixed) one could use a /16 network instead of a /24 one (example below)
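
To illustrate point 3, a widened pool might look roughly like this in dnsmasq terms; this is purely an example range, not the actual Fuel template, and the admin network CIDR in Fuel's own settings would need to match:

dhcp-range=10.20.0.3,10.20.255.254,255.255.0.0,12h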

> It's described in the previous bug and on the dnsmasq mailing list: http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2010q2/003881.html

That's not a bug, the reporter misunderstands the DHCP protocol.
The address is not assigned to the client until the server has ACK'ed it.
DHCPOFFERing the same IP to different clients is explicitly permitted by RFC 2131 [1]:

   2. Each server may respond with a DHCPOFFER message that includes an
      available network address in the 'yiaddr' field (and other
      configuration parameters in DHCP options). Servers need not
      reserve the offered network address, although the protocol will
      work more efficiently if the server avoids allocating the offered
      network address to another client.

[1] https://tools.ietf.org/html/rfc2131#section-3.1

Revision history for this message
Łukasz Oleś (loles) wrote :

1) dnsmasq does NOT lease duplicate IPs to clients

Ok

2) Nailgun should not assume the node IP is set in stone

It should not. If the IP is not updated in the DB, then this is a bug which we need to fix.

3) As a temporary workaround (until Nailgun gets fixed) one could use a /16 network instead of a /24 one

It may work :)

> That's not a bug, the reporter misunderstands the DHCP protocol.
Ok, thanks for the clarification. It looks like not every hardware vendor knows that, because when dnsmasq offered the same IP to both servers, one of them failed :/

Revision history for this message
Dmitry Nikishov (nikishov-da) wrote :

Is there any way to fix it for 6.0/6.0.1? It keeps happening in the customer's lab.

Roman Rufanov (rrufanov)
tags: added: support
Revision history for this message
Roman Rufanov (rrufanov) wrote :

Customer found on 6.0.1 in Prod. Please provide a fix.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 6.0-updates as there is no way to deliver Fuel fixes in maintenance updates for 6.0
