Nodes stuck in "provisioning" state despite successful reboot into newly installed OS
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Fuel for OpenStack | Fix Released | High | Georgy Kibardin | |
6.1.x | Won't Fix | High | Sergii Rizvan | |
7.0.x | Won't Fix | High | Sergii Rizvan | |
8.0.x | Fix Released | High | Sergii Rizvan | |
Mitaka | Fix Released | High | Georgy Kibardin | |
Bug Description
According to the output of `fuel nodes', almost half of the nodes are stuck in the `provisioning' state.
However, those nodes in fact rebooted successfully into the newly installed OS a long time ago (~ 20 -- 30 minutes).
The problem is that the mcollective service uses a wrong config file that specifies
activemq as the messaging backend; as a result, the node can't notify Fuel that provisioning has
completed successfully. The /var/log/
# Logfile created on 2016-05-24 11:51:19 +0000 by logger.rb/31641
I, [2016-05-
I, [2016-05-
W, [2016-05-
I, [2016-05-
I, [2016-05-
I, [2016-05-
I, [2016-05-
I, [2016-05-
[repeated many many times]
There's also an error message regarding a failed mcollective configuration attempt in
/var/log/
Reading package lists...
Building dependency tree...
Reading state information...
mcollective is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
start: Job is already running: mcollective
2016-05-24 11:51:19,768 - util.py[WARNING]: Running mcollective (<module 'cloudinit.
Currently, cloud-init configures mcollective and then runs the following command:
service mcollective start
However, nailgun-agent also adjusts the mcollective config (/etc/mcollecti
the race and starts mcollective before cloud-init has configured it,
the above start command does nothing. As a result, the node can't report that provisioning has completed successfully, so deployment never starts.
To make sure that mcollective picks up the config file changes made by cloud-init, it must be restarted rather than merely started.
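The difference between `start` and `restart` in this race can be illustrated with a toy model. This is only a sketch, not Fuel or mcollective code: `ToyService` is a hypothetical class, the dict stands in for the config file, and the `rabbitmq` value is an assumed example of a corrected backend (the report itself only names activemq as the wrong one).

```python
class ToyService:
    """Toy daemon that reads its config only when the process starts."""

    def __init__(self, config_store):
        self.config_store = config_store  # dict standing in for the config file
        self.running = False
        self.loaded_backend = None

    def start(self):
        # Like `service mcollective start`: does nothing if already running.
        if self.running:
            return "already running"
        self.loaded_backend = self.config_store["connector"]
        self.running = True
        return "started"

    def stop(self):
        self.running = False

    def restart(self):
        # Stop, then start again, so the config is re-read.
        self.stop()
        return self.start()


config = {"connector": "activemq"}   # stale default config
svc = ToyService(config)

svc.start()                          # first starter wins the race
config["connector"] = "rabbitmq"     # config is rewritten afterwards (assumed value)

second_start = svc.start()           # the later `start` is a no-op
backend_after_start = svc.loaded_backend

svc.restart()                        # a restart re-reads the config
backend_after_restart = svc.loaded_backend
```

In this model the second `start` leaves the stale backend loaded, while `restart` picks up the rewritten value, which mirrors why the report asks for a restart.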
* Steps to reproduce:
- Deploy a minimal cluster: 3 controllers, 2 computes, 3 ceph-osd nodes
* Expected results:
- Provisioning and deployment complete normally within a reasonable time (~ 30 -- 60 minutes)
* Actual results:
- A large number of nodes (almost half) are stuck in the `provisioning' state
* Workaround:
- manually restart the mcollective service on the affected nodes
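The workaround could be scripted along the following lines. This is a hedged sketch: the host names are hypothetical, passwordless ssh from the master node to the affected nodes is an assumption, and the helper functions are illustrative rather than part of Fuel. By default it only builds and prints the commands.

```python
import subprocess


def restart_command(host):
    """Build the ssh command that restarts mcollective on one node."""
    return ["ssh", host, "service", "mcollective", "restart"]


def restart_mcollective(hosts, dry_run=True):
    """Return the commands for all hosts; run them only when dry_run is False."""
    commands = [restart_command(host) for host in hosts]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the remote command fails.
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical names of nodes stuck in the `provisioning' state.
affected = ["node-1", "node-2"]
for cmd in restart_mcollective(affected):
    print(" ".join(cmd))
```

Passing `dry_run=False` would actually execute the restarts; keeping the dry run as the default makes it easy to review the node list first.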
Changed in fuel: | |
milestone: | none → 7.0 |
Changed in fuel: | |
assignee: | Fuel Python Team (fuel-python) → Kamil Sambor (ksambor) |
Changed in fuel: | |
status: | Triaged → In Progress |
assignee: | Kamil Sambor (ksambor) → Vladimir Kozhukalov (kozhukalov) |
tags: | added: customer-found |
no longer affects: | fuel/newton |
tags: | added: fuel-py |
tags: | added: fuel-python removed: fuel-py |
tags: | added: area-python removed: fuel-python |
description: | updated |
tags: | added: on-verification |
Moved back to 6.1. If we want to move it to 7.0, we need to explain why we do not want to fix it in 6.1 and explain the user impact.