Reinstall virt node fails with `Critical nodes are not available for deployment`

Bug #1539460 reported by Ksenia Svechnikova
This bug affects 2 people.

Affects              Status      Importance  Assigned to
Fuel for OpenStack   Confirmed   Medium      Vladimir Sharshov
8.0.x                Won't Fix   Medium      Fuel Library (Deprecated)
Mitaka               Won't Fix   Medium      Fuel Library (Deprecated)

Bug Description

MOS 8.0 ISO #483

Steps:

  1. Deploy an environment with 3 virt,compute nodes and 3 controllers
  2. Change disk.yaml for node-7 (virt,compute) and set keep_data for the VM to True
       http://paste.openstack.org/show/485369/
  3. Provision node-7 (fuel node --node 7 --provision)
  4. Run the spawn VM task (fuel2 env spawn-vms)
  5. Wait for the VMs to come up
  6. Deploy the virt node

This workflow fails at step 4; we get a controller in the error state:

[root@fuel-support4-mos8 ~]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|-------------|------------------|---------|---------------|-------------------|---------------|---------------|--------|---------
4 | error | Untitled (ec:3c) | None | 172.16.58.200 | 0c:c4:7a:15:ec:3c | | | True | None
18 | ready | Untitled (2e:65) | 1 | 172.16.58.207 | 52:54:00:87:2e:65 | controller | | True | 1
17 | ready | Untitled (9a:1f) | 1 | 172.16.58.206 | 52:54:00:d5:9a:1f | controller | | True | 1
15 | discover | Untitled (fb:25) | None | 172.16.58.209 | 52:54:00:da:fb:25 | | | False | None
7 | provisioned | Untitled (4a:90) | 1 | 172.16.58.201 | 0c:c4:7a:13:4a:90 | compute, virt | | True | 1
8 | ready | Untitled (ed:46) | 1 | 172.16.58.203 | 0c:c4:7a:15:ed:46 | compute, virt | | True | 1
9 | ready | Untitled (4c:88) | 1 | 172.16.58.202 | 0c:c4:7a:13:4c:88 | compute, virt | | True | 1

Puppets on node-7:

[root@fuel-support4-mos8 ~]# less /var/log/docker-logs/remote/172.16.58.201/puppet-agent.log
2016-01-29T08:02:50.280233+00:00 err: Could not run: SIGTERM

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ksenia, please provide more logs and/or a diagnostic snapshot.

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

Sure, the snapshot is uploaded to Google Drive, as it's rather huge:
https://drive.google.com/file/d/0B2v38w72jlwTZThtREZodHhvQUU/view?usp=sharing

description: updated
no longer affects: mos
no longer affects: mos/8.0.x
no longer affects: mos/9.0.x
description: updated
Revision history for this message
slava valyavskiy (slava-val-al) wrote :

A typo has been found in fuel-astute:
https://github.com/openstack/fuel-astute/blob/master/lib/astute/task_deployment.rb#L131

For some reason the provisioning task was triggered for a spawned VM that is in the offline state:

2016-01-29 10:57:47 DEBUG [671] Data received by DeploymentProxyReporter to report it up:
{"nodes"=>
  [{"uid"=>"16",
    "status"=>"error",
    "error_type"=>"provision",
    "role"=>"hook",
    "error_msg"=>
     "Node is not ready for deployment: mcollective has not answered"}],
 "error"=>"Node is not ready for deployment"}

2016-01-29 10:57:47 DEBUG [671] Data send by DeploymentProxyReporter to report it up:
{"nodes"=>
  [{"uid"=>"16",
    "status"=>"error",
    "error_type"=>"provision",
    "role"=>"hook",
    "error_msg"=>
     "Node is not ready for deployment: mcollective has not answered"}],
 "error"=>"Node is not ready for deployment"}

tags: added: team-telco
tags: added: team-mixed
removed: team-telco
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/274604

Andrey Maximov (maximov)
tags: added: move-to-mu
Andrey Maximov (maximov)
tags: removed: move-to-mu
Revision history for this message
slava valyavskiy (slava-val-al) wrote :

2016-02-02 20:18:53 INFO [676] Using Astute::DeploymentEngine::GranularDeployment for deployment.
2016-02-02 20:19:03 DEBUG [676] 9dce78dc-0dbb-405a-bf23-54ad3e691526: MC agent 'systemtype', method 'get_type', results:
{:sender=>"6", :statuscode=>0, :statusmsg=>"OK", :data=>{:node_type=>"target"}}

2016-02-02 20:19:03 DEBUG [676] 9dce78dc-0dbb-405a-bf23-54ad3e691526: MC agent 'systemtype', method 'get_type', results:
{:sender=>"7", :statuscode=>0, :statusmsg=>"OK", :data=>{:node_type=>"target"}}

2016-02-02 20:19:03 DEBUG [676] 9dce78dc-0dbb-405a-bf23-54ad3e691526: MC agent 'systemtype', method 'get_type', results:
{:sender=>"19",
 :statuscode=>0,
 :statusmsg=>"OK",
 :data=>{:node_type=>"target"}}

2016-02-02 20:19:03 DEBUG [676] 9dce78dc-0dbb-405a-bf23-54ad3e691526: MC agent 'systemtype', method 'get_type', results:
{:sender=>"5", :statuscode=>0, :statusmsg=>"OK", :data=>{:node_type=>"target"}}

2016-02-02 20:20:43 DEBUG [676] Data received by DeploymentProxyReporter to report it up:
{"nodes"=>
  [{"uid"=>"18",
    "status"=>"error",
    "error_type"=>"provision",
    "role"=>"hook",
    "error_msg"=>
     "Node is not ready for deployment: mcollective has not answered"},
   {"uid"=>"20",
    "status"=>"error",
    "error_type"=>"provision",
    "role"=>"hook",
    "error_msg"=>
     "Node is not ready for deployment: mcollective has not answered"}],
 "error"=>"Node is not ready for deployment"}

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

It seems that astute checks all target systems before the deployment, and since one of the controllers is offline, it sends a negative response to nailgun...
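The pre-deployment gate described here can be sketched roughly as follows. This is a minimal illustrative Ruby sketch, not the actual Astute code: the method names and the `fail_if_error` field are assumptions inferred from the error text, not the real API.

```ruby
# Illustrative sketch of a "critical nodes" pre-deployment gate.
# Names (critical_node_uids, check_critical_nodes!, 'fail_if_error')
# are assumptions, not the actual fuel-astute API.

def critical_node_uids(nodes)
  # Critical nodes are those whose failure must abort the whole
  # deployment (e.g. the primary controller).
  nodes.select { |node| node['fail_if_error'] }.map { |node| node['uid'] }
end

def check_critical_nodes!(nodes, online_uids)
  # If any critical node did not answer (e.g. mcollective is silent
  # because the node is an offline VM), fail the whole run up front.
  offline = critical_node_uids(nodes) - online_uids
  unless offline.empty?
    raise "Critical nodes are not available for deployment: #{offline.inspect}"
  end
end
```

Under such a scheme, a run where node 16 (primary controller) is offline fails with exactly the message from the bug title, regardless of whether the remaining nodes could be deployed.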

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

The problematic commit has been identified: https://review.openstack.org/#/c/234657
With it, it is not possible to deploy any cluster in the 8.0 release if offline nodes are present.

tags: added: area-python move-to-mu
removed: area-library team-mixed
Revision history for this message
Szymon Banka (sbanka) wrote :

Changed milestone to 8.0-mu-1.
Should be backported to 8.0-MU as soon as it's fixed in 9.0.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/274604
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=1771e78645a9599de7c7406e9c458c76165529cf
Submitter: Jenkins
Branch: master

commit 1771e78645a9599de7c7406e9c458c76165529cf
Author: Alexander Saprykin <email address hidden>
Date: Mon Feb 1 12:21:06 2016 +0100

    Fix typo in TaskDeployment.critical_node_uids

    Change-Id: I31dc549c4c4be6626ccbb648dc611d60c8121669
    Related-Bug: #1539460

Dmitry Pyzhov (dpyzhov)
tags: added: team-mixed
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Alexander Saprykin (cutwater) wrote :

According to the comment from Slava Valyavskiy, this bug is a regression caused by a patch merged into astute. Looks like it should be assigned to the fuel-library team first.

tags: added: area-library
removed: area-python
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

We have passed SCF for 9.0. Moving this medium-priority bug to the 10.0 release.

Changed in fuel:
milestone: 9.0 → 10.0
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

We ran several actions on this env:

2016-01-28 14:12:33 INFO [681] Processing RPC call 'granular_deploy'
2016-01-28 14:20:27 INFO [667] Processing RPC call 'granular_deploy'
2016-01-28 14:37:39 INFO [659] Processing RPC call 'reset_environment'
2016-01-28 14:45:46 INFO [659] Processing RPC call 'execute_tasks'
2016-01-28 15:10:25 INFO [652] Processing RPC call 'remove_nodes'
2016-01-28 15:11:26 INFO [671] Processing RPC call 'remove_nodes'
2016-01-28 15:11:53 INFO [657] Processing RPC call 'image_provision'
2016-01-28 16:19:58 INFO [657] Processing RPC call 'granular_deploy'
2016-01-28 17:26:17 INFO [676] Processing RPC call 'image_provision'
2016-01-28 17:29:24 INFO [676] Processing RPC call 'granular_deploy'
2016-01-28 18:08:13 INFO [681] Processing RPC call 'granular_deploy'
2016-01-29 07:54:38 INFO [667] Processing RPC call 'image_provision'
2016-01-29 08:16:54 INFO [659] Processing RPC call 'granular_deploy'
2016-01-29 08:26:23 INFO [652] Processing RPC call 'dump_environment'

We got a failure on node 16: "Critical nodes are not available for deployment: ["16"]". This node carries the critical role primary-controller.
The last successful action involving it was on 2016-01-28.

But as I can see in the 'fuel node' output, node 16 is not in the list at all.

So my thought, based on the current info, is that the problem is simply an offline primary controller, because I do not see any actions
in the Astute log that would remove it from the cluster.

On the other hand: when was that 'fuel node' report taken? It does not include node 16, but it does include node 4, which has no role at all. I do not see any info in the Astute log about this node 4, which has ready status.

Changed in fuel:
status: Confirmed → Incomplete
assignee: Fuel Library Team (fuel-library) → Vladimir Sharshov (vsharshov)
Revision history for this message
Bartosz Kupidura (zynzel) wrote :

This is normal with the 'virt' role. The controller is offline because it runs as a VM spawned on a 'virt' node.
Without the spawn-vms API call, this controller will never come up.

Changed in fuel:
status: Incomplete → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: removed: move-to-mu