Fuel for OpenStack

All nodes in error state after scaling because one compute node was unreachable

Bug #1502295 reported by Mykola Grygoriev on 2015-10-02

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	High	Maciej Kwiek	Fuel for OpenStack 8.0
6.1.x	In Progress	High	MOS Maintenance	Fuel for OpenStack 6.1-updates
7.0.x	In Progress	High	MOS Maintenance	Fuel for OpenStack 7.0-updates

Bug Description

Fuel 6.1.

Short description:
The customer successfully deployed a cloud with 20+ compute nodes. A day or two later he tried to add one more compute node. The new compute node was deployed successfully, but the Astute task then failed at the 'uploadfile' step, because one compute node was unavailable at that time and the mcollective agent couldn't reach it. After this, Astute marked all nodes as "error" and set the cloud status to error.
http://paste.openstack.org/show/475207/

The customer plans to use a lot of compute nodes, so one of them could well be unreachable when he scales up the cloud. Besides, the unavailability of one or two compute nodes doesn't affect the cloud as a whole.

Steps to reproduce:
1. Deploy a cloud with 1 controller and 2 compute nodes.
2. Make 1 compute node unreachable.
3. Scale up the cloud with 1 more compute node.

Current result:
After scaling up while 1 compute node is unreachable, all nodes are in error state.

Expected result:
After scaling up while 1 compute node is unreachable, only the unreachable node is in error state.

Tags:

Stanislaw Bogatkin (sbogatkin) on 2015-10-05

Changed in fuel:
status:	New → Confirmed
importance:	Undecided → High
assignee:	nobody → Fuel Python Team (fuel-python)
milestone:	none → 6.1-updates

Dmitry Pyzhov (dpyzhov) on 2015-10-06

tags:	removed: critical

Dmitry Pyzhov (dpyzhov) on 2015-10-08

tags:	added: tricky
no longer affects:	fuel/8.0.x

Maciej Kwiek (maciej-iai) wrote on 2015-10-08:

There is a workaround for this bug: when a node goes offline, you should remove it (there is an option for removing offline nodes in the web UI). After the offline node is removed, you are able to deploy any new changes.

Maciej Kwiek (maciej-iai) wrote on 2015-10-08:

There should be a warning in the UI (or CLI) if you are running a deployment with offline nodes that have not been removed.

OpenStack Infra (hudson-openstack) wrote on 2015-10-09: Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/233024

Andrew Woodward (xarses) wrote on 2015-10-09:

I don't think this properly addresses the problem. The root issue here is that in a multi-node task, if one node fails, all nodes in the task are marked as failed. It happens when the task itself fails too. When a task is run on a production cloud, this sets the entire cloud to failed. After this, the orchestrator wants to re-run all tasks on all nodes to resolve it.

This is further compounded by the start of a task removing the pending state, not the completion.

Bottom line: only the node(s) that failed in a task should be marked as error, and only the tasks that did not complete should be identified to run the next time changes are deployed.
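
To make the behaviour requested above concrete, here is a minimal sketch. It is not Astute code (Astute is written in Ruby); the function name, the shape of the `results` mapping, and the "ready"/"error" status strings are assumptions made purely for illustration.

```python
def summarize_task_run(results):
    """Derive per-node status from per-node task results.

    `results` is a hypothetical mapping: node uid -> {task name: completed?}.
    Only nodes with at least one incomplete task are marked "error", and
    only their incomplete tasks are queued for the next deployment.
    """
    statuses = {}
    pending = {}
    for uid, tasks in results.items():
        failed = sorted(name for name, ok in tasks.items() if not ok)
        statuses[uid] = "error" if failed else "ready"
        if failed:
            pending[uid] = failed
    return statuses, pending


# Node "2" stands in for the unreachable compute node.
results = {
    "1": {"upload_file": True, "update_hosts": True},
    "2": {"upload_file": False, "update_hosts": False},
    "3": {"upload_file": True, "update_hosts": True},
}
statuses, pending = summarize_task_run(results)
```

With this input, only node "2" ends up in error and only its two tasks are left to re-run; nodes "1" and "3" keep a healthy status.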

Maciej Kwiek (maciej-iai) wrote on 2015-10-12:

As I see it, my patch fixes the bug, but it doesn't resolve the root cause, which is the lack of fault tolerance in the post-deployment phase. I think this issue should be handled in a separate, more general Launchpad bug/blueprint.

Ihor Kalnytskyi (ikalnytskyi) wrote on 2015-10-12:

@Andrew,

We can't address the issue "do not mark all nodes in error state" right now, since it's a limitation on our side. I mean, in the post-deployment stage we have tasks which are critical for clusters (such as enable_quorum) as well as non-critical ones (upload cirros or update host).

So if a post-deployment task has failed, we mark the entire deployment as being in error state, because we can't say whether the cluster is operational or not. I think we can go with @Maciej's fix for now, and keep in mind that a general solution should be addressed as a blueprint.

@Maciej,

It just came to mind: what do you think about also marking **offline** nodes as **error**, so the user will notice that updates weren't applied there? It's ugly, but it will notify the cluster operator that redeployment is needed for these nodes.

OpenStack Infra (hudson-openstack) wrote on 2015-10-13: Change abandoned on fuel-web (master)

Change abandoned by Maciej Kwiek (<email address hidden>) on branch: master
Review: https://review.openstack.org/233024
Reason: After discussing this with loles, the change needs to be done in Astute.

OpenStack Infra (hudson-openstack) wrote on 2015-10-14: Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234657

Dmitry Pyzhov (dpyzhov) on 2015-10-22

tags:	added: area-python

OpenStack Infra (hudson-openstack) wrote on 2015-10-27: Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/234657
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=d10e78b0b03e64751adf197c4a5921fb2430059c
Submitter: Jenkins
Branch: master

commit d10e78b0b03e64751adf197c4a5921fb2430059c
Author: Maciej Kwiek <email address hidden>
Date: Wed Oct 14 11:15:47 2015 +0200

All offline nodes are removed as failed nodes

    remove_failed_nodes took only newly deployed nodes uids into
    consideration for checking for offline nodes. Now all nodes in cluster
    are checked for being available.

Change-Id: Ifbdec3d6f8cd1b2751afb45c185efd5c5316a817
Closes-bug: #1502295
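
The change described in the commit message can be illustrated with a short sketch. This is not the actual Astute implementation (which is Ruby); `split_offline_nodes` and the uid lists are hypothetical, and how node availability is actually probed is not shown. The point is only the difference between checking the newly deployed uids and checking every uid in the cluster.

```python
def split_offline_nodes(uids_to_check, responding_uids):
    """Partition `uids_to_check` into (online, offline) lists.

    `responding_uids` is assumed to be the set of nodes that answered
    an availability probe; both arguments are hypothetical inputs.
    """
    responding = set(responding_uids)
    online = [uid for uid in uids_to_check if uid in responding]
    offline = [uid for uid in uids_to_check if uid not in responding]
    return online, offline


cluster_uids = ["1", "2", "3", "4"]  # "4" is the newly added compute node
new_uids = ["4"]
responding = ["1", "3", "4"]         # node "2" is unreachable

# Old behaviour: only newly deployed nodes are checked, so "2" goes unnoticed.
_, offline_before = split_offline_nodes(new_uids, responding)

# New behaviour: every node in the cluster is checked for availability.
online, offline_after = split_offline_nodes(cluster_uids, responding)
```

In this example the old check reports no offline nodes, while the cluster-wide check singles out node "2", so later tasks can be limited to the nodes that are actually reachable.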

Changed in fuel:
status:	In Progress → Fix Committed

Andrey Lavrentyev (alavrentyev) on 2015-11-23

tags:	added: on-verification

Andrey Lavrentyev (alavrentyev) wrote on 2015-11-24:

#10

Fuel 8.0 has been verified on ISO #185

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "185"
  build_id: "185"
  fuel-nailgun_sha: "7d7366c2ec9b46e4ac90d9c6d3c9e7b87e40ac14"
  python-fuelclient_sha: "e685d68c1c0d0fa0491a250f07d9c3a8d0f9608c"
  fuel-agent_sha: "6f3026d8c8e0927ee8fdf9d3171d506674cc7130"
  fuel-nailgun-agent_sha: "16f5c1a1575a6b482f5159dd2e4b255c03167a7e"
  astute_sha: "c8400f51b0b92254da206de55ef89d17fdf35393"
  fuel-library_sha: "9e565fa8550c78e6391e1da10c07f8be3d329dec"
  fuel-ostf_sha: "c2e1fa0ca859c163a7ff445a70f1264d6be0893b"
  fuel-createmirror_sha: "994fed9b1ed889718b61a59733275c08c2dd4c64"
  fuelmenu_sha: "d12061b1aee82f81b3d074de74ea27a6e962a686"
  shotgun_sha: "c377d163519f6d10b69a654019d6086ba5f14edc"
  network-checker_sha: "a57e1d69acb5e765eb22cab0251c589cd76f51da"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "cd084cf5c4372a46184fb7c2f24568da4e030be2"

tags:	removed: on-verification

Dmitriy Kruglov (dkruglov) on 2015-11-27

Changed in fuel:
status:	Fix Committed → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-01-25: Fix proposed to fuel-astute (stable/6.1)

#11

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/272008

OpenStack Infra (hudson-openstack) wrote on 2016-01-26: Fix proposed to fuel-astute (stable/7.0)

#12

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/272429

slava valyavskiy (slava-val-al) wrote on 2016-02-02:

#13

Guys, this patch breaks the Reduced Footprint re-installation case, where we always have an offline controller node during the compute node's re-installation process.
https://bugs.launchpad.net/fuel/+bug/1539460/

OpenStack Infra (hudson-openstack) wrote on 2017-10-25: Change abandoned on fuel-astute (stable/6.1)

#14

Change abandoned by Tony Breeds (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/272008
Reason: This branch (stable/6.1) is at End Of Life

OpenStack Infra (hudson-openstack) wrote on 2017-10-25: Change abandoned on fuel-astute (stable/7.0)

#15

Change abandoned by Tony Breeds (<email address hidden>) on branch: stable/7.0
Review: https://review.openstack.org/272429
Reason: This branch (stable/7.0) is at End Of Life
