If one node goes offline during provisioning step, all deployment will be failed

Bug #1546604 reported by Sergey Galkin
This bug affects 1 person
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Released   High        Vladimir Sharshov
8.0.x               Fix Committed  High        Michael Polenchuk

Bug Description

If even one node goes offline, the whole deployment is marked as failed.

_____________________________________
ORIGINAL DESCRIPTION:
Steps to reproduce
1. Deploy Fuel in KVM
2. Start deploying a cluster with 190 hardware nodes

Deployment failed because all controllers went offline.

On the controllers' consoles I see the error 'try to load pxelinux.cfg/MAC'
Snapshot - http://mos-scale-share.mirantis.com/fuel-snapshot-2016-02-17_13-49-11.tar.gz

Revision history for this message
Sergey Galkin (sgalkin) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "573"
  build_id: "573"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "643a1ef27c7dccc1c2a2ad26b85c09226b35a67d"

Sergey Galkin (sgalkin) wrote :
Changed in fuel:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 9.0
tags: added: area-library
Sergey Galkin (sgalkin) wrote :

Reproduced on the same environment after redeployment, but with two differences:
1. Ubuntu was installed on some of the nodes
2. 52 compute-ceph nodes went offline

Alexander Gordeev (a-gordeev) wrote :

Assigning to fuel-python team.

In short: the target nodes were provisioned and then rebooted. All of them were unable to boot.

At first glance it looks like an issue with the bootloader installation, which is done during provisioning.

So, to find the root cause, someone needs to analyze the fuel-agent and nailgun-agent logs, as well as the syslog/kernel messages from any of the target nodes.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
Alexander Gordeev (a-gordeev) wrote :

Importance is High because a major feature is broken.

Changed in fuel:
importance: Medium → High
tags: added: area-python
removed: area-library
tags: added: tricky
tags: added: module-astute
Alexander Gordeev (a-gordeev) wrote :

http://paste.openstack.org/show/487455/

long story short, what actually happened:

1) provisioning of 50 target nodes started.

2016-02-17 17:57:25 INFO [1071] Starting OS provisioning for nodes: 102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152

2) it went smoothly. All changes were successfully applied to the cobbler node profiles. Then the upload of the provision data (provision.json) started. Technically, that upload is implemented via the mcollective service.

3) provision.json was uploaded to nodes 102,103,104,105,106,107,108,109,110,111.

4) for some reason, the next target node, 112, was offline at that moment, hence the upload failed.

the last entries in its log files are from 17:30:30

2016-02-17T17:30:30.578712+00:00 debug: 17:30:30.400765 #2746] DEBUG -- : runnerstats.rb:56:in `block in sent' Incrementing replies stat
2016-02-17T17:30:30.578844+00:00 warning: 17:30:30.405476 #2746] WARN -- : netio.rb:387:in `_init_line_read' PLMC7: Exiting after signal: SignalException: SIGTERM
2016-02-17T17:30:30.578844+00:00 debug: 17:30:30.405615 #2746] DEBUG -- : rabbitmq.rb:350:in `disconnect' Disconnecting from RabbitMQ
2016-02-17T17:30:30.578968+00:00 info: 17:30:30.405943 #2746] INFO -- : rabbitmq.rb:20:in `on_disconnect' Disconnected from stomp://mcollective@10.20.0.2:61613

5) astute did 10 retries with no luck.
2016-02-17 17:58:38 DEBUG [1071] Retry #1 to run mcollective agent on nodes: '112'
2016-02-17 17:59:41 DEBUG [1071] Retry #2 to run mcollective agent on nodes: '112'
2016-02-17 18:00:43 DEBUG [1071] Retry #3 to run mcollective agent on nodes: '112'
2016-02-17 18:01:46 DEBUG [1071] Retry #4 to run mcollective agent on nodes: '112'
2016-02-17 18:02:49 DEBUG [1071] Retry #5 to run mcollective agent on nodes: '112'
2016-02-17 18:03:51 DEBUG [1071] Retry #6 to run mcollective agent on nodes: '112'
2016-02-17 18:04:54 DEBUG [1071] Retry #7 to run mcollective agent on nodes: '112'
2016-02-17 18:05:56 DEBUG [1071] Retry #8 to run mcollective agent on nodes: '112'
2016-02-17 18:06:59 DEBUG [1071] Retry #9 to run mcollective agent on nodes: '112'
2016-02-17 18:08:02 DEBUG [1071] Retry #10 to run mcollective agent on nodes: '112'

6) astute gave up with trace:
2016-02-17 18:09:04 ERROR [1071] MCollective agents 'uploadfile' '112' didn't respond within the allotted time.
 trace:
["/usr/share/gems/gems/astute-8.0.0/lib/astute/mclient.rb:114:in `check_results_with_retries'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/mclient.rb:60:in `method_missing'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:46:in `upload_provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:22:in `block in provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:22:in `each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:22:in `provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:296:in `image_provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:241:in `block in provision_piece'",
 "/usr/share/gems/gems/astu...
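
The timings in steps 5 and 6 can be checked with a few lines of Python. This is only a sketch: the timestamps are copied from the astute log excerpts above, while the variable and function names are made up for illustration and are not part of astute.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the astute log lines quoted in this comment.
PROVISION_START = "2016-02-17 17:57:25"  # "Starting OS provisioning for nodes"
RETRIES = [
    "2016-02-17 17:58:38", "2016-02-17 17:59:41", "2016-02-17 18:00:43",
    "2016-02-17 18:01:46", "2016-02-17 18:02:49", "2016-02-17 18:03:51",
    "2016-02-17 18:04:54", "2016-02-17 18:05:56", "2016-02-17 18:06:59",
    "2016-02-17 18:08:02",
]
GAVE_UP = "2016-02-17 18:09:04"  # "didn't respond within the allotted time"


def parse(stamp):
    """Convert a log timestamp string into a datetime."""
    return datetime.strptime(stamp, FMT)


def gaps_seconds(stamps):
    """Return the number of seconds between each pair of consecutive timestamps."""
    times = [parse(s) for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


retry_gaps = gaps_seconds(RETRIES)
total_wait = int((parse(GAVE_UP) - parse(PROVISION_START)).total_seconds())

print("seconds between retries:", retry_gaps)
print("seconds from provisioning start to failure:", total_wait)
```

With these inputs the retries are spaced 62 to 63 seconds apart, and 699 seconds (about 11.5 minutes) pass between the start of provisioning and the point where astute gives up on node 112.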


Changed in fuel:
status: Confirmed → Triaged
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)
Leontiy Istomin (listomin) wrote :

The nodes failed due to network connectivity issues. But as @a-gordeev mentioned earlier, when some nodes go offline we shouldn't fail the whole deployment.

summary: - Controllers fail to boot during deployment
+ If one node goes offline during provisioning step, all deployment will
+ be failed
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/288113

Changed in fuel:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/288113
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=79f99adf48de37d33b5e089472f91b2f7e614e55
Submitter: Jenkins
Branch: master

commit 79f99adf48de37d33b5e089472f91b2f7e614e55
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Thu Mar 3 23:25:26 2016 +0300

    Flexible way to work with node provision

    Changes:

    - use upload file task instead of magent directly;
    - fault tolerance for uploading errors;
    - big refactoring of image provision;
    - add missing tests for image provision.

    Change-Id: I70169855082c899cb287ff5a10c907d90b3f81b5
    Closes-Bug: #1546604
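
The "fault tolerance for uploading errors" item can be illustrated with a minimal sketch. It is not the code from this commit (astute is written in Ruby, and the real change is in the review linked above). The function `upload_with_tolerance`, the `upload` callback, and the use of `TimeoutError` are hypothetical stand-ins chosen only to show the idea: retry each node separately, record the nodes that never respond, and keep going with the rest instead of raising.

```python
from typing import Callable, Iterable, List, Tuple


def upload_with_tolerance(
    node_ids: Iterable[str],
    upload: Callable[[str], None],
    max_retries: int = 10,
) -> Tuple[List[str], List[str]]:
    """Upload to every node and return (uploaded, failed) instead of raising.

    `upload` is a hypothetical callback that raises TimeoutError when a node
    does not respond. Each node gets one initial attempt plus `max_retries`.
    """
    uploaded: List[str] = []
    failed: List[str] = []
    for node_id in node_ids:
        for _attempt in range(1 + max_retries):
            try:
                upload(node_id)
            except TimeoutError:
                continue  # node did not answer, try again
            uploaded.append(node_id)
            break
        else:
            # Every attempt timed out: remember the node and move on.
            failed.append(node_id)
    return uploaded, failed


# Usage: node 112 is offline, as in this bug; the other nodes are still served.
OFFLINE = {"112"}
calls: List[str] = []


def fake_upload(node_id: str) -> None:
    calls.append(node_id)
    if node_id in OFFLINE:
        raise TimeoutError(f"node {node_id} did not respond")


uploaded, failed = upload_with_tolerance(["111", "112", "113"], fake_upload, max_retries=2)
print(uploaded, failed)
```

In this sketch the caller decides what to do with the `failed` list, for example marking only those nodes as errored while provisioning continues on the others.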

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/322770

Andrew Kalach (akndex)
Changed in fuel:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/8.0)

Reviewed: https://review.openstack.org/322770
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=17ddf0ecac92475287266179828a6cc03967c876
Submitter: Jenkins
Branch: stable/8.0

commit 17ddf0ecac92475287266179828a6cc03967c876
Author: Michael Polenchuk <email address hidden>
Date: Mon May 30 13:58:40 2016 +0300

    Prevent unexpected exception if provision fail

    Squashed commits from the 9.0:
    - 79f99adf48de37d33b5e089472f91b2f7e614e55
      - fault tolerance for uploading errors
      - use upload file task instead of magnet directly
    - e07e74eb5980421b47fbc64b6d6f50a955e7cad1
      - do not fail if no nodes were sent to reboot

    Change-Id: I5b806f3d1411c4445a58b899b73eca035f5931b9
    Closes-Bug: #1546604
    Related-Bug: #1540360
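
The "do not fail if no nodes were sent to reboot" item matters once uploads are allowed to fail: the list of nodes that reach the reboot step may be empty. A minimal sketch of such a guard follows. It assumes a hypothetical `reboot_nodes` helper and `send_reboot` callback, neither of which is taken from astute.

```python
from typing import Callable, List


def reboot_nodes(
    node_ids: List[str],
    send_reboot: Callable[[List[str]], None],
) -> List[str]:
    """Send a reboot request and return the nodes it was sent to.

    An empty list is treated as a no-op instead of an error, so a run in
    which every upload failed does not raise at this step.
    """
    if not node_ids:
        return []
    send_reboot(node_ids)
    return list(node_ids)


# Usage: record what would have been sent instead of talking to real nodes.
sent: List[List[str]] = []
nothing = reboot_nodes([], sent.append)
some = reboot_nodes(["102", "103"], sent.append)
print(nothing, some, sent)
```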
