Filesystem of a provisioned node is destroyed if stop provisioning is called after the node has rebooted into the installed OS

Bug #1316583 reported by Vladimir Sharshov
This bug affects 4 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  Medium      Vladimir Sharshov
5.0.x               Won't Fix      Medium      Vladimir Sharshov

Bug Description

The filesystem of a provisioned node is destroyed if stop provisioning is called after the node has rebooted into the installed OS.

This happens because the shell script that Astute runs fails on a provisioned node: the filesystem gets damaged, and the MBR is not deleted. After a restart, if the node gives boot priority to the hard drive over the network, it boots from the damaged disk and hits a fatal system error.

By default, VirtualBox gives boot priority to the hard drive over the network.

Two ways to solve this:
- fix the Astute part to call the Ruby script instead of the shell script for provisioned nodes;
- change the boot order in VirtualBox and add a special note about the importance of the boot order.

tags: added: release-notes
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/92376

Changed in fuel:
status: New → In Progress
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/92376
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=5ce3d348201dc45bf8c96f47dd2f19593ddf20a7
Submitter: Jenkins
Branch: master

commit 5ce3d348201dc45bf8c96f47dd2f19593ddf20a7
Author: Dmitry Pyzhov <email address hidden>
Date: Tue May 6 17:59:26 2014 +0400

    Change boot order in virtualbox scripts

    Master node will boot from disk/cdrom/net
    Slave nodes will boot from net/disk
    It is done in order to meet our requirements:
    http://docs.mirantis.com/fuel/fuel-4.1/install-guide.html#fuel-installation-procedures

    Partial-Bug: #1316583
    Change-Id: I2bdd929e48cf5c4cc687b7ef77ea749cc8c1750e
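
For illustration, the boot order named in this commit message could be expressed as `VBoxManage modifyvm` arguments. The sketch below only builds the argument lists and is not taken from the fuel-main scripts: the helper name and the VM names are hypothetical, and "cdrom" is mapped to the `dvd` device name that VBoxManage uses.

```python
from typing import List

# Device names accepted by `VBoxManage modifyvm --boot<1-4>`.
VALID_DEVICES = {"none", "floppy", "dvd", "disk", "net"}


def boot_order_command(vm_name: str, devices: List[str]) -> List[str]:
    """Build a `VBoxManage modifyvm` argument list for the given boot order."""
    if len(devices) > 4:
        raise ValueError("VirtualBox supports at most four boot slots")
    cmd = ["VBoxManage", "modifyvm", vm_name]
    # Pad unused slots with "none" so an old entry cannot survive in them.
    padded = devices + ["none"] * (4 - len(devices))
    for slot, device in enumerate(padded, start=1):
        if device not in VALID_DEVICES:
            raise ValueError(f"unknown boot device: {device}")
        cmd += [f"--boot{slot}", device]
    return cmd


# Hypothetical VM names; the orders follow the commit message.
master_cmd = boot_order_command("fuel-master", ["disk", "dvd", "net"])
slave_cmd = boot_order_command("fuel-slave-1", ["net", "disk"])
print(" ".join(master_cmd))
print(" ".join(slave_cmd))
```

With the network first on slave nodes, a node whose disk was only partly erased would PXE-boot into bootstrap instead of trying the damaged disk.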

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Dmitry Pyzhov (lux-place) → Fuel Astute Team (fuel-astute)
status: In Progress → Confirmed
importance: High → Medium
milestone: 5.0 → 5.1
Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

Added to list of Known Issues in 5.0 Release Notes.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

In this report, Artem Panchenko (apanchenko-8) suggests a possible fix: https://bugs.launchpad.net/fuel/+bug/1321095

I checked it, and it solves half of the problem. The change really does clear the MBR, but we get "EXT4-fs error: file system corruption" and lose the ability to restart the system.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Vladimir, I can't be sure, but the first time I changed the ssh_erase_nodes.rb script and then restarted 'astute', no changes were applied and the orchestrator logs contained the old commands. I also found a way to make the changes persist:

1. Inside the astute container, replace the file /usr/lib64/ruby/gems/2.1.0/gems/astute-0.0.2/lib/astute/ssh_actions/ssh_erase_nodes.rb
2. Commit the changes to the docker repo (docker commit <astute_container_id> fuel/astute_5.0)
3. Stop the astute container (dockerctl stop astute)
4. Start the astute container (supervisorctl start docker-astute)
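
Steps 2 to 4 above could be scripted roughly as follows. This is only a sketch: the helper names and the dry-run default are illustrative and not part of Fuel, the container ID is a placeholder, and step 1 (replacing the file inside the container) is assumed to have been done already.

```python
import subprocess
from typing import List


def persist_commands(container_id: str, image: str = "fuel/astute_5.0") -> List[List[str]]:
    """Return steps 2-4 from the list above as argument lists, in order."""
    return [
        ["docker", "commit", container_id, image],
        ["dockerctl", "stop", "astute"],
        ["supervisorctl", "start", "docker-astute"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# "abc123" is a placeholder container ID.
commands = persist_commands("abc123")
run_all(commands)
```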

After that I was able to start a new deployment and successfully stop it during provisioning.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Unfortunately, that is not the case. As you can see in the logs, I checked the new version. Maybe we have different scenarios for check?

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

> Maybe we have different scenarios for check?

Yes, I tested it only during provisioning (OS installation), before the installation had completed. I thought a different scenario was used for clearing drives when the OS is already installed =)
As I understand it, when 'astute' tries to purge the hard drives it gets an Input/Output error, because the system is running from the HDD itself. So I think the order in which data is removed can be changed, so that the node cannot boot from the hard drive after the resulting 'kernel panic':

1. Remove the first sectors of each drive and of its partitions
2. Purge the whole drives (in practice, only until the kernel panic occurs)
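
The proposed ordering can be sketched as a plan of `dd` commands in which every head wipe comes before any full purge. This is an illustration under assumptions, not the attached script: the function name, the device layout, and the sector count are hypothetical.

```python
from typing import Dict, List


def erase_plan(drives: Dict[str, List[str]], head_sectors: int = 1) -> List[List[str]]:
    """Order the dd commands so boot data is wiped before any full purge."""
    head_wipes = []
    full_purges = []
    for drive, partitions in drives.items():
        # Step 1: zero the first sectors of the drive and of each partition.
        for device in [drive] + partitions:
            head_wipes.append(
                ["dd", "if=/dev/zero", f"of={device}", "bs=512", f"count={head_sectors}"]
            )
        # Step 2: zero the whole drive (dd stops at the end of the device).
        full_purges.append(["dd", "if=/dev/zero", f"of={drive}", "bs=1M"])
    return head_wipes + full_purges


# Hypothetical layout: one drive with two partitions, one empty drive.
plan = erase_plan({"/dev/sda": ["/dev/sda1", "/dev/sda2"], "/dev/sdb": []})
for cmd in plan:
    print(" ".join(cmd))
```

If the purge in step 2 is cut short by a kernel panic, the boot sectors are already gone, so a node that tries the hard drive first finds nothing to boot from it.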

I've attached the file, which I tested on 4 different environments (2 stops during provisioning on CentOS/Ubuntu, 2 stops after provisioning had finished on CentOS/Ubuntu), and it works for me.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Artem Panchenko (apanchenko-8), thanks! I will check it tomorrow.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

I checked the new version of ssh_erase_nodes. Unfortunately, it fails when the nodes are already provisioned.

I believe the problem can be solved if we erase provisioned nodes using the Ruby scripts, as we already do for erase/reset/stop deployment.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Yes, you are absolutely right. I tested it once again, and the nodes got stuck when I stopped provisioning right after the OS installation had completed. I think the kernel panic wasn't triggered by destroying the filesystem, and the node simply hangs. Just in case, I checked the command used in the Ruby scripts to reset a deployment and found that only the first and last blocks of the drives are cleared there, so it doesn't destroy the filesystem. I tried to implement the same logic in a bash script, and it also works in the provisioning state (even after the OS is installed). So, if you have time, please take a look at the script I attached to this comment.

Btw, I guess there is no need to purge whole drives or partitions at all. When a new deployment is started, the OS installation will create a new filesystem, so I think no errors will occur.
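
The "first and last blocks only" approach described above can be sketched as follows. This is not the attached bash script or the Astute Ruby code: the function name and the block size are assumptions, and the demonstration runs against a temporary file rather than a real block device.

```python
import os
import tempfile


def wipe_head_and_tail(path: str, block_size: int = 1024 * 1024, blocks: int = 1) -> None:
    """Zero the first and last `blocks * block_size` bytes of a file or device."""
    with open(path, "r+b") as dev:
        size = dev.seek(0, os.SEEK_END)
        span = min(block_size * blocks, size)
        zeros = bytes(span)
        dev.seek(0)
        dev.write(zeros)          # start of the device: MBR and partition table
        dev.seek(size - span)
        dev.write(zeros)          # end of the device: e.g. backup metadata
        dev.flush()
        os.fsync(dev.fileno())    # push the writes out before returning


# Demonstration on a 4096-byte scratch file filled with 0xff.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff" * 4096)
    image_path = tmp.name

wipe_head_and_tail(image_path, block_size=512)
with open(image_path, "rb") as f:
    data = f.read()
os.remove(image_path)

print(data[:512] == bytes(512), data[-512:] == bytes(512))
```

Everything between the two wiped spans is left untouched, which matches the observation that the filesystem itself is not destroyed.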

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Unfortunately, it also fails when the nodes are already provisioned. I get 2 different errors on nodes processed in parallel.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Vladimir, that sounds strange, because I tested it once again and the slaves were successfully rebooted to bootstrap. I attached screenshots of my steps. Could you please tell me the exact deployment state at which I should stop it to catch the filesystem errors? Thanks.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

After additional investigation I found that sometimes Astute can't reboot a node using SysRq over SSH:

http://paste.openstack.org/show/81629/

but the previous task (erase_nodes) finished successfully:

http://paste.openstack.org/show/81638/

and I can access the node manually from the master node without problems. You can find the full logs in the attachments.

Also, I added the following steps to ssh_erase_nodes.rb to avoid the situation where filesystem corruption occurs on the system drive and the node really can't be accessed via SSH due to an input/output error:

1. Find all known partitions with filesystems and set them to cause a panic in case of errors
2. Remove only the MBR (including the partition table), and don't destroy filesystems if GRUB isn't installed there
3. Flush filesystem buffers before exiting
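
As a rough illustration of these three steps, the sketch below builds the corresponding command list. It is an assumption that `tune2fs -e panic` is how the error behaviour would be set (that option applies to ext2/3/4 only); the function name and device names are hypothetical, and the actual attached script may differ.

```python
from typing import List


def careful_erase_commands(disks: List[str], ext_partitions: List[str]) -> List[List[str]]:
    """Build the commands for the three steps above, in order."""
    commands = []
    # Step 1: make ext filesystems panic on errors instead of continuing.
    for part in ext_partitions:
        commands.append(["tune2fs", "-e", "panic", part])
    # Step 2: zero only the first 512 bytes (boot code plus partition table).
    for disk in disks:
        commands.append(["dd", "if=/dev/zero", f"of={disk}", "bs=512", "count=1"])
    # Step 3: flush filesystem buffers before the script exits.
    commands.append(["sync"])
    return commands


# Hypothetical device names.
erase_cmds = careful_erase_commands(["/dev/sda"], ["/dev/sda1", "/dev/sda3"])
for cmd in erase_cmds:
    print(" ".join(cmd))
```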

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

I have probably found the reason for the "inaccessible" error when trying to reboot. I guess it's because Astute removes node from cobbler first and node's hostname becomes unresolvable:

http://paste.openstack.org/show/81649/

I was able to catch this error only on "fast" hardware (with SSD drives); on a "slow" environment, "reboot after erasing" works fine.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

> I guess it's because Astute removes node from cobbler first and node's hostname becomes unresolvable

Good point. I think that may well be the case. We can use the IP address instead of the hostname to avoid this problem.
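
A minimal sketch of that idea, assuming each node record carries an admin IP next to its hostname. The dictionary keys, helper names, and example addresses are hypothetical and are not Astute's actual data model; the remote command uses the standard Linux SysRq interface, where writing `b` to /proc/sysrq-trigger reboots immediately.

```python
from typing import Dict, List


def node_address(node: Dict[str, str]) -> str:
    """Prefer the admin IP; fall back to the hostname only if no IP is known."""
    return node.get("admin_ip") or node["hostname"]


def sysrq_reboot_command(node: Dict[str, str], user: str = "root") -> List[str]:
    """Build an ssh command that reboots the node through SysRq."""
    remote = "echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger"
    return ["ssh", f"{user}@{node_address(node)}", remote]


# Example records; the values are made up for illustration.
with_ip = {"hostname": "node-1.example.com", "admin_ip": "10.20.0.3"}
without_ip = {"hostname": "node-2.example.com"}

print(sysrq_reboot_command(with_ip)[1])
print(sysrq_reboot_command(without_ip)[1])
```

Addressing the node by IP keeps the reboot step working even after the node's DNS record has disappeared together with its Cobbler entry.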

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Nikolay, please add information about the nodes' admin IP to the stop_provisioning task.

This is needed to eliminate a possible race condition involving the nodes' hostnames.

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Nikolay Markov (nmarkov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/96116

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/96488

Changed in fuel:
assignee: Nikolay Markov (nmarkov) → Vladimir Sharshov (vsharshov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/96554

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/96116
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=fbf4a1cc7b891e84f78cc01faec43671f3880efc
Submitter: Jenkins
Branch: master

commit fbf4a1cc7b891e84f78cc01faec43671f3880efc
Author: Nikolay Markov <email address hidden>
Date: Wed May 28 13:31:28 2014 +0400

    Pass admin IP on stop_deployment

    Change-Id: Ib113894aefbe1f60da1303abb4c736f42216e65d
    Related-Bug: #1316583

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/96488
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=1829a9adbe69cc1b50403929a737cadbcbbba0ec
Submitter: Jenkins
Branch: master

commit 1829a9adbe69cc1b50403929a737cadbcbbba0ec
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 29 18:01:18 2014 +0400

    Avoid race condition with hostname declaration

    When we remove nodes from Cobbler, we lose access to this
    node using hostnames.

    Change-Id: Ic52cbd7562db91b9ea01fab30c56054c5253a93b
    Related-Bug: #1316583

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (stable/5.0)

Related fix proposed to branch: stable/5.0
Review: https://review.openstack.org/102768

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (stable/5.0)

Related fix proposed to branch: stable/5.0
Review: https://review.openstack.org/102769

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (stable/5.0)

Reviewed: https://review.openstack.org/102768
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=9ab830ab3f90ede8b8bc2e0fffa663f339322079
Submitter: Jenkins
Branch: stable/5.0

commit 9ab830ab3f90ede8b8bc2e0fffa663f339322079
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 29 18:01:18 2014 +0400

    Avoid race condition with hostname declaration

    When we remove nodes from Cobbler, we lose access to this
    node using hostnames.

    Change-Id: Ic52cbd7562db91b9ea01fab30c56054c5253a93b
    Related-Bug: #1316583
    (cherry picked from commit 1829a9adbe69cc1b50403929a737cadbcbbba0ec)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (stable/5.0)

Reviewed: https://review.openstack.org/102769
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=dd7f32ab80c023a4afda70b521dd5391e5e464fd
Submitter: Jenkins
Branch: stable/5.0

commit dd7f32ab80c023a4afda70b521dd5391e5e464fd
Author: Nikolay Markov <email address hidden>
Date: Wed May 28 13:31:28 2014 +0400

    Pass admin IP on stop_deployment

    Change-Id: Ib113894aefbe1f60da1303abb4c736f42216e65d
    Related-Bug: #1316583
    (cherry picked from commit fbf4a1cc7b891e84f78cc01faec43671f3880efc)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/103352

Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

Marked as "Fixed in 5.0.1" in 5.0.1 Release Notes.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/96554
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=5fa18e8e0873dd2455a984014b7b29a2382cfd0b
Submitter: Jenkins
Branch: master

commit 5fa18e8e0873dd2455a984014b7b29a2382cfd0b
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 29 22:49:01 2014 +0400

    Erase provisioned node when cancel provisioning

    * always erase node in boostrap state (failsafe optimization);
    * do erase using shell script nodes in provisioned/boostrap state;
    * for provisioned/boostrap state use mcollective agent.

    Change-Id: I2a3df52920f57f9c66e237de0d0d48a814ebf409
    Related-Bug: #1316583
    Closes-Bug: #1322573

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (stable/5.0)

Related fix proposed to branch: stable/5.0
Review: https://review.openstack.org/105260

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Vladimir, Artem, please update the status of this bug. A number of patches marked with Related-Bug have been merged, and one more is currently outstanding. Are the patches merged so far sufficient to close it? Is the latest patch also required? Is it sufficient, or are yet more patches needed?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (stable/5.0)

Reviewed: https://review.openstack.org/105260
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=a4edb51661f50c66e247e0b8d00f2d01e0658fe6
Submitter: Jenkins
Branch: stable/5.0

commit a4edb51661f50c66e247e0b8d00f2d01e0658fe6
Author: Vladimir Sharshov <email address hidden>
Date: Thu May 29 22:49:01 2014 +0400

    Erase provisioned node when cancel provisioning

    * always erase node in boostrap state (failsafe optimization);
    * do erase using shell script nodes in provisioned/boostrap state;
    * for provisioned/boostrap state use mcollective agent.

    Change-Id: I2a3df52920f57f9c66e237de0d0d48a814ebf409
    Related-Bug: #1316583
    Closes-Bug: #1322573

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Dmitry,

to fix the issue described in https://bugs.launchpad.net/fuel/+bug/1321095 we need to merge https://review.openstack.org/#/c/103352/ (fixes stop deployment via SSH during provisioning) and backport it to stable/5.0. It also seems that the patch https://review.openstack.org/#/c/105459/ should be merged to fix the 'stop provisioning' feature on 5.0.1 (https://bugs.launchpad.net/fuel/+bug/1339024).

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/103352
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=4bf5312f4752769b51974ee5804dd224b67a0dcf
Submitter: Jenkins
Branch: master

commit 4bf5312f4752769b51974ee5804dd224b67a0dcf
Author: Artem Panchenko <email address hidden>
Date: Sat Jun 28 19:50:12 2014 +0300

    Refactor ssh actions used for node erasing

    Patch fixes few issues with erasing drives while
    stopping deployment during provisioning, e.g.:
     * executing dd with non-existing under
       debootstrap shell option 'flag';
     * broken check of block device major code;
     * incorrect order of arguments, which are passed
       to erase_data function;
    Also it implements more robust mechanizm to detect
    provisining or provisined node.

    Change-Id: Ic6022cc4ecb405a17dbeefb095590532cbbbe33b
    Related-bug: #1316583

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (stable/5.0)

Related fix proposed to branch: stable/5.0
Review: https://review.openstack.org/108188

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (stable/5.0)

Change abandoned by Artem Panchenko (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/108188
Reason: For 5.0.2 we only provide changes for packages and manifests.
