database dying can result in FAILED stacks with IN_PROGRESS resources

Bug #1561214 reported by Steve Baker
This bug affects 1 person
Affects: OpenStack Heat
Status: Fix Released
Importance: Medium
Assigned to: Thomas Herve

Bug Description

Steps to Reproduce:
1. Deploy the overcloud. MariaDB runs out of file descriptors, which causes the deployment to fail and leaves Heat in a bad state.

Running out of file descriptors will be difficult to reproduce. This particular state can be replicated by setting some resources to IN_PROGRESS while their stacks are in an UPDATE_FAILED state.

I'm suggesting a heat-manage command which acts on a single stack and traverses all of its nested stacks, setting anything that is IN_PROGRESS to FAILED and clearing any hooks.
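
A minimal sketch of the traversal proposed above, using an in-memory model rather than Heat's real database objects. The `Stack` and `Resource` classes, the `reset_stack` function, and the plain status strings are all hypothetical and only illustrate the idea; Heat itself tracks an action and a status separately.

```python
from dataclasses import dataclass, field
from typing import List

IN_PROGRESS = "IN_PROGRESS"
FAILED = "FAILED"


@dataclass
class Resource:
    # Hypothetical stand-in for a Heat resource record.
    name: str
    status: str
    hooks: List[str] = field(default_factory=list)


@dataclass
class Stack:
    # Hypothetical stand-in for a Heat stack and its nested stacks.
    name: str
    status: str
    resources: List[Resource] = field(default_factory=list)
    nested: List["Stack"] = field(default_factory=list)


def reset_stack(stack: Stack) -> int:
    """Set IN_PROGRESS stacks and resources to FAILED and clear hooks.

    Recurses into nested stacks and returns the number of status changes.
    """
    changed = 0
    if stack.status == IN_PROGRESS:
        stack.status = FAILED
        changed += 1
    for res in stack.resources:
        if res.status == IN_PROGRESS:
            res.status = FAILED
            changed += 1
        # Hooks are cleared regardless of the resource status.
        res.hooks.clear()
    for child in stack.nested:
        changed += reset_stack(child)
    return changed
```

Recursion keeps the sketch short; a real implementation would also need to deal with stack locks and database transactions, which are left out here.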

Steven Hardy (shardy) wrote :

Is there any less destructive way we can handle this, as all FAILED resources will be replaced, even if they are OK?

I'm thinking of something that uses logic similar to stack-check, so that it actually observes state rather than unconditionally replacing everything. There may not be enough state to do that safely, though.
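
The less destructive alternative could look roughly like the sketch below: each IN_PROGRESS resource is probed first and only marked FAILED if the probe reports it as unhealthy. The `reconcile` function, the `probe` callback, and the plain status strings are hypothetical; this is not how stack-check is implemented in Heat.

```python
from typing import Callable, Dict


def reconcile(statuses: Dict[str, str],
              probe: Callable[[str], bool]) -> Dict[str, str]:
    """Return new statuses, resolving IN_PROGRESS entries by observation.

    `probe(name)` is assumed to return True when the real resource
    behind `name` is healthy.
    """
    result = {}
    for name, status in statuses.items():
        if status != "IN_PROGRESS":
            # Settled resources are left untouched.
            result[name] = status
        elif probe(name):
            # Observed healthy: no replacement needed on the next update.
            result[name] = "COMPLETE"
        else:
            result[name] = "FAILED"
    return result
```

The caveat raised in the comment still applies: if the database lost track of what was being created, a probe may not have enough information to decide safely.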

Thomas Herve (therve)
Changed in heat:
assignee: nobody → Thomas Herve (therve)
milestone: none → newton-1
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/305306

Changed in heat:
status: New → In Progress
Thomas Herve (therve)
Changed in heat:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/305306
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=7977f9f2f324f8916433763132c6ee67213e6ed1
Submitter: Jenkins
Branch: master

commit 7977f9f2f324f8916433763132c6ee67213e6ed1
Author: Thomas Herve <email address hidden>
Date: Wed Apr 13 14:38:59 2016 +0200

    Add command to reset one stack status

    Adds a new heat-manage reset_stack_status to recover from specific
    crashes that leaves resources in progress. It removes resource hooks and
    stack locks as well.

    Closes-Bug: #1561214
    Change-Id: I70fa5857c959bc5f1424d562ff8b7740331b5328
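
For operators who script recovery, the command added by this commit could be wrapped as in the sketch below. The command name `heat-manage reset_stack_status` comes from the commit message; passing the stack ID as a single positional argument is an assumption here, and `build_reset_command` and the placeholder ID are hypothetical.

```python
import shlex
from typing import List


def build_reset_command(stack_id: str) -> List[str]:
    """Build the argv list for resetting one stack's status.

    Assumes the stack ID is passed as one positional argument.
    """
    if not stack_id:
        raise ValueError("stack_id must be non-empty")
    return ["heat-manage", "reset_stack_status", stack_id]


if __name__ == "__main__":
    # Placeholder ID; substitute the ID of the affected stack.
    cmd = build_reset_command("STACK_ID_PLACEHOLDER")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

Printing rather than executing is deliberate: the reset marks in-progress resources as failed, so it is worth reviewing the target stack before running the command on the Heat host.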

Changed in heat:
status: In Progress → Fix Released
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/heat 7.0.0.0b1

This issue was fixed in the openstack/heat 7.0.0.0b1 development milestone.
