Baremetal instance stuck in BUILD state following ironic node tear down or delete
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Confirmed | Medium | Unassigned |
Bug Description
Description
===========
A baremetal (ironic) instance can become stuck in the BUILD state if the ironic node to which the instance has been assigned is either deleted or torn down manually while the instance is being built.
Steps to reproduce
==================
* Create a nova instance that will be scheduled onto baremetal.
* Determine to which node the instance has been scheduled via 'openstack baremetal node show --instance <instance UUID>'
* Wait for the ironic node to enter the 'wait call-back' state.
* Tear down the node manually via 'openstack baremetal node undeploy <node>'
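The last three steps can be scripted. The following is a rough sketch rather than a tested reproducer: it assumes the `openstack` CLI with the baremetal plugin is on the PATH and that credentials are already loaded into the environment. The helper names (`run_openstack`, `wait_for_state`, `undeploy_during_build`) are made up for illustration, and instance creation is left out because it depends on the local flavor, image and network.

```python
import subprocess
import time


def run_openstack(*args):
    """Run an openstack CLI command and return its stripped stdout."""
    result = subprocess.run(
        ["openstack", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def node_for_instance(instance_uuid, run=run_openstack):
    # Look up the ironic node by the instance UUID assigned to it.
    return run("baremetal", "node", "show", "--instance", instance_uuid,
               "-f", "value", "-c", "uuid")


def provision_state(node, run=run_openstack):
    return run("baremetal", "node", "show", node,
               "-f", "value", "-c", "provision_state")


def wait_for_state(node, target="wait call-back", timeout=600, interval=5,
                   run=run_openstack, sleep=time.sleep):
    """Poll the node until it reaches `target`; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if provision_state(node, run) == target:
            return True
        sleep(interval)
    return False


def undeploy_during_build(instance_uuid, run=run_openstack, sleep=time.sleep):
    """Tear the node down while nova is still building the instance."""
    node = node_for_instance(instance_uuid, run)
    if not wait_for_state(node, run=run, sleep=sleep):
        raise TimeoutError(f"node {node} never reached 'wait call-back'")
    run("baremetal", "node", "undeploy", node)
    return node
```

Calling `undeploy_during_build("<instance UUID>")` right after creating the instance should leave the node heading back to 'available' while the instance is still building.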
Expected results
================
The ironic node becomes 'available'. The nova instance detects the change in ironic, cleans up, and moves to an ERROR state.
Actual results
==============
The ironic node becomes 'available'. The nova instance detects the change in ironic, cleans up the instance's networks, and stays in the BUILD state.
Environment
===========
Pike, deployed using kolla-ansible on a CentOS host with RDO packages in CentOS containers.
openstack-
Thoughts
========
I believe this is happening because the nova ironic virt driver raises InstanceNotFound [1][2] when the ironic node is deleted or torn down. The nova compute manager [3] interprets this as meaning the Nova instance was deleted, and therefore does not change the instance's state as there should be no instance to change.
[1] https:/
[2] https:/
[3] https:/
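To make the suspected control flow concrete, here is a heavily simplified model of it. This is not nova's code: `FakeIronicDriver`, `Instance` and `build_instance` are hypothetical stand-ins, and only the exception name `InstanceNotFound` is taken from the description above.

```python
class InstanceNotFound(Exception):
    """Stand-in for nova's exception of the same name."""


class Instance:
    def __init__(self, uuid):
        self.uuid = uuid
        self.vm_state = "building"


class FakeIronicDriver:
    """Hypothetical driver whose node may lose its instance mid-deploy."""

    def __init__(self, node_still_has_instance):
        self.node_still_has_instance = node_still_has_instance

    def spawn(self, instance):
        if not self.node_still_has_instance:
            # The node was undeployed or deleted while nova was waiting on it.
            raise InstanceNotFound(instance.uuid)
        instance.vm_state = "active"


def build_instance(driver, instance):
    """Simplified model of the compute manager's build path."""
    try:
        driver.spawn(instance)
    except InstanceNotFound:
        # Read as "the instance was deleted", so the state is left untouched.
        return "instance disappeared during build"
    except Exception:
        instance.vm_state = "error"
        return "build failed"
    return "ok"


instance = Instance("abc")
outcome = build_instance(FakeIronicDriver(node_still_has_instance=False), instance)
print(outcome, instance.vm_state)
```

In this model the instance is still in the building state after the failed spawn, because the `InstanceNotFound` branch never touches `vm_state`. That matches the behaviour reported under "Actual results".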
It does seem like [2] should really be raising a build failure, like the previous step.
The difficulty is telling apart two cases: deleting an instance via the API (where we must make sure the instance correctly stays deleted) and a build failing because something else did the delete during the build.
Either way, being stuck in the building state is the worst possible outcome.
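One possible way to separate the two cases, sketched under assumptions: when the driver reports the instance as gone, check whether nova itself still considers the instance live. The `deleted` and `task_state` keys below mirror nova field names, but `resolve_missing_instance` and `build` are hypothetical helpers, not existing nova functions, and this is not a proposed patch.

```python
class InstanceNotFound(Exception):
    """Stand-in for nova's exception of the same name."""


def resolve_missing_instance(instance):
    """Decide what to do when the driver reports the instance as gone.

    `instance` is a plain dict here; the function itself is hypothetical.
    """
    if instance.get("deleted") or instance.get("task_state") == "deleting":
        # Deleted through the API: leave the record alone so it stays deleted.
        return instance
    # Still present in nova, so something else removed it from ironic.
    instance["vm_state"] = "error"
    instance["task_state"] = None
    return instance


def build(spawn, instance):
    """Run a spawn callable and settle the instance state afterwards."""
    try:
        spawn(instance)
    except InstanceNotFound:
        return resolve_missing_instance(instance)
    instance["vm_state"] = "active"
    instance["task_state"] = None
    return instance
```

With this split, a node torn down behind nova's back ends in an error state, while an instance that is being deleted through the API is left as it was.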