Having read through this, the bug looks valid to me. Setting to Medium because a user has to delete an ironic node out-of-band from nova while the nova instance is building in order to encounter the bug.
The analysis in the bug report makes sense: the ironic driver raises InstanceNotFound when it calls ironic after the baremetal node was deleted out-of-band of nova. Nova treats that as "nova instance not found" and thus thinks there's nothing to do with the instance state.
I do wonder if it would be correct for the ironic driver to instead raise InstanceDeployFailure (or another new exception such as IronicNodeNotFound) if an ironic node GET call returns 404. I can't think of a reason the ironic driver should raise InstanceNotFound unless it has deleted the nova instance itself.
This idea is based on looking at how we handle a delete via the nova API while a baremetal instance is building. While the instance is building in the driver spawn method, the _wait_for_active [1] method is looping. If a user requests a delete of the instance, the driver loop during build will see the task_state == DELETING and will raise InstanceDeployFailure as a result.
Then the compute manager, which doesn't handle the InstanceDeployFailure exception [2], raises RescheduledException. That RescheduledException is caught [3] and, when retries are exceeded, the networks/volumes are cleaned up and the instance is set to ERROR state.
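To make the difference between the two paths concrete, here is a rough toy model of the flow described above. It is only a sketch, not nova code: every class and function in it (NodeNotFound, get_node, wait_for_active_current, wait_for_active_proposed, build_instance, the dict-based instance) is a hypothetical stand-in, and rescheduling is reduced to a simple retry loop. It assumes a missing ironic node surfaces as a 404-style error on the node GET.

```python
# Toy model only: all names below are hypothetical stand-ins, not the real
# nova or ironic client classes.


class NodeNotFound(Exception):
    """Stand-in for a 404 returned by an ironic node GET."""


class InstanceNotFound(Exception):
    """Stand-in for nova's 'instance not found' exception."""


class InstanceDeployFailure(Exception):
    """Stand-in for the exception the driver raises on a failed deploy."""


class RescheduledException(Exception):
    """Stand-in for the exception that triggers a reschedule."""


def get_node(nodes, node_uuid):
    # Simulate the ironic node GET; a missing node behaves like a 404.
    try:
        return nodes[node_uuid]
    except KeyError:
        raise NodeNotFound(node_uuid)


def wait_for_active_current(nodes, node_uuid):
    # Behaviour described in the bug: the 404 becomes InstanceNotFound.
    try:
        return get_node(nodes, node_uuid)
    except NodeNotFound as exc:
        raise InstanceNotFound(str(exc)) from exc


def wait_for_active_proposed(nodes, node_uuid):
    # Suggested behaviour: the 404 becomes InstanceDeployFailure.
    try:
        return get_node(nodes, node_uuid)
    except NodeNotFound as exc:
        raise InstanceDeployFailure(str(exc)) from exc


def _build_once(spawn):
    # A deploy failure is not handled specially, so it turns into a reschedule.
    try:
        spawn()
    except InstanceDeployFailure as exc:
        raise RescheduledException(str(exc)) from exc


def build_instance(instance, spawn, max_attempts=2):
    # Heavily simplified manager: retry on reschedule, give up after max_attempts.
    for _ in range(max_attempts):
        try:
            _build_once(spawn)
        except InstanceNotFound:
            # Read as "the nova instance is gone": state is left untouched.
            return instance
        except RescheduledException:
            continue
        else:
            instance["vm_state"] = "active"
            return instance
    # Retries exceeded: clean up networks/volumes and mark the instance ERROR.
    instance["cleaned_up"] = True
    instance["vm_state"] = "error"
    return instance


def new_instance():
    return {"vm_state": "building", "cleaned_up": False}


nodes = {}  # the baremetal node was deleted out-of-band
stuck = build_instance(new_instance(), lambda: wait_for_active_current(nodes, "node-1"))
fixed = build_instance(new_instance(), lambda: wait_for_active_proposed(nodes, "node-1"))
print(stuck["vm_state"], fixed["vm_state"])
```

In this model the InstanceNotFound path leaves the instance in its building state with nothing cleaned up, while the InstanceDeployFailure path ends with cleanup and an ERROR state, which is the behaviour I'd expect for a node that vanished mid-build.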
[1] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/virt/ironic/driver.py#L466-L469
[2] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/compute/manager.py#L2199
[3] https://github.com/openstack/nova/blob/6bf11e1dc14afad78b11d980c2544a3dc41579ff/nova/compute/manager.py#L1932