In testing the patch I proposed, it seems to help, but we still seem to orphan some ports. That said, this does appear to be a case of resource exhaustion: a failed deploy can't be replaced because the number of instances requested exactly matches the number of available baremetal nodes. On the plus side, it is now failing only at that very last node, which is a positive sign for the patch in my mind. (tl;dr: I'm testing in an environment with BMC issues, which can cause deployments to baremetal nodes to fail quite reliably.) :|
Is the behavior to wait on deleting the failed instance tunable at all? That would seemingly address the issue we're encountering.