Frequent instances stuck in BUILD with no apparent failure
Bug #1854992 reported by Erik Olof Gunnar Andersson
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Incomplete | Undecided | Unassigned |
Bug Description
We are frequently seeing instances stuck indefinitely in BUILD with no error message. This seems to be triggered by high concurrency (e.g. building many instances at once with Terraform).
We have multiple synthetic instances that are built and destroyed every 10 minutes, and they never hit this issue.
This is running one commit behind the latest stable/rocky branch.
Right now we suspect that this is caused by issues with RabbitMQ. If a message is for some reason not delivered to the compute (e.g. because of broken bindings), the instance is never built and stays stuck in BUILD indefinitely.
Even worse, there is nothing in the logs, and nothing in the API indicates which compute has the issue. It would be helpful to set the compute on the instance before the message is sent to that compute, or maybe to change the RabbitMQ message to an RPC call, so that it times out.
Another idea would be to publish the message with the mandatory flag set.
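To make the three behaviours discussed here concrete, the following is a minimal sketch using an in-memory toy broker. It does not use RabbitMQ or nova's actual RPC layer; `ToyBroker`, `cast`, `call`, `compute_worker`, and the routing keys are all hypothetical names chosen for illustration. It shows a fire-and-forget message vanishing silently when no binding exists, a request/reply call surfacing the same problem as a timeout, and a mandatory publish being rejected. One simplification to keep in mind: a real AMQP broker returns an unroutable mandatory message to the publisher asynchronously, whereas the toy raises immediately.

```python
import queue
import threading


class Unroutable(Exception):
    """Raised by the toy broker when a mandatory message matches no binding."""


class ToyBroker:
    """In-memory stand-in for a message broker (assumption, not RabbitMQ)."""

    def __init__(self):
        self.bindings = {}  # routing key -> consumer queue

    def bind(self, routing_key):
        inbox = queue.Queue()
        self.bindings[routing_key] = inbox
        return inbox

    def publish(self, routing_key, message, mandatory=False):
        inbox = self.bindings.get(routing_key)
        if inbox is None:
            if mandatory:
                raise Unroutable(routing_key)
            return  # silently dropped, like a message hitting a broken binding
        inbox.put(message)


def cast(broker, routing_key, payload):
    """Fire-and-forget: the sender never learns whether anyone received it."""
    broker.publish(routing_key, {"payload": payload, "reply": None})


def call(broker, routing_key, payload, timeout):
    """Request/reply: wait for an answer and fail loudly if none arrives."""
    reply = queue.Queue()
    broker.publish(routing_key, {"payload": payload, "reply": reply})
    try:
        return reply.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"no reply from {routing_key} within {timeout}s") from None


def compute_worker(inbox):
    """Hypothetical compute service: acknowledge a single build request."""
    message = inbox.get()
    if message["reply"] is not None:
        message["reply"].put("accepted " + message["payload"])


if __name__ == "__main__":
    broker = ToyBroker()
    inbox = broker.bind("compute.host-a")
    threading.Thread(target=compute_worker, args=(inbox,), daemon=True).start()

    print(call(broker, "compute.host-a", "build vm-1", timeout=1.0))
    cast(broker, "compute.host-b", "build vm-2")  # no binding: lost, no error
    try:
        call(broker, "compute.host-b", "build vm-3", timeout=0.1)
    except TimeoutError as exc:
        print("call failed:", exc)
    try:
        broker.publish("compute.host-b", {"payload": "build vm-4"}, mandatory=True)
    except Unroutable as exc:
        print("unroutable:", exc)
```

In this sketch only the `cast` to `compute.host-b` fails without leaving any trace, which matches the symptom described above: the caller has no signal on which to move the instance out of BUILD.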