Failed instances stuck in BUILD state after Rocky upgrade
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Undecided | Mark Goddard |
kolla | Fix Released | High | Mark Goddard |
Rocky | Fix Committed | High | Mark Goddard |
Stein | Fix Released | High | Mark Goddard |
Train | Fix Released | High | Mark Goddard |
Bug Description
Steps to reproduce
==================
Starting with a cloud running the Queens release, upgrade to Rocky.
Create a flavor that cannot fit on any compute node, e.g.
openstack flavor create --ram 100000000 --disk 2147483647 --vcpus 10000 huge
Then create an instance using that flavor:
openstack server create huge --flavor huge --image cirros --network demo-net
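To watch where the instance lands, its status can be polled (a sketch; 'huge' is the server name from the create command above):
openstack server show huge -f value -c status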
Expected
========
The instance fails to boot and ends up in the ERROR state.
Actual
======
The instance fails to boot and gets stuck in the BUILD state.
From nova-conductor.log:
2019-06-12 15:00:24.443 6 ERROR oslo_messaging. [traceback truncated in the original report; many identically truncated ERROR lines collapsed here]
Recoverable fragments of the failing INSERT (apparently into the instance_extra table):
...ated_at, deleted_at, deleted, instance_uuid, device_metadata, numa_topology, pci_requests, flavor, vcpu_model, migration_context, keypairs, trusted_certs) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(instance_uuid)s, %(device_...
...'instance_uuid': u'df1bd38c-..., 'pci_requests': '[]', 'vcpu_model': None, 'device_metadata': None, 'created_at': datetime....
...': 0, 'migration_... (characters truncated) ..., "swap": 0, "rxtx_factor": 1.0, "is_public": true, "deleted_at": null, "vcpu_weight": 0, "id": 6, "name": "huge"}, "nova_object....
...(Background on this error at: http://... [URL truncated])
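The full traceback can be retrieved from the conductor log on disk (a sketch; /var/log/kolla/nova/ is assumed here as the kolla default log location):
grep 'ERROR oslo_messaging' /var/log/kolla/nova/nova-conductor.log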
Workaround
==========
On the controller, perform a nova DB sync:
docker exec -it nova_api nova-manage db sync
Although this made no changes to the database (verified with mysqldump), it appears to 'fix' nova: new instances created using the 'huge' flavor now go to the ERROR state as expected.
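To reproduce the mysqldump check, the schema can be dumped before and after the sync and compared (a sketch; the mariadb container name and the DB_PASS credential handling are assumptions for a kolla deployment):
docker exec mariadb mysqldump --no-data -u root -p"$DB_PASS" nova > before.sql
docker exec nova_api nova-manage db sync
docker exec mariadb mysqldump --no-data -u root -p"$DB_PASS" nova > after.sql
diff before.sql after.sql && echo "no schema changes"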
Changed in kolla-ansible:
  milestone: none → 8.0.0
affects: kolla-ansible → kolla
Changed in kolla:
  milestone: 8.0.0 → none
no longer affects: kolla-ansible/rocky
Changed in kolla:
  importance: Undecided → High
Some things to note:
- I'm pretty confident that the DB sync had been run using the Rocky nova-api container prior to the upgrade.
- The 'missing' trusted_certs column did exist in the instance_extra table in the nova DB prior to performing the workaround DB sync (see the check sketched below).
- No restart of services was necessary.
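One way to verify the column's presence directly (a sketch; the mariadb container name and credentials are assumptions for a kolla deployment):
docker exec mariadb mysql -u root -p"$DB_PASS" nova -e "SHOW COLUMNS FROM instance_extra LIKE 'trusted_certs';"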