reinstalling a compute node and then upgrading from pike to queens fails

Bug #1762368 reported by Junien Fridrick
This bug affects 1 person
Affects: Ubuntu Cloud Archive
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Hi,

I recently had a working xenial/pike cloud using neutron-ovs, with several compute nodes, including a ppc64 compute node named bagon. I needed to reinstall that node, so I did the following:

1. nova service-delete <id of the compute service on bagon>
2. neutron agent-delete <uuid of the openvswitch agent on bagon>
3. Re-commission the node and deploy the nova-compute application on it

Some time later, I upgraded the cloud to queens. This apparently caused the node to stop working. It logged the following error (nova-compute.log on bagon):

2018-04-09 06:25:26.099 128068 ERROR nova.scheduler.client.report [req-f1eebe14-fcfb-4878-b557-50105790d3b5 6bd667e324ea463abaacbc1f9c3bbed3 95cafd7ede504ef6b7b67ead691d3883 - default default] [req-29de76b9-50c2-4bff-85a9-363d665c250f] Failed to create resource provider record in placement API for UUID 2d236848-df06-47f1-92a4-a1afefe62931. Got 409: {"errors": [{"status": 409, "request_id": "req-29de76b9-50c2-4bff-85a9-363d665c250f", "detail": "There was a conflict when trying to complete your request.\n\n Conflicting resource provider name: bagon.fqdn already exists. ", "title": "Conflict"}]}.

Full stack trace: https://pastebin.canonical.com/p/ynhpgsB8bp/ (sorry, Canonical-only link)

I tracked the problem down to the following mismatch:

mysql> select uuid,host,deleted from compute_nodes where host='bagon';
+--------------------------------------+-------+---------+
| uuid                                 | host  | deleted |
+--------------------------------------+-------+---------+
| 2d236848-df06-47f1-92a4-a1afefe62931 | bagon |       0 |
| 92232041-9767-466b-a82f-20ecef0af6fa | bagon |       9 |
+--------------------------------------+-------+---------+
2 rows in set (0.00 sec)

mysql> use nova_api;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select uuid,name from resource_providers where name like 'bagon%';
+--------------------------------------+--------------------------+
| uuid                                 | name                     |
+--------------------------------------+--------------------------+
| 92232041-9767-466b-a82f-20ecef0af6fa | bagon.fqdn               |
+--------------------------------------+--------------------------+
1 row in set (0.00 sec)

The nova.compute_nodes table has 2 records for bagon, as expected: one is the old, deleted record and the other is the current, live record.

The problem, as shown above, is that the nova_api.resource_providers table still had the old UUID for bagon. This matches the 409 in the log: nova-compute tried to create a resource provider with the new UUID (2d236848-df06-47f1-92a4-a1afefe62931) under the name bagon.fqdn, which already existed under the old UUID. I'm not sure exactly when nova-compute on bagon started failing. I'm fairly confident it was OK after the reinstall, so I suspect something happened during the upgrade from pike to queens.
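The check done by hand in the two mysql sessions can be sketched as a single query. This is only an illustration under assumptions: it uses in-memory SQLite stand-ins holding just the columns shown in this report (the real tables live in the separate nova and nova_api MySQL databases), and the helper names make_demo_db and find_stale_providers are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build in-memory stand-ins for nova.compute_nodes and
    nova_api.resource_providers, filled with the rows from this report."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE compute_nodes (uuid TEXT, host TEXT, deleted INTEGER);
        CREATE TABLE resource_providers (uuid TEXT, name TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO compute_nodes VALUES (?, ?, ?)",
        [
            ("2d236848-df06-47f1-92a4-a1afefe62931", "bagon", 0),
            ("92232041-9767-466b-a82f-20ecef0af6fa", "bagon", 9),
        ],
    )
    conn.execute(
        "INSERT INTO resource_providers VALUES (?, ?)",
        ("92232041-9767-466b-a82f-20ecef0af6fa", "bagon.fqdn"),
    )
    return conn


def find_stale_providers(conn):
    """Return (host, live_uuid, provider_uuid) for every live compute node
    whose resource provider carries a different UUID."""
    query = """
        SELECT cn.host, cn.uuid, rp.uuid
        FROM compute_nodes AS cn
        JOIN resource_providers AS rp
          ON rp.name = cn.host OR rp.name LIKE cn.host || '.%'
        WHERE cn.deleted = 0      -- ignore the old, soft-deleted record
          AND cn.uuid != rp.uuid  -- provider still points at another UUID
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for host, live_uuid, provider_uuid in find_stale_providers(make_demo_db()):
        print(f"{host}: compute node {live_uuid}, provider {provider_uuid}")
```

Matching the provider name against the host plus a domain suffix mirrors the `like 'bagon%'` lookup above; whether that is the right join key on a real deployment is an assumption.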

I manually updated the UUID in the resource_providers table, and bagon started working fine.
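The exact statement is not recorded here, so the following is only a sketch of what such an update could look like. It runs against an in-memory SQLite stand-in with the two columns shown above, not against the real nova_api database, and the helper name repoint_provider is hypothetical.

```python
import sqlite3

OLD_UUID = "92232041-9767-466b-a82f-20ecef0af6fa"   # stale provider UUID
LIVE_UUID = "2d236848-df06-47f1-92a4-a1afefe62931"  # live compute node UUID


def repoint_provider(conn, provider_name, live_uuid):
    """Set the named provider's UUID to the live compute node UUID.

    Returns the number of rows changed, so the caller can confirm that
    exactly one provider was touched.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE resource_providers SET uuid = ? WHERE name = ?",
            (live_uuid, provider_name),
        )
    return cur.rowcount


# Stand-in for nova_api.resource_providers holding the stale row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource_providers (uuid TEXT, name TEXT)")
conn.execute(
    "INSERT INTO resource_providers VALUES (?, ?)", (OLD_UUID, "bagon.fqdn")
)

changed = repoint_provider(conn, "bagon.fqdn", LIVE_UUID)
current = conn.execute(
    "SELECT uuid FROM resource_providers WHERE name = ?", ("bagon.fqdn",)
).fetchone()[0]
```

The sketch only touches the columns that appear in this report; it makes no claim about other tables in a real deployment.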

I can't try to reproduce this, because I can't downgrade the cluster to run the pike to queens upgrade a second time, but hopefully you can.

Thanks!
