Compute node HA for ironic doesn't work due to the name duplication of Resource Provider
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ironic | Invalid | Critical | Dmitry Tantsur |
OpenStack Compute (nova) | Fix Released | High | John Garbutt |
Ocata | Fix Committed | High | Jay Pipes |
Pike | Fix Committed | High | Matt Riedemann |
Bug Description
Description
===========
In an environment with multiple compute nodes using the ironic driver,
when one compute node goes down, another compute node cannot take over its ironic nodes.
Steps to reproduce
==================
1. Start multiple compute nodes with the ironic driver.
2. Register one node with ironic.
3. Stop the compute node that manages the ironic node.
4. Create an instance.
Expected result
===============
The instance is created.
Actual result
=============
Instance creation fails.
Environment
===========
1. Exact version of OpenStack you are running.
openstack-
openstack-
python2-
openstack-
openstack-
openstack-
python-
openstack-
openstack-
openstack-
2. Which hypervisor did you use?
ironic
Details
=======
When a nova-compute goes down, another nova-compute takes over the ironic nodes managed by the failed nova-compute by rebalancing the hash ring. The active nova-compute then tries to create a new resource provider with a new ComputeNode object UUID and the hypervisor name (the ironic node UUID)[1][2][3]. This creation fails with a conflict (409) because a resource provider with the same name, created by the failed nova-compute, already exists.
When a new instance is requested, the scheduler gets only the old resource provider for the ironic node[4], so the ironic node is not selected:
WARNING nova.scheduler.
[1] https:/
[2] https:/
[3] https:/
[4] https:/
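The failure mode described in Details can be sketched with a toy model. This is only an illustration under stated assumptions, not nova or placement code: ToyPlacement, Conflict, and take_over are hypothetical names, and the only behaviour modelled is that placement rejects a new resource provider whose UUID or name is already in use.

```python
import uuid


class Conflict(Exception):
    """Hypothetical stand-in for an HTTP 409 response from placement."""
    status = 409


class ToyPlacement:
    """Hypothetical in-memory model: provider UUIDs and names must both be unique."""

    def __init__(self):
        self.by_uuid = {}  # resource provider UUID -> resource provider name

    def create_resource_provider(self, rp_uuid, name):
        if rp_uuid in self.by_uuid:
            raise Conflict(f"uuid {rp_uuid} already exists")
        if name in self.by_uuid.values():
            raise Conflict(f"name {name} already exists")
        self.by_uuid[rp_uuid] = name


def take_over(placement, ironic_node_uuid):
    """Mimic a nova-compute registering an ironic node it now manages.

    A fresh ComputeNode UUID is generated, while the provider name is the
    hypervisor name, which for ironic is the ironic node UUID.
    """
    compute_node_uuid = str(uuid.uuid4())
    placement.create_resource_provider(compute_node_uuid, ironic_node_uuid)
    return compute_node_uuid


placement = ToyPlacement()
ironic_node = str(uuid.uuid4())

# The first nova-compute registers the node, then goes down.
old_rp = take_over(placement, ironic_node)

# After the hash ring is rebalanced, a second nova-compute tries the same.
try:
    take_over(placement, ironic_node)
    result = "created"
except Conflict as exc:
    result = f"{exc.status}: {exc}"

print(result)
print(list(placement.by_uuid) == [old_rp])
```

In this sketch the second attempt is rejected on the name check even though its UUID is new, so the only provider left for the ironic node is the one created by the stopped service.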
tags: | added: ironic placement |
description: | updated |
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → High |
Changed in nova: | |
assignee: | John Garbutt (johngarbutt) → Dmitry Tantsur (divius) |
Changed in ironic: | |
status: | Triaged → In Progress |
assignee: | nobody → Dmitry Tantsur (divius) |
Changed in nova: | |
assignee: | Dmitry Tantsur (divius) → Matt Riedemann (mriedem) |
Changed in nova: | |
assignee: | Matt Riedemann (mriedem) → John Garbutt (johngarbutt) |
Changed in ironic: | |
status: | In Progress → Invalid |
This isn't the first time we've seen something like this. I wonder if we should think about what the impact would be if we removed the uniqueness requirement on the name field of a resource provider. It seems like it will inevitably cause problems as people/services start doing things with placement that span arbitrary boundaries (like time in this case) that matter to the client side, but are meaningless to placement.