Reverting migration-based allocations leaks allocations if the server is deleted
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Triaged | Medium | Unassigned |
Queens | New | Undecided | Unassigned |
Rocky | New | Undecided | Unassigned |
Stein | New | Undecided | Unassigned |
Train | New | Undecided | Unassigned |
Bug Description
This came up in the cross-cell resize review:
https:/
And I was able to recreate with a functional test here:
https:/
That test is doing a cross-cell cold migration but looking at the code:
We can hit an issue for same-cell resize/cold migrate if we have swapped the allocations so the source node allocations are held by the migration consumer and the instance holds allocations on the target node (created by the scheduler):
If something fails between ^ and the cast to prep_resize, the task will roll back and revert the allocations, so the target node allocations are dropped and the source node allocations are moved back to the instance:
However, if the instance was deleted by the time we perform that swap, the move_allocations method will recreate the allocations on the source node for the now-deleted instance, since we don't assert consumer generations during the swap:
This results in leaking allocations for the source node since the instance is deleted.
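To make the mechanics concrete, here is a minimal sketch of that swap, assuming a generic placement REST client; the helper names and structure are illustrative, not Nova's exact code (the real logic is in move_allocations in nova/scheduler/client/report.py):

```python
# Minimal sketch of the revert swap, assuming a simple placement REST
# client; names and structure are illustrative, not Nova's exact code.

def revert_allocations(placement, migration_uuid, instance_uuid):
    # The migration consumer currently holds the source node allocations.
    source = placement.get(
        '/allocations/%s' % migration_uuid, version='1.28').json()

    # Look up the instance to learn its consumer generation. If the
    # instance was deleted concurrently, placement returns no
    # 'consumer_generation' key (and no allocations) for it.
    target = placement.get(
        '/allocations/%s' % instance_uuid, version='1.28').json()

    # Swap in a single batch request: clear the migration consumer and
    # recreate the allocations under the instance consumer.
    payload = {
        migration_uuid: {
            'allocations': {},
            'project_id': source['project_id'],
            'user_id': source['user_id'],
            'consumer_generation': source['consumer_generation'],
        },
        instance_uuid: {
            'allocations': source['allocations'],
            'project_id': source['project_id'],
            'user_id': source['user_id'],
            # BUG: .get() yields None for a deleted instance, and in
            # microversion 1.28+ a null generation means "brand new
            # consumer, skip the generation check". The swap therefore
            # succeeds and recreates source node allocations owned by a
            # consumer nothing will ever clean up.
            'consumer_generation': target.get('consumer_generation'),
        },
    }
    placement.post('/allocations', payload, version='1.28')
```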
Changed in nova:
status: New → Triaged
summary: - MigrationTask rollback can leak allocations for a deleted server
         + Reverting migration-based allocations leaks allocations if the server is deleted

Changed in nova:
assignee: Matt Riedemann (mriedem) → nobody
status: In Progress → Triaged
Note that we could have the same issue in the compute service, for example if the server is deleted during the resize claim and we get to this exception handler block:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4724
To revert the allocations here:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4574
That path calls move_allocations, which has the same problem described above.
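For illustration, the compute-side revert reduces to the same move; a simplified sketch of its shape (not a verbatim copy of manager.py):

```python
# Simplified sketch of the compute manager's allocation revert
# (illustrative; see the manager.py links above for the real code).

def _revert_allocation(reportclient, context, instance, migration):
    # Move the source node allocations, currently held by the migration
    # record, back to the instance consumer. Because this goes through
    # move_allocations, it inherits the missing consumer-generation
    # check: a concurrently deleted instance gets its allocations
    # recreated instead of the move failing.
    return reportclient.move_allocations(
        context, migration.uuid, instance.uuid)
```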
We could also have the same issue if the instance is gone during resize_instance on the source host:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4896
Or during finish_revert_resize, I guess:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4454
I'm not sure about the fix yet, but we might want to let callers optionally tell the move_allocations method that it should require and assert the generation of the target consumer (the instance, in this case), so we don't just use a .get() here:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/scheduler/client/report.py#L1886
Actually, if the target consumer no longer exists in placement, the 'consumer_generation' key won't exist in that allocations response at all, and we'd have to handle it earlier, like this:
https://review.opendev.org/#/c/688832/2/nova/scheduler/client/report.py
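Putting those two ideas together, the change might look roughly like the sketch below; the require_target_exists parameter and the exception class are hypothetical, not the merged fix:

```python
# Sketch of a possible move_allocations signature change; the flag name
# and exception are hypothetical, not the merged fix.

class ConsumerGenerationConflict(Exception):
    """The target consumer is gone (or changed) in placement."""


def move_allocations(placement, source_uuid, target_uuid,
                     require_target_exists=False):
    source = placement.get(
        '/allocations/%s' % source_uuid, version='1.28').json()
    target = placement.get(
        '/allocations/%s' % target_uuid, version='1.28').json()

    # Handle the deleted-target case up front: when the consumer no
    # longer exists, the response has no 'consumer_generation' key at
    # all, so there is nothing to assert in the swap below.
    if require_target_exists and 'consumer_generation' not in target:
        raise ConsumerGenerationConflict(
            'target consumer %s no longer exists in placement'
            % target_uuid)

    payload = {
        source_uuid: {
            'allocations': {},
            'project_id': source['project_id'],
            'user_id': source['user_id'],
            'consumer_generation': source['consumer_generation'],
        },
        target_uuid: {
            'allocations': source['allocations'],
            'project_id': source['project_id'],
            'user_id': source['user_id'],
            # Callers that do not opt in keep today's behavior, where
            # None means "new consumer" and the check is skipped.
            'consumer_generation': target.get('consumer_generation'),
        },
    }
    resp = placement.post('/allocations', payload, version='1.28')
    return resp.status_code == 204
```

The revert paths above could then pass require_target_exists=True and, on conflict, delete the migration consumer's allocations outright instead of moving them to a consumer that no longer exists.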