Most instance actions can be called concurrently
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Fix Released
|
Low
|
Matthew Booth | ||
Queens |
Fix Committed
|
Low
|
Matthew Booth | ||
Rocky |
Fix Released
|
Low
|
Matthew Booth | ||
Stein |
Fix Released
|
Low
|
Matthew Booth |
Bug Description
A customer reported that they were getting DB corruption if they called shelve twice in quick succession on the same instance. This should be prevented by the guard in nova.API.shelve, which does:
instance.
instance.
This is intended to act as a robust gate against 2 instance actions happening concurrently. The first will set the task state to SHELVING, the second will fail because the task state is not SHELVING. The comparison is done atomically in db.instance_
However, instance.save() shortcuts if there is no update and does not call db.instance_
instance = get_instance()
=> Returned instance.task_state is None
instance.
instance.
=> task_state was None, now SHELVING, updates = {'task_state': SHELVING}
=> db.instance_
instance = get_instance()
=> Returned instance.task_state is SHELVING
instance.
instance.
=> task_state was SHELVING, still SHELVING, updates = {}
=> db.instance_
This pattern is common to almost all instance actions in nova api. A quick scan suggests that all of the following actions are affected by this bug, and can therefore all potentially be executed multiple times concurrently for the same instance:
restore
force_stop
start
backup
snapshot
soft reboot
hard reboot
rebuild
revert_resize
resize
shelve
shelve_offload
unshelve
pause
unpause
suspend
resume
rescue
unrescue
set_admin_password
live_migrate
evacuate
tags: | added: api |
tags: | added: race-condition |
Changed in nova: | |
importance: | Undecided → Low |
I intend to fix this by changing instance.save() such that it will:
1. return the expected InstanceUpdateC onflict directly without hitting the db if instance.task_state doesn't match expected_ task_state.
2. assert that task state is being changed if expected_task_state is set (this would be a programming error).
3. ditto for vm_state.
I believe unit test coverage in test_instance will be sufficient for this change.