2020-06-24 17:35:46 |
Dan Smith |
bug |
|
|
added bug |
2020-06-24 18:02:36 |
Dan Smith |
summary |
Interrupted copy-to-store may corrupt a subsequent operation |
Interrupted copy-to-store may break a subsequent operation or worse |
|
2020-06-24 18:06:51 |
Dan Smith |
description |
This is a hypothetical (but very possible) scenario that will result in a corrupted image stored by glance. I don't have code to reproduce it, but discussion seems to indicate that it is possible.
Scenario:
1. Upload an image to glance in one store; everything is good
2. Start an image_import(method='copy-to-store') to copy the image to another store
3. Power failure, network failure, or `killall -9 glance-api`
4. After the failure, re-request the copy-to-store
5. That glance worker will see the residue of the image in the staging directory, which is only partial because the process never finished, and will start uploading that to the new store
6. Upon completion, the image will appear in two stores, but one of them will be quietly corrupted |
Consider this scenario:
1. Upload an image to glance in one store; everything is good
2. Start an image_import(method='copy-to-store') to copy the image to another store
3. Power failure, network failure, or `killall -9 glance-api`
4. After the failure, re-request the copy-to-store
At this point, one of two cases will happen (we think) depending on the copy request:
5a. If all_stores_must_succeed=False, then we will see the partial staging residue and try to copy it to the store
6a. After we copy what was in the staging area to the new store, we will compare the size to that of the actual image, see that it is wrong and fail the operation
7a. The residue in the staging area will be deleted, but the storage on the backend will neither be updated in locations nor deleted, which is a LEAK (bad).
8a. The user could retry and it should succeed this time because the staging residue is gone, but the storage was leaked in the above step.
The other option is:
5b. If all_stores_must_succeed=True, then we will see the partial staging residue and try to copy it to the store
6b. After we copy what was in the staging area to the new store and compare the size, we will fail the operation
7b. We will not delete the residue from the staging dir, but _will_ delete the backend storage, avoiding the leak.
8b. The user will retry, which will repeat the same thing and fail again, over and over. |
|
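A minimal sketch of the two failure branches described in steps 5a-8b above. This is not Glance's actual import code; the staging/backend dictionaries and every function and variable name are assumptions, used only to illustrate the leak-versus-permanent-failure outcomes:

```python
# Illustrative sketch only -- not Glance's actual code. It models the two
# outcomes described in steps 5a-8b when a retried copy re-uses stale,
# partial residue from the staging directory. All names are assumptions.

class ImportFailed(Exception):
    pass

# Fake state: staging holds 3 of the image's 8 bytes after the crash,
# the new backend store is empty, and no extra location is recorded yet.
staging = {"img-1": b"abc"}
backend = {}
locations = set()


def retry_copy(image_id, image_size, store, all_stores_must_succeed):
    # The retry sees the residue and uploads it as if it were the full image.
    residue = staging[image_id]
    location = f"{store}/{image_id}"
    backend[location] = residue

    if len(residue) != image_size:  # the size check only happens after the upload
        if all_stores_must_succeed:
            # 5b-8b: the bad backend copy is deleted (no leak), but the
            # staging residue is kept, so every retry fails the same way.
            del backend[location]
            raise ImportFailed("size mismatch; residue kept, retries keep failing")
        # 5a-8a: the staging residue is deleted, but the partial backend
        # copy is neither added to locations nor deleted -- it is leaked.
        del staging[image_id]
        raise ImportFailed("size mismatch; partial backend copy leaked")

    locations.add(location)  # only a complete copy would be recorded


try:
    retry_copy("img-1", image_size=8, store="second-store",
               all_stores_must_succeed=False)
except ImportFailed as exc:
    print(exc)
    print("leaked object still in backend:", "second-store/img-1" in backend)
```

Run as-is, the all_stores_must_succeed=False case prints the 5a-8a outcome: the import fails, the staging residue is gone, but the partial copy is still sitting in the fake backend, i.e. leaked.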
2020-06-24 18:10:03 |
Abhishek Kekane |
description |
Consider this scenario:
1. Upload an image to glance in one store; everything is good
2. Start an image_import(method='copy-to-store') to copy the image to another store
3. Power failure, network failure, or `killall -9 glance-api`
4. After the failure, re-request the copy-to-store
At this point, one of two cases will happen (we think) depending on the copy request:
5a. If all_stores_must_succeed=False, then we will see the partial staging residue and try to copy it to the store
6a. After we copy what was in the staging area to the new store, we will compare the size to that of the actual image, see that it is wrong and fail the operation
7a. The residue in the staging area will be deleted, but the storage on the backend will neither be updated in locations nor deleted, which is a LEAK (bad).
8a. The user could retry and it should succeed this time because the staging residue is gone, but the storage was leaked in the above step.
The other option is:
5b. If all_stores_must_succeed=True, then we will see the partial staging residue and try to copy it to the store
6b. After we copy what was in the staging area to the new store and compare the size, we will fail the operation
7b. We will not delete the residue from the staging dir, but _will_ delete the backend storage, avoiding the leak.
8b. The user will retry, which will repeat the same thing and fail again, over and over. |
Consider this scenario:
1. Upload an image to glance in one store; everything is good
2. Start an image_import(method='copy-image') to copy the image to another store
3. Power failure, network failure, or `killall -9 glance-api`
4. After the failure, re-request the copy-to-store
At this point, one of two cases will happen (we think) depending on the copy request:
5a. If all_stores_must_succeed=False, then we will see the partial staging residue and try to copy it to the store
6a. After we copy what was in the staging area to the new store, we will compare the size to that of the actual image, see that it is wrong and fail the operation
7a. The residue in the staging area will be deleted, but the storage on the backend will neither be updated in locations nor deleted, which is a LEAK (bad).
8a. The user could retry and it should succeed this time because the staging residue is gone, but the storage was leaked in the above step.
The other option is:
5b. If all_stores_must_succeed=True, then we will see the partial staging residue and try to copy it to the store
6b. After we copy what was in the staging area to the new store and compare the size, we will fail the operation
7b. We will not delete the residue from the staging dir, but _will_ delete the backend storage, avoiding the leak.
8b. The user will retry, which will repeat the same thing and fail again, over and over. |
|
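For reference, a hedged sketch of the kind of request that kicks off step 2 of the scenario above, using the images v2 import call with the copy-image method. The endpoint, token, image ID, and store name are placeholders, not values taken from this bug:

```python
# Hypothetical request for step 2 of the scenario above: ask glance to copy
# an already-active image into a second store via the images v2 import API.
# The endpoint, token, image ID, and store name are all placeholders.
import requests

GLANCE = "http://glance.example.com"
IMAGE_ID = "11111111-2222-3333-4444-555555555555"

resp = requests.post(
    f"{GLANCE}/v2/images/{IMAGE_ID}/import",
    headers={
        "Content-Type": "application/json",
        "X-Auth-Token": "<token>",
    },
    json={
        "method": {"name": "copy-image"},
        "stores": ["second-store"],
        # Flip to True to see the 5b-8b behaviour instead of 5a-8a.
        "all_stores_must_succeed": False,
    },
)
print(resp.status_code)
```

Re-issuing the same request after an interruption (step 4) is what makes the worker pick up the partial staging residue.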
2020-06-24 18:10:18 |
Abhishek Kekane |
summary |
Interrupted copy-to-store may break a subsequent operation or worse |
Interrupted copy-image may break a subsequent operation or worse |
|
2020-06-24 18:37:00 |
Abhishek Kekane |
glance: importance |
Undecided |
High |
|
2020-06-25 16:22:53 |
Abhishek Kekane |
summary |
Interrupted copy-image may break a subsequent operation or worse |
Interrupted copy-image may break a subsequent operation |
|
2020-06-25 17:19:32 |
Erno Kuvaja |
nominated for series |
|
glance/ussuri |
|
2020-06-25 17:19:32 |
Erno Kuvaja |
bug task added |
|
glance/ussuri |
|
2020-06-25 17:19:32 |
Erno Kuvaja |
nominated for series |
|
glance/victoria |
|
2020-06-25 17:19:32 |
Erno Kuvaja |
bug task added |
|
glance/victoria |
|
2020-06-25 17:32:04 |
Erno Kuvaja |
glance/ussuri: importance |
Undecided |
High |
|
2020-07-06 05:39:06 |
OpenStack Infra |
glance: status |
In Progress |
Fix Released |
|
2020-07-06 16:38:05 |
OpenStack Infra |
glance/ussuri: status |
In Progress |
Fix Committed |
|