Different reclaim ages for accounts and containers can result in un-reclaimable containers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | In Progress | Undecided | Donagh McCabe |
Bug Description
We have a system where the reclaim_age on accounts was lower than the reclaim_age on containers. We have found a number of container databases that have not been reclaimed. In all cases:
- There is a single copy of the container database (i.e., the other two copies have been reclaimed)
- There are no account databases (i.e. all reclaimed)
- The reported_
It is thought you can get into this state as follows:
1/ Account is deleted
2/ The objects and then containers are deleted. Everything is in expected
states -- specifically, the container's reported_
is the same as delete_timestamp.
3/ Container server A reaches reclaim age: deletes the container
database.
4/ Server B or C (assuming 3 replicas) runs container-
restores the database to server A
5/ The account database has already been reclaimed (see below).
6/ The container-updater on server A cannot push the container data to the
account so the reported_
7/ The reclaim age on server B and C is reached so they delete their copies
The race in steps 3 and 4 does not always occur (because step 7 happens), so not all containers are left behind. This race may happen a lot on a normal system (where "normal" means the same reclaim_age), but will not be noticed because the container-
The effect of the un-reclaimed container database is twofold:
- Takes up disk space
- The account-server continues to get updates from the container-
We are debugging a proposed solution whereby we reclaim after reclaim_age * 4, i.e., if the container-updater cannot push stats to the account after a month of trying, it is probably never going to work, so we might as well give up.
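A minimal sketch of that give-up rule, not the actual patch. The function name, the `reported_to_account` flag and the `give_up_factor` parameter are hypothetical, and the one-week reclaim_age is an assumption chosen so that a factor of 4 comes out at roughly a month:

```python
WEEK = 7 * 24 * 3600  # assumed reclaim_age in seconds (hypothetical value)


def should_reclaim(delete_timestamp, reported_to_account, now,
                   reclaim_age=WEEK, give_up_factor=4):
    """Decide whether a deleted container database may be removed.

    reported_to_account is a hypothetical flag meaning the container's
    deletion has been pushed to the account successfully.
    """
    age = now - delete_timestamp
    if age <= reclaim_age:
        # Too young to reclaim under any rule.
        return False
    if reported_to_account:
        # Normal case: the account knows about the deletion.
        return True
    # The account never acknowledged the update: give up only once
    # the container is older than reclaim_age * give_up_factor.
    return age > reclaim_age * give_up_factor
```

With these assumptions, a container whose update never reaches the account is kept for four reclaim periods and then removed, instead of being kept forever.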
Changed in swift: | |
assignee: | nobody → Donagh McCabe (donagh-mccabe) |
status: | New → In Progress |
I don't understand. Why aren't the container's reported_ timestamps getting updated during the reclaim period? The account server should respond 2XX to PUT container requests with x-account-override-deleted. Is the container-updater not able to cycle within the account's reclaim age?
If we're trying to handle the container-updater hitting an account that has already been reclaimed, then the account server needs to catch the exception coming out of account_broker.put_container and handle it. I think it should be the job of the container-updater to recognize when it is the right time to "give up" on updating the account. For example, perhaps when the majority of account servers respond 404 (or whatever response is returned from the exception handler for put_container), broker.is_deleted is true, and time() - delete_timestamp > reclaim_age.
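The condition suggested above could look roughly like this. It is a sketch under assumptions, not container-updater code: `should_give_up` and its parameters are hypothetical names, and 404 stands in for whatever status the put_container exception handler would return.

```python
import time


def should_give_up(account_statuses, is_deleted, delete_timestamp,
                   reclaim_age, now=None):
    """Return True when the updater should stop retrying the account.

    account_statuses: HTTP status codes returned by the account
    servers for this container update (hypothetical input).
    """
    now = time.time() if now is None else now
    not_found = sum(1 for status in account_statuses if status == 404)
    # Strict majority of account servers report the account as gone.
    majority_gone = not_found > len(account_statuses) // 2
    past_reclaim = (now - delete_timestamp) > reclaim_age
    return majority_gone and is_deleted and past_reclaim
```

For example, `should_give_up([404, 404, 503], True, 0, 100, now=200)` is true, while a single 404 out of three responses is not enough to give up.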