RSYNC: Probable race condition in replication/reconstruction can lead to loss of datafile
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Confirmed | High | Unassigned |
Bug Description
Related bug: 1897177
During a rebalance, a reproducible scenario was found with SSYNC, for both the replicator and the reconstructor, that leads to datafile loss (see bug 1897177). As the workflow is similar, it is highly probable that the same bug applies to RSYNC replication.
The scenario is as follows (revert means replicate a handoff partition and then delete it); a toy sketch of this interleaving follows the list:
1. A gets and reloads the new ring
2. A starts to revert the partition P to node B
3. B (with the old ring) starts to revert the (partial) partition P to node A
=> replication should be fast as all objects are already on node A
4. B finishes replication of the (partial) partition P to node A
5. B removes the (partial) partition P after the revert succeeded
6. A finishes revert of partition P to node B
7. A removes the partition P
8. B gets and reloads the new ring
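To make the interleaving concrete, here is a toy, purely sequential Python sketch of the race. The Node class, sync_to and the object names are invented for illustration and are not Swift code; it only shows why both copies end up deleted once each node's revert "succeeds" before the other node removes its partition:

    # Toy model of the interleaving above (illustrative names only, not Swift
    # APIs). Each node holds a copy of partition P; "revert" means sync the
    # partition to the node the local ring says owns it, then delete the local
    # copy once the sync reports success.

    class Node:
        def __init__(self, name, objects):
            self.name = name
            self.partition = set(objects)    # datafiles in partition P

        def sync_to(self, target):
            # rsync/ssync push: send whatever the target is missing; this is
            # near-instant when the target already has every object.
            target.partition |= self.partition

        def delete_partition(self):
            # handoff cleanup once the revert is reported as successful
            self.partition.clear()

    objs = {'obj1.data', 'obj2.data'}
    a = Node('A', objs)   # A has already reloaded the new ring: P belongs to B
    b = Node('B', objs)   # B still uses the old ring: P belongs to A

    # Interleaving from the steps above:
    a.sync_to(b)           # 2. A reverts P to B (B already has everything)
    b.sync_to(a)           # 3. B reverts its (partial) P to A under the old ring
    b.delete_partition()   # 4./5. B's revert succeeded, so B drops its copy of P
    a.delete_partition()   # 6./7. A's revert also succeeded, so A drops its copy

    print(a.partition, b.partition)   # set() set() -- the datafiles are gone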
The repro is a little more involved than the reconstructor case, but yeah, this can definitely happen. First up, I hacked up handoffs_first to be handoffs_only:
diff --git a/swift/obj/replicator.py b/swift/obj/replicator.py
index dcab26fe1..a8891124c 100644
--- a/swift/obj/replicator.py
+++ b/swift/obj/replicator.py
@@ -917,7 +918,8 @@ class ObjectReplicator(Daemon):
         random.shuffle(jobs)
         if self.handoffs_first:
             # Move the handoff parts to the front of the list
-            jobs.sort(key=lambda job: not job['delete'])
+            jobs = [job for job in jobs if job['delete']]
         self.job_count = len(jobs)
         return jobs
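Note that the hacked code above is still guarded by handoffs_first, which is an existing option in the [object-replicator] section of object-server.conf, so that option still needs to be enabled for the hack to kick in. A minimal sketch (section and option names as in a stock object-server.conf; the exact config layout is deployment-specific):

    [object-replicator]
    handoffs_first = True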
Then widened the race:
diff --git a/swift/obj/replicator.py b/swift/obj/replicator.py
index dcab26fe1..a8891124c 100644
--- a/swift/obj/replicator.py
+++ b/swift/obj/replicator.py
@@ -587,6 +587,7 @@ class ObjectReplicator(Daemon):
             self.logger.timing_since('partition.delete.timing', begin)

     def delete_partition(self, path):
         self.logger.info(_("Removing partition: %s"), path)
+        time.sleep(10)
         try:
             tpool.execute(shutil.rmtree, path)
Letting the replicators run for a bit, I'm down to only two data files.