Race condition on erasure coded ring
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | New | Undecided | Unassigned |
Bug Description
I'm using Swift v2.7 with an erasure-coded ring.
Today I saw new error messages in the object-server logs that I've never seen before:
Jun 29 12:15:27 STACO2 object-server: Problem writing durable state file /srv/node/
Jun 29 12:15:27 STACO2 object-server: ERROR __call__ error with PUT /s02z2ecd03/
It happened around 10 times during the last 3 hours, on different ACO servers and different disks.
However, every time, the directory and files exist:
root@STACO2:~# ls -al /srv/node/
-rw------- 1 swift swift 389196 Jun 29 12:15 /srv/node/
-rw-r--r-- 1 swift swift 0 Jun 29 12:15 /srv/node/
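As far as I understand Swift 2.7's EC on-disk layout, a fragment archive (`<timestamp>#<frag_index>.data`) only counts as valid once a zero-byte `<timestamp>.durable` marker exists beside it, which would match the 389196-byte file and the empty file above. A minimal sketch of that check (the helper name is illustrative, not Swift's actual code):

```python
import os

def has_durable_fragment(hash_dir, timestamp):
    # Illustrative check: the fragment data and the zero-byte
    # .durable marker must BOTH exist for the object to be durable.
    durable = os.path.exists(os.path.join(hash_dir, timestamp + '.durable'))
    data = any(f.startswith(timestamp) and f.endswith('.data')
               for f in os.listdir(hash_dir))
    return data and durable
```

In the window between writing the `.data` file and the `.durable` marker, the fragment looks invalid to any other process that inspects the directory.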
At the same time, I have this log from another ACO server:
Jun 29 12:15:27 STACO4 object-
I also saw varying failure and success values in other logs, such as:
Jun 29 14:45:52 STACO3 object-
Also, I don't know if it could be related, but today I saw other logs that I've never seen before:
Jun 29 08:11:00 STACO4 object-server: ERROR WSGI: code 400, message Bad request syntax ('0')
Jun 29 08:25:00 STACO1 object-server: ERROR WSGI: code 400, message Bad request syntax ('0')
Jun 29 08:28:00 STACO3 object-server: ERROR container update failed with 10.10.1.
Jun 29 08:32:10 STACO4 object-server: ERROR container update failed with 10.10.1.
Jun 29 08:33:10 STACO3 object-server: ERROR WSGI: code 400, message Bad request syntax ('0')
Jun 29 08:38:10 STACO4 object-server: ERROR container update failed with 10.10.1.
Another thing that may be worth mentioning: last night a disk on this ring failed (disk id 19) and was automatically unmounted by the swift-drive-audit script.
Here is the ring definition:
262144 partitions, 12.000000 replicas, 1 regions, 4 zones, 28 devices, 0.93 balance, 20.40 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file /home/joomeo/
Devices: id region zone ip address port replication ip replication port name weight partitions balance flags meta
0 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd01 4000.00 114039 0.15
1 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd02 4000.00 114039 0.15
2 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd03 4000.00 114039 0.15
3 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd04 4000.00 114038 0.15
4 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd05 4000.00 114039 0.15
5 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd06 4000.00 114038 0.15
6 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd01 4000.00 114045 0.15
7 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd02 4000.00 114045 0.15
8 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd03 4000.00 114044 0.15
9 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd04 4000.00 114043 0.15
10 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd05 4000.00 114043 0.15
11 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd06 4000.00 114043 0.15
12 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd01 4000.00 114043 0.15
13 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd02 4000.00 114042 0.15
14 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd03 4000.00 114043 0.15
15 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd04 4000.00 114042 0.15
16 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd05 4000.00 114042 0.15
17 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd06 4000.00 114042 0.15
18 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd01 4000.00 113903 0.03
19 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd02 4000.00 113904 0.03
20 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd03 4000.00 113904 0.03
21 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd04 4000.00 113903 0.03
22 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd05 4000.00 113903 0.03
23 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd06 4000.00 113903 0.03
24 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd07 3900.00 110029 -0.90
25 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd07 3900.00 109998 -0.93
26 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd07 3900.00 110006 -0.92
27 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd07 2800.00 79526 -0.23
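For reference, the 12 replicas on this EC ring correspond to ec_num_data_fragments + ec_num_parity_fragments = 12 in the storage policy. The actual split isn't shown above; a swift.conf policy of roughly this shape (the 9+3 split and the ec_type are placeholders, not my real values) produces such a ring:

```
[storage-policy:1]
name = ec12
policy_type = erasure_coding
ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 9
ec_num_parity_fragments = 3
ec_object_segment_size = 1048576
```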
I can confirm this behavior.
It seems to happen only on processes using the replication port. The process stops working after logging this error.
My theory: a reconstructor is moving/reconstructing a fragment. In the meantime, another process (the auditor, the object server, ...) considers the fragment invalid because it has no durable file yet. Since it is not a fresh object (its timestamp is older than reclaim_age), the object is cleaned up before the durable file has been written.
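To make the suspected interleaving concrete, here is a simplified sketch (stand-in functions and names, not Swift's actual code paths):

```python
import os
import tempfile
import time

RECLAIM_AGE = 604800  # Swift's default reclaim_age: one week, in seconds

def write_fragment(hash_dir, timestamp, frag_index, body):
    # Step 1: the reconstructor writes the fragment data.
    path = os.path.join(hash_dir, '%s#%d.data' % (timestamp, frag_index))
    with open(path, 'wb') as f:
        f.write(body)
    # -- race window: .data exists, .durable does not --

def mark_durable(hash_dir, timestamp):
    # Step 2: write the zero-byte durable marker.  If a cleaner removed
    # hash_dir during the window above, this open() fails, which would
    # surface as "Problem writing durable state file".
    open(os.path.join(hash_dir, timestamp + '.durable'), 'wb').close()

def reclaim_if_garbage(hash_dir, timestamp, now=None):
    # A fragment without a durable marker whose *object* timestamp is
    # older than reclaim_age looks like garbage, even though the file
    # itself may have been written seconds ago.
    now = now if now is not None else time.time()
    durable = os.path.join(hash_dir, timestamp + '.durable')
    if not os.path.exists(durable) and float(timestamp) < now - RECLAIM_AGE:
        for name in os.listdir(hash_dir):
            os.unlink(os.path.join(hash_dir, name))
        os.rmdir(hash_dir)

# Demonstrate the interleaving on an old object (created > reclaim_age ago):
d = tempfile.mkdtemp()
ts = '%.5f' % (time.time() - 2 * RECLAIM_AGE)
write_fragment(d, ts, 3, b'fragment bytes')
reclaim_if_garbage(d, ts)       # cleaner runs inside the race window
try:
    mark_durable(d, ts)         # fails: hash_dir is already gone
except OSError as e:
    print('Problem writing durable state file:', e)
```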
Checking the mtime before cleaning greatly reduces the frequency of errors, but there are other race conditions I haven't figured out; see the sketch below.
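Concretely, the mtime check I mean looks something like this (illustrative, not an actual patch):

```python
import os
import time

def safe_to_reclaim(path, reclaim_age):
    # Compare the file's mtime (when it was written on *this* disk)
    # instead of the object timestamp (when the object was created),
    # so a freshly reconstructed fragment of an old object survives.
    try:
        return os.path.getmtime(path) < time.time() - reclaim_age
    except OSError:
        return False  # already gone, nothing to reclaim
```

This narrows the window but does not close it entirely, which matches what I observed.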