Race condition on erasure coded ring

Bug #1597429 reported by Admin6
This bug affects 1 person

Affects: OpenStack Object Storage (swift)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm using Swift v2.7 with an erasure coded ring.

Today I saw new error messages in the object-server logs that I had never seen before:

Jun 29 12:15:27 STACO2 object-server: Problem writing durable state file /srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690.44522.durable ([Errno 2] No such file or directory: '/srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690.44522.durable'):
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 2517, in _finalize_durable
    with open(durable_file_path, 'w') as _fp:
IOError: [Errno 2] No such file or directory: '/srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690.44522.durable'

Jun 29 12:15:27 STACO2 object-server: ERROR __call__ error with PUT /s02z2ecd03/237162/AUTH_Joomeo/20160602_O0/53848/11959730/original.JPG :
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 1032, in __call__
    res = method(req)
  File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1412, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 764, in PUT
    writer.commit(request.timestamp)
  File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 2555, in commit
    self._finalize_durable, durable_file_path)
  File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 3342, in force_run_in_thread
    return self._run_in_eventlet_tpool(func, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 3322, in _run_in_eventlet_tpool
    raise result
DiskFileError: Problem writing durable state file /srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690.44522.durable ([Errno 2] No such file or directory: '/srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690.44522.durable')
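For context, the failing call in the traceback is the EC commit step: once all the fragment data has been flushed, the object server writes a zero-byte <timestamp>.durable marker into the object's hash directory (the 0-byte .durable file in the listing below is that marker). Here is a minimal sketch of what that step amounts to, not Swift's actual code, with the path and timestamp from the log reused purely as an example:

import os

def finalize_durable(hash_dir, timestamp):
    # Write a zero-byte <timestamp>.durable marker next to the
    # <timestamp>#<frag_index>.data file, then fsync the file and directory.
    durable_path = os.path.join(hash_dir, '%s.durable' % timestamp)
    with open(durable_path, 'wb') as fp:
        # This open() is what raises ENOENT in the log above if the hash
        # directory has been removed concurrently.
        fp.flush()
        os.fsync(fp.fileno())
    dirfd = os.open(hash_dir, os.O_RDONLY)
    try:
        os.fsync(dirfd)  # make the new directory entry durable
    finally:
        os.close(dirfd)

# Example call, using the values from the log:
# finalize_durable(
#     '/srv/node/s02z2ecd03/objects-1/237162/8d4/'
#     'e79a86e8d6263576adb7ec7aa9ff08d4',
#     '1466543690.44522')

In other words, the IOError means the hash directory itself disappeared between the .data file being written and this marker being created.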

It happened around 10 times during the last 3 hours on different ACO servers and different disks.

However, every time, the directory and the files exist:

root@STACO2:~# ls -al /srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690*
-rw------- 1 swift swift 389196 Jun 29 12:15 /srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690.44522#7.data
-rw-r--r-- 1 swift swift 0 Jun 29 12:15 /srv/node/s02z2ecd03/objects-1/237162/8d4/e79a86e8d6263576adb7ec7aa9ff08d4/1466543690.44522.durable

At the same time, I got this log entry from another ACO server:

Jun 29 12:15:27 STACO4 object-reconstructor: 10.10.2.52:6000/s02z2ecd03/237162 Unexpected response: ":ERROR: 500 'ERROR: With :UPDATES: 1 failures to 14 successes'"

I also saw different failure and success counts in other log entries, for example:

Jun 29 14:45:52 STACO3 object-reconstructor: 10.10.2.54:6000/s04z4ecd04/56589 Unexpected response: ":ERROR: 500 'ERROR: With :UPDATES: 8 failures to 3 successes'"

I don't know whether this is related, but today I also got other log messages I had never seen before:

Jun 29 08:11:00 STACO4 object-server: ERROR WSGI: code 400, message Bad request syntax ('0')
Jun 29 08:25:00 STACO1 object-server: ERROR WSGI: code 400, message Bad request syntax ('0')
Jun 29 08:28:00 STACO3 object-server: ERROR container update failed with 10.10.1.54:6001/s04z5cad01 (saving for async update later): ConnectionTimeout (0.5s) (txn: tx07870f7df4694a4b82bf4-0057736a6f)
Jun 29 08:32:10 STACO4 object-server: ERROR container update failed with 10.10.1.53:6001/s03z6cad01 (saving for async update later): ConnectionTimeout (0.5s) (txn: tx93f8bc2ad5cb4ce8aa5d2-0057736b6a)
Jun 29 08:33:10 STACO3 object-server: ERROR WSGI: code 400, message Bad request syntax ('0')
Jun 29 08:38:10 STACO4 object-server: ERROR container update failed with 10.10.1.53:6001/s03z3cad01 (saving for async update later): ConnectionTimeout (0.5s) (txn: tx1ac192d4de2243dd8e42d-0057736cd2)

Another thing that may be worth mentioning: last night a disk in this ring (device id 19) failed and was automatically unmounted by the swift-drive-audit script.

Here is the ring definition:
262144 partitions, 12.000000 replicas, 1 regions, 4 zones, 28 devices, 0.93 balance, 20.40 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file /home/joomeo/backup/config/SWIFT/etc/swift/object-1.ring.gz is up-to-date
Devices: id region zone ip address port replication ip replication port name weight partitions balance flags meta
             0 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd01 4000.00 114039 0.15
             1 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd02 4000.00 114039 0.15
             2 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd03 4000.00 114039 0.15
             3 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd04 4000.00 114038 0.15
             4 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd05 4000.00 114039 0.15
             5 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd06 4000.00 114038 0.15
             6 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd01 4000.00 114045 0.15
             7 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd02 4000.00 114045 0.15
             8 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd03 4000.00 114044 0.15
             9 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd04 4000.00 114043 0.15
            10 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd05 4000.00 114043 0.15
            11 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd06 4000.00 114043 0.15
            12 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd01 4000.00 114043 0.15
            13 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd02 4000.00 114042 0.15
            14 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd03 4000.00 114043 0.15
            15 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd04 4000.00 114042 0.15
            16 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd05 4000.00 114042 0.15
            17 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd06 4000.00 114042 0.15
            18 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd01 4000.00 113903 0.03
            19 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd02 4000.00 113904 0.03
            20 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd03 4000.00 113904 0.03
            21 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd04 4000.00 113903 0.03
            22 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd05 4000.00 113903 0.03
            23 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd06 4000.00 113903 0.03
            24 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1ecd07 3900.00 110029 -0.90
            25 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2ecd07 3900.00 109998 -0.93
            26 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3ecd07 3900.00 110006 -0.92
            27 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4ecd07 2800.00 79526 -0.23

Romain LE DISEZ (rledisez) wrote:

I can confirm this behavior.

It seems to happen only on processes bound to the replication port. The process stops working after logging this error.

My theory: a reconstructor is moving/reconstructing a fragment. In the meantime, another process (auditor, object server, ...) considers the fragment invalid because it has no durable file yet. Since the object is not fresh (its timestamp is older than reclaim_age), the fragment is cleaned up before the durable file has been written.

Checking the mtime before cleaning up greatly reduces the frequency of the errors, but there are other race conditions I haven't figured out yet.
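As an illustration of the mtime guard described above (a sketch under the assumption that cleanup decisions are made per hash directory; this is not the actual Swift code or a proposed patch), a reclaim routine could skip any hash directory whose contents were modified more recently than reclaim_age, giving an in-flight PUT or reconstruction time to drop its .durable marker:

import os
import time

def safe_to_reclaim(hash_dir, reclaim_age=604800):
    # Only allow cleanup if nothing in hash_dir was touched within
    # reclaim_age seconds (604800 s = one week, Swift's default).
    cutoff = time.time() - reclaim_age
    try:
        names = os.listdir(hash_dir)
    except OSError:
        return False  # directory already gone; nothing to reclaim
    for name in names:
        try:
            if os.path.getmtime(os.path.join(hash_dir, name)) > cutoff:
                return False  # recent activity: another process may be writing
        except OSError:
            return False  # a file vanished under us; be conservative
    return True

As noted above, this only narrows the window; it does not close the race completely.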

Alistair Coles (alistair-coles) wrote: