Auditor logs "No such file" errors when renaming corrupt EC fragments

Bug #1639244 reported by Alistair Coles
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
New
Undecided
Unassigned

Bug Description

admin6 reported in irc [1]:

"thanks for the backport to mitaka for checking EC fragments metadata. I applied it, and it seems to work, but sometimes, I have errors on os.rename with no such file or directory : "

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-swift/%23openstack-swift.2016-11-04.log.html#t2016-11-04T09:59:42

logs copied from http://paste.openstack.org/show/587879/ :

Nov 4 11:26:31 STACO4 object-auditor: Quarantined object /srv/node/s04z4ecd05/objects-1/175477/a2d/ab5d60b03afec2e9a0861b6817f14a2d/1470560571.46381#2.data: Invalid EC metadata at offset 0x0
Nov 4 11:26:31 STACO4 object-auditor: ERROR Object /srv/node/s04z4ecd05/objects-1/175477/a2d/ab5d60b03afec2e9a0861b6817f14a2d failed audit and was quarantined: Invalid EC metadata at offset 0x0
Nov 4 11:26:31 STACO4 object-auditor: ERROR DiskFile /srv/node/s04z4ecd05/objects-1/175477/a2d/ab5d60b03afec2e9a0861b6817f14a2d/1470560571.46381#2.data close failure: [Errno 2] No such file or directory : File "/usr/bin/swift-object-auditor", line 29, in <module>#012 run_daemon(ObjectAuditor, conf_file, **options)#012 File "/usr/lib/python2.7/dist-packages/swift/common/daemon.py", line 110, in run_daemon#012 klass(conf).run(once=once, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/common/daemon.py", line 57, in run#012 self.run_forever(**kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 393, in run_forever#012 self.audit_loop(parent, zbo_fps, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 367, in audit_loop#012 pids.append(self.fork_child(**kwargs))#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 314, in fork_child#012 self.run_audit(**kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 302, in run_audit#012 worker.audit_all_objects(mode=mode, device_dirs=device_dirs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 129, in audit_all_objects#012 self.failsafe_object_audit(location)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 212, in failsafe_object_audit#012 self.object_audit(location)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 244, in object_audit#012 for chunk in reader:#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1600, in __iter__#012 self.close()#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1800, in close#012 {'exc': e, 'stack': ''.join(traceback.format_stack()),#012
Nov 4 11:27:09 STACO4 object-auditor: Quarantined object /srv/node/s04z4ecd03/objects-1/182607/735/b253de29d4eade286b440605b3243735/1468760857.79624#5.data: Invalid EC metadata at offset 0x1c76e
Nov 4 11:27:09 STACO4 object-auditor: ERROR Object /srv/node/s04z4ecd03/objects-1/182607/735/b253de29d4eade286b440605b3243735 failed audit and was quarantined: Invalid EC metadata at offset 0x1c76e
Nov 4 11:27:58 STACO4 object-auditor: Quarantined object /srv/node/s04z4ecd05/objects-1/175696/e8c/ab941e0c6512bd5ef51294ac96a6ee8c/1473894901.93179#7.data: Invalid EC metadata at offset 0x0
Nov 4 11:27:58 STACO4 object-auditor: ERROR Object /srv/node/s04z4ecd05/objects-1/175696/e8c/ab941e0c6512bd5ef51294ac96a6ee8c failed audit and was quarantined: Invalid EC metadata at offset 0x0
Nov 4 11:27:58 STACO4 object-auditor: ERROR Trying to audit /srv/node/s04z4ecd05/objects-1/175696/e8c/ab941e0c6512bd5ef51294ac96a6ee8c: #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 212, in failsafe_object_audit#012 self.object_audit(location)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/auditor.py", line 244, in object_audit#012 for chunk in reader:#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1586, in __iter__#012 self._update_checks(chunk)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 2581, in _update_checks#012 self._check_frag(self.frag_buf[cursor:cursor + self.frag_size])#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 2562, in _check_frag#012 self._quarantine(msg)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1762, in _quarantine#012 self._data_file)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 3295, in run_in_thread#012 result = func(*args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 455, in wrapper#012 return f(*args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 220, in quarantine_renamer#012 renamer(from_dir, to_dir, fsync=False)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1094, in renamer#012 os.rename(old, new)#012OSError: [Errno 2] No such file or directory
Nov 4 11:29:00 STACO4 object-auditor: Quarantined object /srv/node/s04z4ecd04/objects-1/175123/d61/ab04ee3d5ed5eb9ea95dde279c595d61/1471680446.72800#1.data: Invalid EC metadata at offset 0x0
Nov 4 11:29:00 STACO4 object-auditor: ERROR Object /srv/node/s04z4ecd04/objects-1/175123/d61/ab04ee3d5ed5eb9ea95dde279c595d61 failed audit and was quarantined: Invalid EC metadata at offset 0x0
```

Tags: ec
Revision history for this message
Alistair Coles (alistair-coles) wrote :

better formatted logs here http://paste.openstack.org/show/587879/

Revision history for this message
Alistair Coles (alistair-coles) wrote :

admin6 reported that the frags referred to in the logs do appear in quarantine directories.

Some of the reported error logs are raised from the ECDiskFileReader close() method. Some are raised from check_frags called during the reader iter.

Kota and I suspect that two calls are being made to the quarantine renamer, the second of which would fail with "No such file" error. But we could not see how that could happen in code paths.

If a chunk is read in ECDiskFileREader iter and check_frag finds corruption then the file will be quarantined, but reader._read_to_eof will be false so when closing the reader there will be no further call to check_frag.

Conversely, if the iter reaches end of file without a corruption being found, then the close() method will call check_frag with any remaining frag buf contents.

But, we couldn't see how both could occur.

description: updated
description: updated
description: updated
Changed in swift:
importance: Undecided → Medium
importance: Medium → Undecided
Revision history for this message
Alistair Coles (alistair-coles) wrote :

Noting that the logs indicate that the *auditor* appears to successfully quarantine the object and then the *auditor* has the "No such file" error raised, so this is not a race between the auditor and another process (unless it is a second auditor process)

Revision history for this message
clayg (clay-gerrard) wrote :

could be the two auditor workers, and in fact I think that might be likely

again I saw this once in development [1], but am not able to reproduce it reliably...

here's what I'm using to corrupt frags on disk:

https://gist.github.com/clayg/5ed8bc4286f74107bb2fecbdb6fb57cb

I've tried to also trigger the zbf content-length check on the same file; but I'm not sure how to get them to race reliably or even if that's a likely explanation for admin6's observations.

1. same gist https://gist.github.com/clayg/5ed8bc4286f74107bb2fecbdb6fb57cb has a traceback with ENOENT in quarantine, but it was unrelated

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.