An object .data file is a dir with another object inside it

Bug #1621255 reported by Darrell Bishop
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

I don't know _how_, but I ended up with a valid .data file for an object inside another object's data file (which is actually a directory). No joke.

/srv/node/d8/objects/48059/afd/2eeef28e1d869af3695ec1840d36eafd/1347488008.83489.data/76e66cb1350921594ffe62403dc9f73e/1345262011.15279.data

The object-auditor chokes on this with

ERROR Trying to audit /srv/node/d8/objects/48059/afd/2eeef28e1d869af3695ec1840d36eafd:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/swift/obj/auditor.py", line 214, in failsafe_object_audit
    self.object_audit(location)
  File "/usr/lib/pymodules/python2.7/swift/obj/auditor.py", line 237, in object_audit
    with df.open():
  File "/usr/lib/pymodules/python2.7/swift/obj/diskfile.py", line 1999, in open
    self._fp = self._construct_from_data_file(**file_info)
  File "/usr/lib/pymodules/python2.7/swift/obj/diskfile.py", line 2220, in _construct_from_data_file
    fp = open(data_file, 'rb')
IOError: [Errno 21] Is a directory:

instead of, say, quarantining it and getting rid of that `.data` directory.
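
For illustration, a minimal sketch (not Swift's actual diskfile code) of the behaviour being asked for here: if the ".data" entry turns out to be a directory, hand the hash dir to a quarantine hook instead of letting the IOError escape. The quarantine callable is a stand-in for Swift's real quarantine path, not its API.

import errno
import os

def open_datafile_or_quarantine(data_file, quarantine):
    # Sketch only: try to open the .data file; if it is actually a
    # directory (as in this bug), call a quarantine hook with the hash
    # dir instead of letting IOError(EISDIR) abort the audit pass.
    try:
        return open(data_file, 'rb')
    except IOError as err:
        if err.errno == errno.EISDIR or os.path.isdir(data_file):
            # 'quarantine' stands in for Swift's real quarantine helper;
            # it is hypothetical in this sketch.
            quarantine(os.path.dirname(data_file),
                       'expected a file, found a directory: %s' % data_file)
            return None
        raise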

Revision history for this message
clayg (clay-gerrard) wrote :

Or this sometimes maybe:

object-auditor: Unexpected file /srv/node/d23/objects/195729/724/bf245d1321483e8f034fb5f4ab5b3724/9d174b820f12f4d3e17a6f70e77c45d5: Invalid Timestamp value in filename '9d174b820f12f4d3e17a6f70e77c45d5'

hashdir in the datadir
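
Roughly what trips that warning (a sketch, not the auditor's actual parsing code): entries in a hash dir are expected to be named <timestamp>.<ext>, and the timestamp check boils down to a float conversion, which a 32-character hex hash name can't pass.

filename = '9d174b820f12f4d3e17a6f70e77c45d5'
timestamp_part = filename.rsplit('.', 1)[0]  # no extension to strip here
try:
    # Swift's Timestamp() is essentially a float conversion under the hood
    float(timestamp_part)
except ValueError:
    print("Invalid Timestamp value in filename %r" % filename)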

Revision history for this message
Kota Tsuyuzaki (tsuyuzaki-kota) wrote :

That looks like a significant bug; I haven't figured out how to reproduce it yet either.

From the path string, it looks like this happened with a "replication" policy? And the valid 1345262011.15279.data file doesn't seem to relate to the hash dir that holds the .data directory, i.e. 2eeef28e1d869af3695ec1840d36eafd != 76e66cb1350921594ffe62403dc9f73e. Curious...

Do you still have the environment? I'm curious about the actual .data file's status (in particular, its metadata and whether or not it's corrupted); that might be helpful for figuring out how to reproduce this.
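
For anyone trying to line those hashes up, a rough sketch of how the on-disk hash dir name is derived (the account/container/object names below are placeholders, they aren't shown in this report, and the real prefix/suffix come from swift_hash_path_prefix/suffix in swift.conf):

import hashlib

def hash_path(account, container, obj, prefix=b'', suffix=b'changeme'):
    # Sketch of Swift's hash_path(): md5 over prefix + '/a/c/o' + suffix;
    # the actual prefix/suffix values live in swift.conf.
    path = '/%s/%s/%s' % (account, container, obj)
    return hashlib.md5(prefix + path.encode('utf-8') + suffix).hexdigest()

# Placeholder names -- compare the result against the hash dir names above.
print(hash_path('AUTH_example', 'example-container', 'example-object'))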

Revision history for this message
clayg (clay-gerrard) wrote :

I should clarify! This was almost certainly the result of file system corruption.

These results are not from a "production" environment per se - it's an older set of hardware that last I checked was known to have failing disks - we keep it around specifically to observe the effects of aging dilapidated hardware.

Bit of chaos testing - better to catch these sorts of things now!

Should definitely be a high priority to improve our graceful degradation under this sort of hardware failure tho. The auditor could quarantine; but I'd be hesitant to call it release blocking. OTOH, if ahale confirms it and it's cheap to fix?

Kota, FWIW I do recall the object 76e66cb1350921594ffe62403dc9f73e/1345262011.15279.data was indeed valid according to swift-object-info. I'll try and validate that the named object was healthy in the cluster (presumably in a different filesystem path), and try to track down whether an object might exist or have once existed with the 2eeef28e1d869af3695ec1840d36eafd hash and/or possibly still be healthy on another disk/path.

-Clay
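
For reference, a sketch of checking where an object should live according to the object ring, to compare against the path the stray .data was found under (the account/container/object names are placeholders, not taken from this bug; the ring path and devices root are the usual defaults):

from swift.common.ring import Ring

# Sketch only: look up the primary nodes for an object and print the
# on-disk locations to check against, e.g., the d8 path in the description.
ring = Ring('/etc/swift/object.ring.gz')
part, nodes = ring.get_nodes('AUTH_example', 'example-container',
                             'example-object')
for node in nodes:
    print('%s:%s /srv/node/%s/objects/%d' % (
        node['ip'], node['port'], node['device'], part))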

Revision history for this message
Andrew Hale (ahale) wrote :

Yeah I also have some objects with unexpected paths on older machines that we've been draining of data and removing from use. I don't really know if it is from hardware failure or unpredictable filesystem behaviour under very high utilisation and lockup/reboot stress.

An example is an object like this,

c1u7/objects/837633/1c4/cc8019c044c38888beb1ef7d41a351c4/1424649379.36464.data/1424663510.19861.data

The actual .data file is mostly valid here; there's no Content-Length, Etag or Content-Type, but Path, Account, Container, Object and Object hash are there.

  Object hash: e4e57d9e8f65669c7b2cf3cf3bcad7d1
Content-Type: Not found in metadata
Timestamp: 2015-03-05T08:09:35.446020 (1425542975.44602)
Partition 937559
Hash e4e57d9e8f65669c7b2cf3cf3bcad7d1
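
For reference, a rough sketch of how that metadata can be read straight off the file, which is essentially what swift-object-info does: Swift pickles the metadata dict into the user.swift.metadata xattr, overflowing into user.swift.metadata1, ...2, and so on. The xattr package and the exact path handling are assumptions here.

import pickle
import xattr  # the 'xattr' package; assumed available alongside Swift

def read_swift_metadata(path):
    # Sketch: concatenate 'user.swift.metadata', 'user.swift.metadata1', ...
    # and unpickle the result into the metadata dict shown above.
    blob, i = b'', 0
    while True:
        key = 'user.swift.metadata' + (str(i) if i else '')
        try:
            blob += xattr.getxattr(path, key)
        except (IOError, OSError):
            break
        i += 1
    return pickle.loads(blob)

# Path as pasted above; prepend the actual mount point for the c1u7 device.
meta = read_swift_metadata(
    'c1u7/objects/837633/1c4/cc8019c044c38888beb1ef7d41a351c4/'
    '1424649379.36464.data/1424663510.19861.data')
print(sorted(meta))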

Both these partitions have one primary partition location in common at the moment, though not the host that the malformed file was found on. That host was in common with one of the 837633 machines in an old ring, so I guess this came from disk corruption issues that occurred back when 837633 and 937559 were together.

Anyway, I don't know that it's a release blocking thing, but I totally agree that there should be more thorough object-auditing to detect and quarantine these kinds of (and other weird, unusable) messed-up object paths.

Revision history for this message
Matthew Oliver (matt-0) wrote :

By the sounds of things, this has been confirmed by at least two different people from two different deployments, although it seems to be hard to reproduce.

Based on this, let's call it confirmed, but lower the importance until it's easily reproducible, since it's hard to confirm whether it's still an issue or not. But assume it is.

Changed in swift:
status: New → Confirmed
importance: Undecided → Low