Container sync stops if object server is down

Bug #1069910 reported by Donagh McCabe
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Darrell Bishop
Milestone: 1.7.5

Bug Description

Container sync stops syncing a container if an object server is down (connection refused or timeout).

container/sync.py container_sync_row() calls direct_get_object() for each object copy. If an object server is down, this fails with connection refused and ends up in an outer exception handler -- which returns False. This means that the container cannot be synced once an object server goes down. In a large system, the chances of at least one object server being down are high, and any non-trivially sized container is bound to have some object residing on every node in the system. Hence it's easy for container sync to stop working.

The relevant code looks like the following. The suggested fix is marked with + (i.e., catch Exception and Timeout *here*, inside the per-node loop, rather than letting them escape to the outer exception handler):

                for node in nodes:
                    try:
                        these_headers, this_body = direct_get_object(node,
                            part, info['account'], info['container'],
                            row['name'], resp_chunk_size=65536)
                        this_timestamp = float(these_headers['x-timestamp'])
                        if this_timestamp > timestamp:
                            timestamp = this_timestamp
                            headers = these_headers
                            body = this_body
                    except ClientException, err:
                        # If any errors are not 404, make sure we report the
                        # non-404 one. We don't want to mistakenly assume the
                        # object no longer exists just because one says so and
                        # the others errored for some other reason.
                        if not exc or exc.http_status == 404:
                            exc = err
+                    except (Exception, Timeout), err:
+                        exc = err
                if timestamp < looking_for_timestamp:
                    if exc:
                        raise exc

I have not submitted a fix because I'm not sure I understand the implications of not getting the absolute latest object copy. I would have thought it's OK if we get a good response from *any* object server and use it to sync the remote side. At worst it's an old copy, and the remote will ignore it because the timestamp is older... and if this container server's database is slightly out of date, then as soon as replication brings in the latest object, it will trigger a sync again.
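The fix described above amounts to a general pattern: tolerate individual node failures, keep the newest copy seen so far, and only give up if every node fails. A minimal, self-contained sketch of that pattern (not the actual swift code; `fetch` and `get_newest` are hypothetical names introduced here for illustration):

```python
def get_newest(nodes, fetch):
    """Return (timestamp, body) of the newest copy reachable via fetch().

    fetch(node) is a hypothetical callable returning (timestamp, body),
    or raising if the node is down or erroring.
    """
    newest_ts, newest_body, last_exc = -1.0, None, None
    for node in nodes:
        try:
            ts, body = fetch(node)
        except Exception as err:
            # Per-node handler: one down server does not abort the whole
            # scan, mirroring the suggested fix in the snippet above.
            last_exc = err
            continue
        if ts > newest_ts:
            newest_ts, newest_body = ts, body
    if newest_body is None and last_exc is not None:
        # Only raise if *no* node produced a copy.
        raise last_exc
    return newest_ts, newest_body
```

With this shape, a single refused connection merely means that node's copy is skipped; the sync proceeds with whatever (possibly slightly older) copy another node returns.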

I see a related issue with 404 not-found in bug #1068423. In that case, although we might have retrieved two copies, we raise the exception and give up because one copy is missing.

Also note that the handoff nodes are not used. For bug #1068423, the missing copy might simply be on a handoff node and not yet replicated.

Revision history for this message
Donagh McCabe (donagh-mccabe) wrote :

On re-reading the code I see my comment about bug #1068423 is wrong -- it only applies if all copies are missing.

Changed in swift:
assignee: nobody → Donagh McCabe (donagh-mccabe)
status: New → In Progress
Changed in swift:
assignee: Donagh McCabe (donagh-mccabe) → Darrell Bishop (darrellb)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/14824
Committed: http://github.com/openstack/swift/commit/00c3fde8f841ae20810a8dc86d57f6231f2eba43
Submitter: Jenkins
Branch: master

commit 00c3fde8f841ae20810a8dc86d57f6231f2eba43
Author: Donagh McCabe <email address hidden>
Date: Thu Nov 1 14:52:21 2012 -0700

    Handle down object servers in container-sync

    If an object server is down, container-sync stops syncing the container
    even if it gets object copies from "up" object servers.

    Bug 1069910

    In case the git history gets mangled, this fix was done almost entirely
    by Donagh McCabe <email address hidden>.

    Change-Id: Ieeadcfeb4e880fe5f08e284d7c12492bf7a29460

Changed in swift:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in swift:
milestone: none → 1.7.5
status: Fix Committed → Fix Released