Container sync stops if object server is down

Bug #1069910 reported by Donagh McCabe
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Darrell Bishop
Milestone: 1.7.5

Bug Description

Container sync stops syncing a container if an object server is down (connection refused or timeout).

container/sync.py container_sync_row() calls direct_get_object() for each object copy. If an object server is down, this fails with connection refused and ends up in an outer exception handler -- which returns False. This means that the container cannot be synced once an object server goes down. In a large system, the chances of at least one object server being down are high, and any non-trivially sized container is bound to have some object residing on every node in the system. Hence it's easy for container sync to stop working.

The relevant code looks like the following. The suggested fix is marked with + (i.e., catch Exception and Timeout *here*, inside the per-node loop, rather than letting them escape to the outer exception handler):

                for node in nodes:
                    try:
                        these_headers, this_body = direct_get_object(node,
                            part, info['account'], info['container'],
                            row['name'], resp_chunk_size=65536)
                        this_timestamp = float(these_headers['x-timestamp'])
                        if this_timestamp > timestamp:
                            timestamp = this_timestamp
                            headers = these_headers
                            body = this_body
                    except ClientException, err:
                        # If any errors are not 404, make sure we report the
                        # non-404 one. We don't want to mistakenly assume the
                        # object no longer exists just because one says so and
                        # the others errored for some other reason.
                        if not exc or exc.http_status == 404:
                            exc = err
+                    except (Exception, Timeout), err:
+                        exc = err
                if timestamp < looking_for_timestamp:
                    if exc:
                        raise exc

I have not submitted a fix because I'm not sure I understand the implications of not getting the absolute latest object copy. I would have thought it's OK if we get a good response from *any* object server and use it to sync the remote side. At worst it's an old copy, and the remote will ignore it because the timestamp is older... and if this container server's database is slightly out of date, then as soon as replication brings in the latest object, it will trigger a sync again.
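The fix described above amounts to a general pattern: tolerate individual node failures, keep the newest copy seen so far, and only give up if every node fails. A minimal, self-contained sketch of that pattern (not the actual swift code; `fetch` and `get_newest` are hypothetical names introduced here for illustration):

```python
def get_newest(nodes, fetch):
    """Return (timestamp, body) of the newest copy reachable via fetch().

    fetch(node) is a hypothetical callable returning (timestamp, body),
    or raising if the node is down or erroring.
    """
    newest_ts, newest_body, last_exc = -1.0, None, None
    for node in nodes:
        try:
            ts, body = fetch(node)
        except Exception as err:
            # Per-node handler: one down server does not abort the whole
            # scan, mirroring the suggested fix in the snippet above.
            last_exc = err
            continue
        if ts > newest_ts:
            newest_ts, newest_body = ts, body
    if newest_body is None and last_exc is not None:
        # Only raise if *no* node produced a copy.
        raise last_exc
    return newest_ts, newest_body
```

With this shape, a single refused connection merely means that node's copy is skipped; the sync proceeds with whatever (possibly slightly older) copy another node returns.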

I see a related issue with 404 not-found in bug #1068423. In that case, although we might have retrieved two copies, we raise the exception and give up because one copy is missing.

Also note that the handoff nodes are not used. For bug #1068423, the missing copy might simply be on a handoff node and not yet replicated.

Revision history for this message
Donagh McCabe (donagh-mccabe) wrote :

On re-reading the code I see my comment about bug #1068423 is wrong -- it only applies if all copies are missing.

Changed in swift:
assignee: nobody → Donagh McCabe (donagh-mccabe)
status: New → In Progress
Changed in swift:
assignee: Donagh McCabe (donagh-mccabe) → Darrell Bishop (darrellb)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/14824
Committed: http://github.com/openstack/swift/commit/00c3fde8f841ae20810a8dc86d57f6231f2eba43
Submitter: Jenkins
Branch: master

commit 00c3fde8f841ae20810a8dc86d57f6231f2eba43
Author: Donagh McCabe <email address hidden>
Date: Thu Nov 1 14:52:21 2012 -0700

    Handle down object servers in container-sync

    If an object server is down, container-sync stops syncing the container
    even if it gets object copies from "up" object servers.

    Bug 1069910

    In case the git history gets mangled, this fix was done almost entirely
    by Donagh McCabe <email address hidden>.

    Change-Id: Ieeadcfeb4e880fe5f08e284d7c12492bf7a29460

Changed in swift:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in swift:
milestone: none → 1.7.5
status: Fix Committed → Fix Released