object-replicator goes into bad state when lockup timeout < rsync timeout
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Confirmed | Medium | Unassigned |
Bug Description
We have long-running rsync processes as they attempt to move a lot of data across a WAN link. The rsync timeout has been increased to 60 minutes (from the default of 15); the lockup timeout was left at its default (30 minutes).
It seems that once we hit this "lockup timeout" during a long-running rsync, things go awry. Data is still moving, but the replicated partition count stops increasing. Once this occurs, replication essentially never recovers. Furthermore, we have to 'kill -9' the object-replicator processes; they don't respond to normal signals.
The easiest fix was simply to set the lockup timeout to a higher value than the rsync timeout, but I'm opening this ticket because Clay G. thought there might be something wrong with the lockup detection and recovery that needs to be looked at.
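The workaround amounts to an invariant: the lockup timeout must exceed the rsync timeout, or the lockup detector can fire while a legitimately long rsync is still in flight. A minimal sketch of that sanity check (the helper name is hypothetical, not part of Swift; the defaults match Swift's documented `rsync_timeout` of 900 s and `lockup_timeout` of 1800 s):

```python
# Hypothetical config sanity check -- not part of Swift itself.
def lockup_timeout_safe(conf):
    """True iff lockup_timeout leaves room for the longest allowed rsync."""
    rsync_timeout = int(conf.get('rsync_timeout', 900))     # Swift default: 15 min
    lockup_timeout = int(conf.get('lockup_timeout', 1800))  # Swift default: 30 min
    return lockup_timeout > rsync_timeout

# Defaults are consistent: 1800 > 900.
print(lockup_timeout_safe({}))                        # True
# The setup reported here: rsync_timeout raised to 60 min, lockup left at 30.
print(lockup_timeout_safe({'rsync_timeout': 3600}))   # False
```

With both timeouts raised together (e.g. `lockup_timeout = 7200`), the check passes again, which is exactly the manual fix described above.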
Changed in swift:
importance: High → Medium
I think the logs from this series are pretty damning:
http://paste.openstack.org/show/JFyHkmdIKMqyT9XIsPAq/
It's basically like
1) things are moving (slowly)
2) meh, too slow for me - "Lockup detected.. killing live coros"
3) no progress is made ever again until you manually restart daemons
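One plausible mechanism for step 3 (an assumption on my part, not verified against the replicator code): killing the coroutine that is waiting on an rsync does not kill the rsync child process itself, so the child lingers with nothing waiting on it and whatever slot it held is never released. A toy illustration, with a plain subprocess standing in for rsync:

```python
import subprocess
import time

# Toy illustration (assumption, not Swift code): killing the Python-side
# waiter does not terminate the child process it spawned.
proc = subprocess.Popen(['sleep', '2'])  # stand-in for a long-running rsync
# Imagine the coroutine waiting on `proc` is killed here ("Lockup detected..
# killing live coros"). The child keeps running, orphaned from its waiter.
time.sleep(0.2)
print(proc.poll())  # None -> child is still alive, and nothing will reap it
proc.terminate()
proc.wait()
```

If something like this is happening, it would explain both symptoms: no progress after the lockup fires, and daemons that need 'kill -9' because live children keep them from shutting down cleanly.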