object-replicator goes into bad state when lockup timeout < rsync timeout
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Confirmed | Medium | Unassigned |
Bug Description
We have long-running rsync processes as they attempt to move a lot of data across a WAN link. The rsync timeout has been increased to 60 minutes (from the default of 15); the lockup timeout was left at its default (30 minutes).
It seems that once we hit this "lockup timeout" during a long-running rsync, things go awry. Data is still moving, but the replicated partition count stops increasing. Once this occurs, replication essentially never recovers. Furthermore, we have to 'kill -9' the object-replicator processes; they don't respond to normal signals.
The easiest fix was simply to set the lockup timeout to a higher value than the rsync timeout, but I'm opening this ticket because Clay G. thought there might be something wrong with the lockup detection and recovery that needs to be looked at.
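The workaround amounts to an invariant: the lockup timeout must exceed the rsync timeout, or the lockup detector can fire while a legitimately long rsync is still in flight. A minimal sketch of that sanity check (the helper name is hypothetical, not part of Swift; the defaults match Swift's documented `rsync_timeout` of 900 s and `lockup_timeout` of 1800 s):

```python
# Hypothetical config sanity check -- not part of Swift itself.
def lockup_timeout_safe(conf):
    """True iff lockup_timeout leaves room for the longest allowed rsync."""
    rsync_timeout = int(conf.get('rsync_timeout', 900))     # Swift default: 15 min
    lockup_timeout = int(conf.get('lockup_timeout', 1800))  # Swift default: 30 min
    return lockup_timeout > rsync_timeout

# Defaults are consistent: 1800 > 900.
print(lockup_timeout_safe({}))                        # True
# The setup reported here: rsync_timeout raised to 60 min, lockup left at 30.
print(lockup_timeout_safe({'rsync_timeout': 3600}))   # False
```

With both timeouts raised together (e.g. `lockup_timeout = 7200`), the check passes again, which is exactly the manual fix described above.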
Changed in swift:
importance: High → Medium
I think the logs from this series are pretty damning:
http://paste.openstack.org/show/JFyHkmdIKMqyT9XIsPAq/
It's basically like
1) things are moving (slowly)
2) meh, too slow for me - "Lockup detected.. killing live coros"
3) no progress is made ever again until you manually restart daemons
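One plausible mechanism for step 3 (an assumption on my part, not verified against the replicator code): killing the coroutine that is waiting on an rsync does not kill the rsync child process itself, so the child lingers with nothing waiting on it and whatever slot it held is never released. A toy illustration, with a plain subprocess standing in for rsync:

```python
import subprocess
import time

# Toy illustration (assumption, not Swift code): killing the Python-side
# waiter does not terminate the child process it spawned.
proc = subprocess.Popen(['sleep', '2'])  # stand-in for a long-running rsync
# Imagine the coroutine waiting on `proc` is killed here ("Lockup detected..
# killing live coros"). The child keeps running, orphaned from its waiter.
time.sleep(0.2)
print(proc.poll())  # None -> child is still alive, and nothing will reap it
proc.terminate()
proc.wait()
```

If something like this is happening, it would explain both symptoms: no progress after the lockup fires, and daemons that need 'kill -9' because live children keep them from shutting down cleanly.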