repeated ErrInvalid can cause lease.Manager to be stopped
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
Critical
|
John A Meinel | ||
2.5 |
Fix Released
|
Critical
|
John A Meinel |
Bug Description
We saw this in the wild. It used to be that ClaimLease spun in a tight loop whenever it got ErrInvalid and never stopped retrying.
We changed it so that it would keep trying but using the same exponential backoff that we use for ErrTimeout.
However, ErrTimeout only retried 5 times, backing off 2x each step starting at 50ms.
So
50ms
100ms
200ms
400ms
800ms
# final failure.
It seems that we aren't able to get the Leases tracked on *this* node to be in-sync within that time, such that we would generate a ErrClaimDenied (we see that someone else does own this lease).
Probably if we get N attempts to claim a lease, and always get ErrInvalid, we should finish with ErrClaimDenied, even if we can't tell who does hold the lease.
It shouldn't be too risky, as the next thing the client should do is call BlockUntilLeade
ErrInvalid should *definitely* not kill the Manager loop.
I think we also have a different fundamental issue if we are seeing failure to get convergence in raft letting us know who the *master* thinks has the Lease after ~1.5s.
We can easily bump the number of retries, and/or change it from 2x backoff (retry.Exponential has a Factor we can set.))
Note that for ErrTimeout, its already a bit strange. Specifically, Timeout means it took 5s to attempt a request. But then if we fail after 5s, we then only sleep for 50ms before trying again. The scale of the times don't seem quite correct.
Changed in juju: | |
milestone: | none → 2.6-beta1 |
Changed in juju: | |
status: | Fix Committed → Fix Released |
https:/ /github. com/juju/ juju/pull/ 9745 is a PR against 2.5 to fix this behavior.