Canonical Juju

repeated ErrInvalid can cause lease.Manager to be stopped

Bug #1815719 reported by John A Meinel on 2019-02-13

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	Critical	John A Meinel	Canonical Juju 2.6-beta1
	2.5	Fix Released	Critical	John A Meinel	Canonical Juju 2.5.1

Bug Description

We saw this in the wild. It used to be that ClaimLease spun in a tight loop whenever it got ErrInvalid and never stopped retrying.
We changed it so that it would keep trying but using the same exponential backoff that we use for ErrTimeout.
However, ErrTimeout only retried 5 times, backing off 2x each step starting at 50ms.
So
50ms
100ms
200ms
400ms
800ms
# final failure.

It seems that we aren't able to get the Leases tracked on *this* node to be in-sync within that time, such that we would generate a ErrClaimDenied (we see that someone else does own this lease).

Probably if we get N attempts to claim a lease, and always get ErrInvalid, we should finish with ErrClaimDenied, even if we can't tell who does hold the lease.

It shouldn't be too risky, as the next thing the client should do is call BlockUntilLeadershipReleased, and if we still haven't gotten in sync by that point, then our Leases collection won't have the lease record, and will immediately unblock them, and they will try to Claim again.

ErrInvalid should *definitely* not kill the Manager loop.

I think we also have a different fundamental issue if we are seeing failure to get convergence in raft letting us know who the *master* thinks has the Lease after ~1.5s.

We can easily bump the number of retries, and/or change it from 2x backoff (retry.Exponential has a Factor we can set.))

Note that for ErrTimeout, its already a bit strange. Specifically, Timeout means it took 5s to attempt a request. But then if we fail after 5s, we then only sleep for 50ms before trying again. The scale of the times don't seem quite correct.

Tags:

John A Meinel (jameinel) on 2019-02-13

Changed in juju:
milestone:	none → 2.6-beta1

Revision history for this message

John A Meinel (jameinel) wrote on 2019-02-13:

https://github.com/juju/juju/pull/9745 is a PR against 2.5 to fix this behavior.

Revision history for this message

John A Meinel (jameinel) wrote on 2019-02-13:

Note that this seems to raise the priority of bug #1814735

Revision history for this message

John A Meinel (jameinel) wrote on 2019-02-13:

https://github.com/juju/juju/pull/9748 merges the fix into 2.6

Changed in juju:
status:	In Progress → Fix Committed

Anastasia (anastasia-macmood) on 2019-05-13

Changed in juju:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.