(This is speculation while on a walk, not while reading through the code)
Thinking it through... if is_leader isn't refreshing, and we're only relying
on our async "every 30s, extend the lease by 1min" loop, then if something
happened to that async loop you could see a case where is_leader returns true
while we're actually failing to extend the lease.
That's even more likely if we only look at the agent's local state when
answering is_leader. If clock skew is in play, what happens if we get the
leadership token and our clock then jumps backward by 1 minute? It seems
possible that locally we think we're the leader, but we don't try to refresh
the token because, by our own clock, our time isn't up yet.
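Very roughly, the shape I'm imagining is something like this (hypothetical
names and a stubbed controller call, not the actual lease code):

    package main

    import (
        "sync"
        "time"
    )

    // leaseTracker is a made-up stand-in for the agent-side lease state.
    // It stores an absolute wall-clock expiry; Round(0) strips Go's
    // monotonic reading, mimicking a deadline that was round-tripped
    // through the DB or rebuilt from an epoch timestamp.
    type leaseTracker struct {
        mu     sync.Mutex
        expiry time.Time
    }

    // isLeader only consults local state; it never asks the controller.
    func (t *leaseTracker) isLeader() bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        return time.Now().Before(t.expiry)
    }

    // extendLoop is the async "every 30s extend by 1min" loop. If this
    // goroutine stalls, isLeader keeps answering true until expiry passes;
    // and if the wall clock jumps back a minute, expiry looks further away
    // than it really is, so we don't hurry to refresh either.
    func (t *leaseTracker) extendLoop(extend func() bool) {
        for range time.Tick(30 * time.Second) {
            if extend() {
                t.mu.Lock()
                t.expiry = time.Now().Add(time.Minute).Round(0)
                t.mu.Unlock()
            }
        }
    }

    func main() {
        t := &leaseTracker{expiry: time.Now().Add(time.Minute).Round(0)}
        go t.extendLoop(func() bool { return true }) // stubbed controller call
        _ = t.isLeader()
    }

If the stored expiry is the only thing is_leader consults, a stuck extend loop
and a backwards clock jump both leave it answering true for longer than the
controller would agree with.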
Auditing the code to make sure we're using durations and time.Since rather
than absolute times/deadlines would let the monotonic clock support in Go 1.9
help us out.
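For example (illustrative only), the audit would be looking to move code from
the first shape here to the second:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Fragile: compare against an absolute wall-clock deadline.
        // Round(0) strips the monotonic reading, standing in for a time
        // that came back from the DB or was built from an epoch value; a
        // backward clock jump makes the lease look longer-lived than it is.
        deadline := time.Now().Add(time.Minute).Round(0)
        fmt.Println("deadline check:", time.Now().Before(deadline))

        // Preferred: remember when the lease was granted and compare
        // durations. time.Since uses the monotonic reading that Go 1.9
        // attaches to grantedAt, so wall-clock jumps in either direction
        // don't change the measured elapsed time.
        grantedAt := time.Now()
        fmt.Println("duration check:", time.Since(grantedAt) < time.Minute)
    }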
We also need to be confident that we're not doing something wrong even when
time is perfectly stable.
John
=:->
On Oct 31, 2017 07:10, "John Meinel" <email address hidden> wrote:
> It would be good to know from the logs how long *we* think it took for
> those two lines to execute. On a heavily loaded system I think we've seen
> things spike as high as 45s for a query to execute, which chews up most of
> the lease time. Also whether there was something like a controller restart, etc.
>
> IIRC is_leader doesn't do an immediate refresh but just checks the current
> status. It might be more reliable if we forced a refresh at that
> point.
>
> John
> =:->
>
> On Oct 31, 2017 00:35, "Tim Penhey" <email address hidden> wrote:
>
>> Juju needs to confirm whether or not we have leadership bouncing between
>> units.
>>
>> Under "normal" circumstances, where normal means that we have continued
>> network connectivity, once a unit is a leader, it should stay as leader
>> until the API connection is dropped.
>>
>> There have been reports before of leadership bouncing between units, and
>> this is something we need to investigate. It is possible that clock skew
>> could have been an issue before, but this is where the recent work has
>> gone in to mitigate that problem.
>>
>> ** Changed in: juju
>> Status: New => Triaged
>>
>> ** Changed in: juju
>> Importance: Undecided => High
>>
>> ** Changed in: juju
>> Milestone: None => 2.3.0
>>
>> ** Changed in: juju
>> Assignee: (unassigned) => Andrew Wilkins (axwalk)
>>
>> --
>> You received this bug notification because you are subscribed to juju.
>> Matching subscriptions: juju bugs
>> https://bugs.launchpad.net/bugs/1728111
>>
>> Title:
>> pxc cluster build failed due to leadership change in early unit
>> lifecycle
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/charm-helpers/+bug/1728111/+subscriptions
>>
>