leader_id stale/incorrect; causes rsync cron job missing on leader unit

Bug #1797297 reported by Haw Loeung
This bug affects 2 people
Affects: Ubuntu Repository Cache Charm
Status: Fix Released
Importance: High
Assigned to: Haw Loeung

Bug Description

Hi,

As seen today, leader_id was stale/incorrect:

| Unit Workload Agent Machine Public address Ports Message
| ubuntu-repository-cache/0* unknown idle 0 51.140.142.48 80/tcp
| ...
| ubuntu-repository-cache/2 unknown idle 2 51.140.9.50 80/tcp
| ...

| ubuntu@machine-0:~$ sudo juju-run ubuntu-repository-cache/0 "leader-get"
| leader_id: ubuntu-repository-cache/2

| ubuntu@machine-2:~$ sudo juju-run ubuntu-repository-cache/2 "leader-get"
| leader_id: ubuntu-repository-cache/2

leader_id only gets set when the leader-elected hook fires. I think we should also set it on config-changed or some other hook to ensure that leader_id isn't stale.

Bit of evidence - https://pastebin.canonical.com/p/9qDdJ6jv45/

| 2018-10-11 01:09:20 WARNING juju-log cluster:1: Leader changed between peer_update_metadata and _nonleader_update_metadata

Or even when the sync job runs from cron:

| 2018-10-11 02:23:36,164 - Executing hook: ['juju-run', 'ubuntu-repository-cache/0', '/var/lib/juju/agents/unit-ubuntu-repository-cache-0/charm/hooks/ubuntu-repository-cache-sync ubuntu_2018-10-11_02:23:01_u0']

Have hooks/ubuntu-repository-cache-sync check and ensure leader_id isn't stale.
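
A minimal sketch of what such a check could look like, assuming the sync script can call the Juju hook tools directly (it runs under juju-run, so they should be on the PATH). The function names `_query` and `refresh_leader_id` are hypothetical and not taken from the charm; only the `is-leader`, `leader-get` and `leader-set` hook tools are real.

```python
import json
import subprocess


def _query(tool, *args):
    """Run a Juju hook tool with JSON output and return the decoded value."""
    out = subprocess.check_output([tool, "--format=json"] + list(args))
    return json.loads(out.decode("utf-8"))


def refresh_leader_id(local_unit, query=_query, run=subprocess.check_call):
    """Rewrite leader_id if this unit is the leader and the setting is stale.

    Returns True when the setting was rewritten, False otherwise.
    `query` and `run` are injectable so the logic can be tested without Juju.
    """
    # Only the leader is allowed to call leader-set.
    if not query("is-leader"):
        return False
    # Nothing to do when the stored value already names this unit.
    if query("leader-get", "leader_id") == local_unit:
        return False
    run(["leader-set", "leader_id={}".format(local_unit)])
    return True
```

Inside a hook context, the local unit name would normally come from the JUJU_UNIT_NAME environment variable. This only narrows the window in which leader_id is stale; it does not remove it.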


Haw Loeung (hloeung)
description: updated
Revision history for this message
Stuart Bishop (stub) wrote :

Updating this leader setting more aggressively will help, and it is a quick fix, so it can be done.

However, a setting like this is not reliable: it can only ever state which unit *was* the leader, and cannot state which unit *is* the leader. The only unit that can reliably know who the leader is is the leader itself (by calling is-leader). We should drop this leadership setting and refactor the charm to rely only on 'is-leader', not on the stored leader_id. To communicate with the leader, place the message on the peer relation. Any message being sent to a specific unit because it *was* the leader is a bug, because there is no guarantee that the unit will still be the leader when the message arrives.
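
As a rough illustration of gating on is-leader instead of a stored leader_id, here is a sketch that asks Juju at the moment the work is about to run. `run_if_leader` is a hypothetical helper, not part of the charm; the `is-leader --format=json` call is the real hook tool.

```python
import json
import subprocess


def is_leader(run=subprocess.check_output):
    """Ask Juju whether this unit currently holds leadership."""
    out = run(["is-leader", "--format=json"])
    return json.loads(out.decode("utf-8")) is True


def run_if_leader(task, leader_check=is_leader):
    """Run task() only when this unit is the leader right now.

    Returns the task's result, or None when the unit is not the leader.
    """
    if not leader_check():
        return None
    return task()
```

Each unit answers the question for itself, so no unit ever has to trust a value that another unit wrote at some earlier time.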

Revision history for this message
Stuart Bishop (stub) wrote :

Ideally, the only thing the lead unit does is select which of the units is primary. We don't want the primary unit in an ubuntu-repository-cache deployment to flap every time there is a netsplit (which will trigger Juju leadership elections and flapping).
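
One way to picture a "sticky" primary, as a sketch only: the leader keeps the existing primary for as long as that unit is still present, and picks a new one only when it has gone. `choose_primary` and the lowest-unit-number tie-break are assumptions made for illustration, not the charm's behaviour.

```python
def unit_number(unit_name):
    """Return the numeric suffix of a unit name such as 'app/3'."""
    return int(unit_name.rsplit("/", 1)[1])


def choose_primary(current_primary, live_units):
    """Pick the primary unit, keeping the current one whenever possible.

    A change of Juju leader does not change the result as long as the
    current primary is still among the live units.
    """
    if current_primary in live_units:
        return current_primary
    if not live_units:
        return None
    # Deterministic fallback (an assumption): the lowest unit number wins.
    return min(live_units, key=unit_number)
```

With this shape, a leadership election by itself never moves the primary; only the loss of the primary unit does.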

Revision history for this message
Haw Loeung (hloeung) wrote :
Changed in ubuntu-repository-cache:
status: New → Triaged
importance: Undecided → High
Haw Loeung (hloeung)
Changed in ubuntu-repository-cache:
assignee: nobody → Haw Loeung (hloeung)
Haw Loeung (hloeung)
Changed in ubuntu-repository-cache:
status: Triaged → In Progress
Revision history for this message
Haw Loeung (hloeung) wrote :

To add:

2020-12-17 16:17:45 INFO juju.worker.leadership tracker.go:194 ubuntu-repository-cache/1 promoted to leadership of ubuntu-repository-cache
2020-12-17 16:18:07 INFO juju-log Reactive main running for hook leader-elected
tracer: ++ queue handler reactive/ubuntu-repository-cache.py:218:leader_elected
2020-12-17 16:18:07 INFO juju-log Invoking reactive handler: reactive/ubuntu-repository-cache.py:218:leader_elected
2020-12-17 16:18:07 INFO juju-log leader-elected fired. This is not the leader
2020-12-17 16:18:07 INFO juju.worker.uniter.operation runhook.go:142 ran "leader-elected" hook (via explicit, bespoke hook script)
2020-12-17 22:28:25 INFO juju.worker.leadership tracker.go:194 ubuntu-repository-cache/1 promoted to leadership of ubuntu-repository-cache
2020-12-17 22:48:09 INFO juju-log cluster:1: Updating metadata on the leader
2020-12-17 22:48:09 WARNING juju-log cluster:1: Leader changed between peer_update_metadata and _leader_update_metadata

And:

| ubuntu-repository-cache/1* unknown idle 1 20.195.53.225 80/tcp

However:

| ubuntu@machine-1:~$ sudo juju-run ubuntu-repository-cache/1 "leader-get"
| leader_id: ubuntu-repository-cache/0

summary: - leader_id stale/incorrect
+ leader_id stale/incorrect; causes rsync cron job missing on leader unit
Revision history for this message
Haw Loeung (hloeung) wrote :

See also LP:1723184

Revision history for this message
Haw Loeung (hloeung) wrote :

See also LP:1909569

Haw Loeung (hloeung)
Changed in ubuntu-repository-cache:
status: In Progress → Fix Committed
Haw Loeung (hloeung)
Changed in ubuntu-repository-cache:
status: Fix Committed → Fix Released