missing unit for leader

Bug #1921336 reported by Hua Zhang
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Triaged  Low         Unassigned

Bug Description

Our customer was performing a series upgrade. gnocchi and mongodb were colocated in the same container, and gnocchi's upgrade-series kept going into an error state. Once they removed mongodb from the container, the gnocchi upgrade went fine.

They then redeployed mongodb in a different container and hit the error 'no replset config has been received':

mongodb/6 maintenance executing 42/lxd/9 10.110.244.146 27017/tcp,27019/tcp,27021/tcp,28017/tcp no replset config has been received
nrpe/162 waiting allocating 10.110.244.146 agent initializing
mongodb/7 maintenance executing 43/lxd/9 10.110.244.147 27017/tcp,27019/tcp,27021/tcp,28017/tcp no replset config has been received
nrpe/161 waiting allocating 10.110.244.147 agent initializing

The immediate cause is that no unit holds leadership for the application: without a leader, init_replset [1] is never run, so the units never receive a replset config.

$ juju run --unit mongodb/20 is-leader
False
$ juju run --unit mongodb/21 is-leader
False
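
The per-unit check above can be scripted across every unit of the application. What follows is a minimal sketch, not part of the charm or of Juju: `query_is_leader`, `leaders` and `has_leader` are hypothetical helper names, and the only CLI call it relies on is the `juju run --unit <unit> is-leader` command already shown, whose output is assumed to be `True` or `False`.

```python
import subprocess


def query_is_leader(unit: str) -> str:
    """Run the same is-leader command as above and return its raw stdout."""
    result = subprocess.run(
        ["juju", "run", "--unit", unit, "is-leader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def leaders(outputs: dict) -> list:
    """Return the units whose is-leader output is exactly 'True'.

    `outputs` maps a unit name to the raw text printed by is-leader.
    """
    return sorted(unit for unit, out in outputs.items() if out.strip() == "True")


def has_leader(outputs: dict) -> bool:
    """True if at least one unit reports that it is the leader."""
    return bool(leaders(outputs))


# The outputs reported in this bug: neither unit is the leader.
reported = {"mongodb/20": "False\n", "mongodb/21": "False\n"}
print(has_leader(reported))
```

To gather live data instead of the recorded outputs, something like `{u: query_is_leader(u) for u in ("mongodb/20", "mongodb/21")}` could be passed to `has_leader`.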

But why is the leader missing? The steps above are all that had been done. We then tried the following:

1. We removed the application and redeployed it under the same name several times; it failed every time.

juju remove-application mongodb --force
juju deploy mongodb -n 2 --constraints "spaces=oam-space" --bind "internal-space configsvr=internal-space data=internal-space database=internal-space mongos=internal-space mongos-cfg=internal-space nrpe-external-master=internal-space replica-set=internal-space" --to lxd:42,lxd:43

2. We restarted the juju agent and the juju unit on the two hosts, following lp:1810331 [2]; this failed as well.

3. Finally, redeploying under a different name fixed the issue.

I also ran many tests but could not reproduce the problem. I also analyzed some data:

1. unitstates.json shows leader is false for both mongodb/20 and mongodb/21, see https://paste.ubuntu.com/p/ss8Rc4yKTv/

2. settings.json has no entries for mongodb/20 and mongodb/21, see https://paste.ubuntu.com/p/GJgsWwmkYq/

Current versions: series=bionic, cs:mongodb-54, mongodb=3.6.3

[1] https://git.launchpad.net/charm-mongodb/tree/hooks/hooks.py#n1300
[2] https://bugs.launchpad.net/juju/+bug/1810331

Hua Zhang (zhhuabj)
tags: added: sts
Hua Zhang (zhhuabj)
description: updated
Hua Zhang (zhhuabj) wrote :

Our customer hit this problem again when upgrading the juju controller and model from 2.8.9 to 2.8.10 as part of an upgrade procedure. mongodb went into maintenance status again, and there was no leader at that time. See https://paste.ubuntu.com/p/xYMqrQQ5dm/

Joseph Phillips (manadart) wrote :

This looks like the result of having the leadership pinned during the upgrade-series preparations step, then the pinned unit removed.

This would mean that the completion step did not cause the leadership to be unpinned from the removed unit (no longer associated with the machine), so leadership remains locked to an absent unit.
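
The sequence described above can be illustrated with a toy model. This is a hypothetical sketch for illustration only: `ToyLeases` is not Juju's lease manager, its method names are invented, and the unit name `mongodb/0` for the removed unit is an assumption. It only shows why a lease that stays pinned to an absent holder blocks every later claim.

```python
class ToyLeases:
    """Toy model of per-application leadership leases with pinning."""

    def __init__(self):
        self.holder = {}      # application -> unit holding the lease
        self.pinned = set()   # applications whose lease must not expire

    def claim(self, app, unit):
        """Grant the lease if it is free or already held by this unit."""
        current = self.holder.get(app)
        if current is None or current == unit:
            self.holder[app] = unit
            return True
        return False

    def pin(self, app):
        """Freeze the current holder (as in the upgrade-series prepare step)."""
        self.pinned.add(app)

    def unpin(self, app):
        """Allow the lease to expire again (as in the completion step)."""
        self.pinned.discard(app)

    def expire(self, app):
        """Drop the lease unless it is pinned; return whether it was dropped."""
        if app in self.pinned:
            return False
        self.holder.pop(app, None)
        return True


def simulate():
    leases = ToyLeases()
    leases.claim("mongodb", "mongodb/0")   # original leader (assumed name)
    leases.pin("mongodb")                  # prepare step pins leadership
    # mongodb/0 is removed here and the completion step never unpins.
    stuck_expire = leases.expire("mongodb")
    stuck_claim = leases.claim("mongodb", "mongodb/20")
    # Once the pin is released, the lease can expire and be claimed.
    leases.unpin("mongodb")
    freed = leases.expire("mongodb")
    new_claim = leases.claim("mongodb", "mongodb/20")
    return stuck_expire, stuck_claim, freed, new_claim


print(simulate())
```

In this model the first two results are `False` (the pinned lease neither expires nor changes hands) and the last two are `True` once the pin is released, which matches the behaviour reported for the redeployed units.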

A work-around if the cloud can withstand the churn is to:
- Stop the controller machine agents.
- Delete the contents of /var/lib/juju/raft on each controller.
- Restart the agents.

This will cause Raft to re-elect new leaders for all applications, but it will evict the frozen leader.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
John A Meinel (jameinel) wrote :

I thought we determined that Raft in HA doesn't actually get restored
properly. So this is a dangerous workaround. (Raft only auto-restores in
singleton controllers).

(In reply to Joseph Phillips's comment above, sent Tue, Apr 6, 2021 at 6:41 AM.)

Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot