Canonical Juju

missing unit for leader

Bug #1921336 reported by Hua Zhang on 2021-03-25

This bug affects 1 person

Affects          Status    Importance    Assigned to    Milestone
Canonical Juju   Triaged   Low           Unassigned

Bug Description

Our customer was doing a series upgrade. gnocchi and mongodb were in the same container, and the gnocchi upgrade-series went into an error state. Once they removed mongodb from the container, the gnocchi upgrade went fine.

They then redeployed mongodb to a different container and hit the issue 'no replset config has been received'.

mongodb/6 maintenance executing 42/lxd/9 10.110.244.146 27017/tcp,27019/tcp,27021/tcp,28017/tcp no replset config has been received
nrpe/162 waiting allocating 10.110.244.146 agent initializing
mongodb/7 maintenance executing 43/lxd/9 10.110.244.147 27017/tcp,27019/tcp,27021/tcp,28017/tcp no replset config has been received
nrpe/161 waiting allocating 10.110.244.147 agent initializing

The root cause is evidently that no unit holds leadership: without a leader, init_replset [1] is never run, so the units never receive a replset config.

$ juju run --unit mongodb/20 is-leader
False
$ juju run --unit mongodb/21 is-leader
False
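
The per-unit check above can be extended to every application in the model. The following is a minimal sketch, assuming that the parsed output of `juju status --format=json` marks the leader unit with a `leader: true` field and omits it elsewhere (worth verifying against the Juju version in use). The function name `apps_without_leader` is hypothetical, and the sample data is illustrative only (the gnocchi unit name is made up).

```python
def apps_without_leader(status):
    """Return names of applications that have units but no leader among them.

    `status` is the parsed output of `juju status --format=json`.
    Assumption: the leader unit carries "leader": true; other units omit it.
    Applications without a "units" entry (e.g. subordinates) are skipped.
    """
    missing = []
    for name, app in status.get("applications", {}).items():
        units = app.get("units") or {}
        if units and not any(u.get("leader") for u in units.values()):
            missing.append(name)
    return sorted(missing)


# Illustrative data shaped like the situation in this report.
sample = {
    "applications": {
        "mongodb": {"units": {"mongodb/20": {}, "mongodb/21": {}}},
        "gnocchi": {"units": {"gnocchi/0": {"leader": True}}},
        "nrpe": {"subordinate-to": ["mongodb"]},
    }
}
print(apps_without_leader(sample))  # ['mongodb']
```

To run it against a live model, pass it `json.loads(...)` of the `juju status --format=json` output instead of the sample dictionary.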

But why is the leader missing? The above is everything that had been done up to that point. We then tried the following:

1. We removed and redeployed the application several times under the original name, and it failed every time.

juju remove-application mongodb --force
juju deploy mongodb -n 2 --constraints "spaces=oam-space" --bind "internal-space configsvr=internal-space data=internal-space database=internal-space mongos=internal-space mongos-cfg=internal-space nrpe-external-master=internal-space replica-set=internal-space" --to lxd:42,lxd:43

2. We restarted the juju machine agent and unit agent on the two hosts, as suggested in lp:1810331 [2]. This failed as well.

3. Finally, redeploying under a different name fixed the issue.

I also ran many tests but could not reproduce the problem. I then analyzed some data:

1. unitstates.json shows leader is false for both mongodb/20 and mongodb/21, see https://paste.ubuntu.com/p/ss8Rc4yKTv/

2. settings.json shows no entries for mongodb/20 and mongodb/21, see https://paste.ubuntu.com/p/GJgsWwmkYq/

The present version is: series=bionic, cs:mongodb-54, mongodb=3.6.3

[1] https://git.launchpad.net/charm-mongodb/tree/hooks/hooks.py#n1300
[2] https://bugs.launchpad.net/juju/+bug/1810331

Hua Zhang (zhhuabj) on 2021-03-25

tags:	added: sts

Hua Zhang (zhhuabj) on 2021-03-26

description:	updated

Hua Zhang (zhhuabj) wrote on 2021-03-31:

Our customer hit this problem again when they upgraded the Juju controller and model from 2.8.9 to 2.8.10 as part of an upgrade procedure. mongodb went into maintenance status again, and no unit held leadership at that time. See https://paste.ubuntu.com/p/xYMqrQQ5dm/

Joseph Phillips (manadart) wrote on 2021-04-06:

This looks like the result of leadership being pinned during the upgrade-series prepare step, and the pinned unit then being removed.

This would mean that the completion step did not cause the leadership to be unpinned from the removed unit (no longer associated with the machine), so leadership remains locked to an absent unit.
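
To make the suspected sequence concrete, here is a toy model in Python. It is not Juju's lease implementation: the `ToyLease` class, its methods, and the unit and machine names used below are invented for illustration. The only assumption it encodes is that a pinned lease is exempt from expiry for as long as any pin is held.

```python
class ToyLease:
    """Toy application-leadership lease; invented for illustration only."""

    def __init__(self):
        self.holder = None   # unit currently holding leadership
        self.pins = set()    # entities (e.g. machines) pinning the lease

    def claim(self, unit):
        """Grant leadership if the lease is free; return True if unit leads."""
        if self.holder is None:
            self.holder = unit
        return self.holder == unit

    def pin(self, entity):
        self.pins.add(entity)

    def unpin(self, entity):
        self.pins.discard(entity)

    def expire(self):
        """Release the lease on expiry, unless it is pinned."""
        if not self.pins:
            self.holder = None


lease = ToyLease()
lease.claim("mongodb/0")        # original leader (hypothetical unit name)
lease.pin("machine-42/lxd/0")   # upgrade-series prepare pins leadership
# The unit is removed mid-upgrade, so the completion step never unpins.
lease.expire()                  # expiry is a no-op while the pin is held
print(lease.claim("mongodb/20"), lease.claim("mongodb/21"))  # False False
print(lease.holder)             # mongodb/0, a unit that no longer exists
```

In this model, leadership only becomes available again once the stale pin is dropped and the lease is allowed to expire.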

A work-around if the cloud can withstand the churn is to:
- Stop the controller machine agents.
- Delete the contents of /var/lib/juju/raft on each controller.
- Restart the agents.

This will cause Raft to re-elect new leaders for all applications, but it will evict the frozen leader.
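
The three steps can be laid out as an ordered command plan. The sketch below only builds and prints the commands; it does not run anything. It assumes the controller machine agents run as systemd services named `jujud-machine-<id>`, which should be checked on the actual controllers. The function name `raft_reset_plan` and the `allow_ha` flag are hypothetical. Because a later comment in this report warns that Raft may not be restored properly on HA controllers, the sketch refuses more than one controller unless explicitly told otherwise.

```python
RAFT_DIR = "/var/lib/juju/raft"  # path given in the workaround


def raft_reset_plan(controller_ids, allow_ha=False):
    """Return (machine_id, command) pairs in the order the workaround lists.

    Nothing is executed. Assumes agents are systemd units named
    jujud-machine-<id>; verify this on the controllers first.
    """
    ids = list(controller_ids)
    if len(ids) > 1 and not allow_ha:
        raise ValueError("multiple controllers: Raft may not restore in HA")
    service = "jujud-machine-{}".format
    # 1. Stop every controller machine agent.
    plan = [(i, f"sudo systemctl stop {service(i)}") for i in ids]
    # 2. Delete the contents (not the directory itself) of the Raft dir.
    plan += [(i, f"sudo find {RAFT_DIR} -mindepth 1 -delete") for i in ids]
    # 3. Start the agents again.
    plan += [(i, f"sudo systemctl start {service(i)}") for i in ids]
    return plan


for machine, command in raft_reset_plan(["0"]):
    print(f"machine {machine}: {command}")
```

Printing the plan first gives the operator a chance to review each command before running anything by hand.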

Changed in juju:
status:	New → Triaged
importance:	Undecided → Medium

John A Meinel (jameinel) wrote on 2021-04-07:

I thought we determined that Raft in HA doesn't actually get restored
properly. So this is a dangerous workaround. (Raft only auto-restores in
singleton controllers).

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	Medium → Low
tags:	added: expirebugs-bot
