remove HA etcd application in error state
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Etcd Charm | Triaged | Medium | Unassigned | |
Bug Description
With a 3-unit etcd HA deployment, `juju remove-application` removed 2 units and left the last one in an error state.
etcd 3.1.10 error 1 etcd jujucharms 434 ubuntu
etcd/2* error idle 2/lxd/1 10.244.245.235 2379/tcp hook failed: "cluster-
2019-07-05 05:11:39 DEBUG cluster- [... repeated DEBUG output from the cluster- hook, truncated in the original paste ...]
2019-07-05 05:11:39 ERROR juju.worker.
2019-07-05 05:11:39 DEBUG juju.machinelock machinelock.go:180 machine lock released for uniter (run relation-broken (3) hook)
summary:
- remove etcd application in error state
+ remove HA etcd application in error state
We've never encountered this, but I see how it could happen. The last unit is trying to unregister itself from a cluster that no longer exists. That happens here: https://github.com/charmed-kubernetes/layer-etcd/blob/aca040b46ac80e97da8ea3135b46216cf6bb854c/reactive/etcd.py#L598
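For illustration, a guarded handler along these lines would let the final unit tolerate the missing cluster instead of failing the hook. This is a minimal sketch, not the charm's actual code: the hook name, the hard-coded member ID, and the direct etcdctl call are assumptions standing in for whatever reactive/etcd.py does at that line.

```
# Hypothetical sketch of a defensive cluster-relation-broken handler.
# Names and values are illustrative, not taken from layer-etcd.
import subprocess

from charms.reactive import hook
from charmhelpers.core.hookenv import log


@hook('cluster-relation-broken')
def unregister():
    """Best-effort removal of this member when the peer relation goes away."""
    member_id = '8e9e05c52164694d'  # placeholder; the charm tracks the real ID
    try:
        # Ask the (possibly already torn down) cluster to drop this member.
        subprocess.check_call(['etcdctl', 'member', 'remove', member_id])
    except subprocess.CalledProcessError as err:
        # On the last unit of a removed application there is no cluster left
        # to answer, so log and return instead of leaving the unit in error.
        log('member remove failed, assuming cluster is gone: {}'.format(err))
```

The point of the sketch is only the try/except around the unregister call: when the whole application is being removed, a failed member removal on the final unit is expected and should not error the hook.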