Etcd Charm

etcd remains unhealthy after unit removal

Bug #1967569 reported by Berkay Tekin Öz on 2022-04-01

Affects      Status         Importance   Assigned to       Milestone
Etcd Charm   Fix Released   Undecided    Berkay Tekin Öz   Etcd Charm 1.24+ck1

Bug Description

Removing any unit (leader or not) from etcd leaves the cluster stuck in an unhealthy state. The main cause seems to be that the etcd peer list is not updated when a unit goes away, so the removed units remain registered as cluster members (dangling peers) that are unreachable.

Steps to reproduce:

1. Deploy easyrsa with `juju deploy cs:~containers/easyrsa-441`
2. Deploy etcd with `juju deploy cs:~containers/etcd-655`
3. Relate etcd and easyrsa with `juju add-relation etcd easyrsa`
4. Add 2 more etcd units with `juju add-unit -n 2 etcd`
5. Remove a unit from etcd with `juju remove-unit etcd/2`

Some related logs can be seen below:

unit-etcd-1: 21:02:16 INFO unit.etcd/1.juju-log Invoking reactive handler: reactive/etcd.py:112:check_cluster_health
unit-etcd-1: 21:02:18 ERROR unit.etcd/1.juju-log ['/snap/bin/etcd.etcdctl', 'cluster-health']
unit-etcd-1: 21:02:18 ERROR unit.etcd/1.juju-log {'ETCDCTL_API': '2', 'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt', 'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt', 'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key'}
unit-etcd-1: 21:02:18 ERROR unit.etcd/1.juju-log b'member 4092336adfba56b6 is healthy: got healthy result from https://10.20.194.211:2379\nfailed to check the health of member c5f431bd0a6193f3 on https://10.20.194.223:2379: Get https://10.20.194.223:2379/health: dial tcp 10.20.194.223:2379: connect: no route to host\nmember c5f431bd0a6193f3 is unreachable: [https://10.20.194.223:2379] are all unreachable\nmember d24dec4fbb4997cd is healthy: got healthy result from https://10.20.194.70:2379\ncluster is degraded\n'
unit-etcd-1: 21:02:18 ERROR unit.etcd/1.juju-log None
unit-etcd-1: 21:02:18 WARNING unit.etcd/1.juju-log Notice: Unit failed cluster-health check
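
To make the dangling peer in the output above easier to spot, here is a minimal Python sketch that parses `cluster-health` output and lists the members reported as unreachable. The function names (`parse_cluster_health`, `unreachable_members`) are hypothetical and not part of the charm, and the line pattern is an assumption inferred from the log shown above, not a documented format.

```python
import re

# Output copied from the etcd/1 log line above (etcdctl API v2, `cluster-health`).
SAMPLE = (
    b"member 4092336adfba56b6 is healthy: got healthy result from https://10.20.194.211:2379\n"
    b"failed to check the health of member c5f431bd0a6193f3 on https://10.20.194.223:2379: "
    b"Get https://10.20.194.223:2379/health: dial tcp 10.20.194.223:2379: connect: no route to host\n"
    b"member c5f431bd0a6193f3 is unreachable: [https://10.20.194.223:2379] are all unreachable\n"
    b"member d24dec4fbb4997cd is healthy: got healthy result from https://10.20.194.70:2379\n"
    b"cluster is degraded\n"
)

# Assumed line shape: "member <hex id> is <state>: ...". Lines that do not
# start with "member" (for example "failed to check ...") are ignored.
MEMBER_LINE = re.compile(
    r"^member (?P<id>[0-9a-f]+) is (?P<state>healthy|unhealthy|unreachable)\b"
)


def parse_cluster_health(raw):
    """Return a dict mapping member id to its reported state."""
    text = raw.decode() if isinstance(raw, bytes) else raw
    members = {}
    for line in text.splitlines():
        match = MEMBER_LINE.match(line)
        if match:
            members[match.group("id")] = match.group("state")
    return members


def unreachable_members(raw):
    """Return the sorted ids of members reported as unreachable."""
    states = parse_cluster_health(raw)
    return sorted(mid for mid, state in states.items() if state == "unreachable")


if __name__ == "__main__":
    for member_id in unreachable_members(SAMPLE):
        print(member_id)
```

For the log above this reports the single member `c5f431bd0a6193f3`, which is the peer at 10.20.194.223 that can no longer be reached. As a manual workaround (not necessarily how the charm fix works), such a member can be dropped with `etcdctl member remove <member-id>`.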