Provide an action to recover from a majority failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Etcd Charm | In Progress | Medium | Justin Clark |
Bug Description
An HA etcd cluster can normally be scaled down to a single node by simply removing the extra units. However, if the majority of the units need to be force-removed, the relation-departed hooks will not have a chance to run, the cluster loses quorum, and the surviving unit(s) will not accept new cluster members.
In order to recover from this situation, the etcd cluster has to be restarted once with the force-new-cluster option set to true. This should be wrapped in an action.
Example: let's assume we have a 3-node etcd cluster where etcd/0 is functional, while etcd/1 and etcd/2 are unrecoverable. To bring the cluster back to health, an operator needs to do the following:
1. juju remove-unit --force etcd/1
2. juju remove-unit --force etcd/2
3. vim /var/snap/
4. service snap.etcd.etcd restart
5. vim /var/snap/
6. juju add-unit -n2 etcd
Steps 3 to 6 should be performed by an action.
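As a rough illustration of what such an action might wrap, here is a minimal shell sketch of steps 3 to 5. It is not the charm's implementation. The function names `set_force_new_cluster` and `recover_cluster` are hypothetical, the config file path is left to the caller because the path in the steps above is truncated, and the sketch assumes the config is a YAML file with a top-level `force-new-cluster:` key and that GNU `sed` is available.

```shell
#!/usr/bin/env bash
# Sketch only: toggles the force-new-cluster option in an etcd config file.
# Assumption: the config is YAML with a top-level "force-new-cluster:" key.
# The config path is supplied by the caller; it is not known from the report.

# set_force_new_cluster FILE VALUE  (hypothetical helper)
# Rewrites an existing "force-new-cluster:" line, or appends one if absent.
set_force_new_cluster() {
  local conf="$1" value="$2"
  if grep -q '^force-new-cluster:' "$conf"; then
    sed -i "s/^force-new-cluster:.*/force-new-cluster: ${value}/" "$conf"
  else
    printf 'force-new-cluster: %s\n' "$value" >> "$conf"
  fi
}

# recover_cluster FILE  (hypothetical helper)
# Mirrors steps 3 to 5: enable the option, restart once, disable it again.
recover_cluster() {
  local conf="$1"
  set_force_new_cluster "$conf" true
  service snap.etcd.etcd restart
  set_force_new_cluster "$conf" false
}
```

An operator would run `recover_cluster` with the path to the etcd config on the surviving unit, and then run `juju add-unit -n2 etcd` as in step 6. Resetting the option after the restart matters so that a later restart does not force a new cluster again.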
Changed in charm-etcd: | |
importance: | Undecided → Medium |
status: | New → Triaged |
Changed in charm-etcd: | |
assignee: | nobody → Justin Clark (justinclark) |
status: | Triaged → In Progress |
Changed in charm-etcd: | |
milestone: | 1.28 → 1.28+ck1 |
Changed in charm-etcd: | |
milestone: | 1.28+ck1 → 1.29 |
Adding a link to the PR which was started to address this: https://github.com/charmed-kubernetes/layer-etcd/pull/177