Unable to defer "storage_detaching" or tear-down related events
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Medium | Unassigned |
Bug Description
----
*Context:*
For databases, a scale-down event must be handled with care to avoid any risk of data loss.
Sometimes it is necessary to postpone the removal of a unit because removing it is not safe:
- at a particular moment (e.g. the primary copies of the data cannot yet be relocated elsewhere in the cluster)
- in a particular fashion (e.g. removing the majority of the nodes at once, or simply removing multiple units at once rather than in a rolling manner)
In these cases, the teardown of a unit needs to be deferred until it can be handled gracefully, while the unit keeps functioning normally as if no termination event had been received.
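To make the conditions above concrete, here is a minimal sketch of the kind of safety check a charm might run before allowing a teardown. The names `ClusterState` and `safe_to_remove` and their fields are hypothetical and exist only for illustration; they are not part of Juju or of any charm library.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of the cluster, for illustration only."""
    total_units: int             # units currently in the cluster
    departing_units: int         # units being removed at the same time
    primaries_relocatable: bool  # can the primary copies move elsewhere?


def safe_to_remove(state: ClusterState) -> bool:
    """Return True only if the teardown can proceed right now."""
    # "At a particular moment": the primary copies of the data held by the
    # departing unit must have somewhere else to go.
    if not state.primaries_relocatable:
        return False
    # "In a particular fashion": only rolling removal is accepted, so more
    # than one departing unit at once is refused.
    if state.departing_units > 1:
        return False
    return True


print(safe_to_remove(ClusterState(3, 1, True)))   # rolling removal
print(safe_to_remove(ClusterState(3, 2, True)))   # two units at once
print(safe_to_remove(ClusterState(3, 1, False)))  # primaries cannot move
```

When such a check fails, the charm would ideally postpone the teardown and re-evaluate later, which is exactly what is not possible today.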
----
*Current:*
It is currently not possible to defer the termination of a unit in the natural way, i.e. by calling event.defer().
The only way (hack) to do so is by putting the unit in an error state in "storage_
This comes at a price:
- The unit does NOT receive subsequent events; it only keeps retrying the failed termination event. This effectively leaves the unit "diminished" compared to the rest of the nodes.
This has broader impacts, such as:
- If the leader unit is the one that received the termination event, erroring on the "storage_detaching" event effectively prevents leader re-election from happening.
As a result, hooks whose processing is assigned specifically to the leader unit will not trigger, and the cluster will eventually end up in an unexpected and unstable state.
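The cost of the workaround can be illustrated with a toy model. This is only a sketch of the behaviour described in this report, not Juju's actual implementation: `ToyUnit`, `refuse_detach` and the use of "config_changed" as the follow-up event are assumptions made for the example.

```python
from collections import deque


class ToyUnit:
    """Toy model of a unit's hook queue; NOT Juju's actual implementation."""

    def __init__(self, handlers):
        self.handlers = handlers  # hook name -> callable that raises on failure
        self.queue = deque()      # pending hooks, oldest first
        self.handled = []         # hooks that completed successfully
        self.in_error = False

    def emit(self, hook):
        self.queue.append(hook)

    def run_once(self):
        """Try the hook at the head of the queue; on failure it stays there."""
        if not self.queue:
            return
        hook = self.queue[0]
        try:
            self.handlers[hook]()
        except Exception:
            self.in_error = True  # the same hook is retried on the next run
            return
        self.in_error = False
        self.handled.append(self.queue.popleft())


def refuse_detach():
    # The workaround from this report: fail the hook to block the teardown.
    raise RuntimeError("not safe to detach storage yet")


unit = ToyUnit({
    "storage_detaching": refuse_detach,
    "config_changed": lambda: None,
})
unit.emit("storage_detaching")
unit.emit("config_changed")
for _ in range(5):
    unit.run_once()

print(unit.in_error, unit.handled, list(unit.queue))
# → True [] ['storage_detaching', 'config_changed']
```

In this model the later event is never processed, because every attempt is spent retrying the failed "storage_detaching" hook. A deferral mechanism would instead let the unit set the teardown aside and keep handling other events.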
----
*Environment:*
- Juju: 2.9.38.1
----
Thank you
There is an ongoing discussion about this issue. No decisions have been made so far about how to approach this problem.