"NOT tolerant to any failures" status shouldn't be considered as green
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL InnoDB Cluster Charm | Triaged | Wishlist | Unassigned |
Bug Description
When one node out of 3 units fails, the Juju status message changes to "Cluster is NOT tolerant to any failures. 1 member is not active", which is good. However, the workload status itself is still "active", and that shouldn't be considered green. The expected status in this case is "blocked", since it requires human intervention.
Steps:
1. deploy (in this case, on the LXD provider with a single L2 network for simplicity, as the charm behaves differently with L3)
$ juju deploy --series focal mysql-innodb-
[status]
Unit Workload Agent Machine Public address Ports Message
mysql-innodb-
mysql-innodb-
mysql-innodb-
2. simulate a node failure
juju ssh mysql-innodb-
sudo systemctl mask mysql.service
sudo systemctl kill -s 9 mysql.service
sudo systemctl stop jujud-machine-
"
[status]
Unit Workload Agent Machine Public address Ports Message
mysql-innodb-
mysql-innodb-
mysql-innodb-
^^^ the status turned into "NOT tolerant to any failures" but the workload status is active/green. This should be blocked or something.
3. remove the failed machine
$ juju remove-machine --force 0
[status]
Unit Workload Agent Machine Public address Ports Message
mysql-innodb-
mysql-innodb-
^^^ "'cluster' incomplete" is also a weird status. In this step, all units should be blocked or something.
4. re-add a unit.
$ juju add-unit mysql-innodb-
Unit Workload Agent Machine Public address Ports Message
mysql-innodb-
mysql-innodb-
mysql-innodb-
^^^ all green as expected.
Changed in charm-mysql-innodb-cluster: | |
status: | New → Triaged |
importance: | Undecided → Wishlist |
tags: | added: good-first-bug |
I don't disagree with the sentiment in the bug report. The active, blocked, error, unknown statuses are currently viewed (by the charm) from the perspective of "whether that instance is working correctly". The 'cluster-level' view status doesn't really exist, as it's a meta-state of the complete cluster.
We'd need to change the meaning of the unit status *when clustered* to mean something subtly different: a compound of the individual unit status and the overall cluster health. That is what the status message is trying to indicate, i.e. the charm is aware of the cluster status.
Ideally, there would be a 'degraded' status (or similar) that could be set to indicate a cluster-level status, AND the ability of the units to indicate their individual statuses. Sadly, this isn't currently available.
I'd probably be okay with changing the status to blocked in the event that the cluster is degraded, but I'm interested in other viewpoints.
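
To make the "compound" idea concrete, here is a minimal Python sketch of how a unit's own status and the cluster health might be folded into one workload status. This is not the charm's code: the function name `compound_workload_status`, the `DEGRADED_CLUSTER_STATES` set, and the choice of which cluster states count as degraded are all assumptions made for illustration. The cluster state strings are modelled on the values MySQL Shell reports for an InnoDB Cluster, which is also an assumption about where the charm would get them.

```python
from typing import Tuple

# Assumption: cluster states that should turn an otherwise healthy unit
# "blocked". Which states belong here is a policy decision, not something
# the bug report settles.
DEGRADED_CLUSTER_STATES = {"OK_NO_TOLERANCE", "NO_QUORUM"}


def compound_workload_status(
    unit_status: str, cluster_state: str, cluster_message: str
) -> Tuple[str, str]:
    """Combine the per-unit status with the cluster-level health.

    Returns a (workload_status, message) pair. The message is always the
    cluster message, so the operator still sees why the status changed.
    """
    # A unit that is already not active keeps its own status; the
    # cluster view must not hide an instance-level problem.
    if unit_status != "active":
        return unit_status, cluster_message

    # The instance is fine, but the cluster as a whole is degraded.
    if cluster_state in DEGRADED_CLUSTER_STATES:
        return "blocked", cluster_message

    return "active", cluster_message


if __name__ == "__main__":
    status, message = compound_workload_status(
        "active",
        "OK_NO_TOLERANCE",
        "Cluster is NOT tolerant to any failures. 1 member is not active",
    )
    print(status, "-", message)
```

With this mapping, the surviving units in step 2 would report "blocked" with the same message they show today, while a unit with its own problem would keep reporting that problem.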