MySQL InnoDB Cluster Charm

"NOT tolerant to any failures" status shouldn't be considered as green

Bug #1997235 reported by Nobuto Murata on 2022-11-21

This bug affects 1 person

Affects                     Status   Importance  Assigned to  Milestone
MySQL InnoDB Cluster Charm  Triaged  Wishlist    Unassigned

Bug Description

When one of 3 units fails, the Juju status message changes to "Cluster is NOT tolerant to any failures. 1 member is not active", which is good. However, the workload status itself is still "active", and that shouldn't be considered green. The expected status in this case is "blocked", since the situation requires human intervention.

Steps:

1. deploy (in this case, LXD provider with a single L2 network for simplicity as the charm acts differently with L3)

$ juju deploy --series focal mysql-innodb-cluster --channel latest/edge -n3

[status]
Unit                     Workload  Agent  Machine  Public address  Ports  Message
mysql-innodb-cluster/0   active    idle   0        10.0.9.169             Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/1   active    idle   1        10.0.9.94              Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/2*  active    idle   2        10.0.9.57              Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.

2. simulate a node failure

juju ssh mysql-innodb-cluster/0 "
sudo systemctl mask mysql.service
sudo systemctl kill -s 9 mysql.service

sudo systemctl stop jujud-machine-0.service
"

[status]

Unit                     Workload  Agent  Machine  Public address  Ports  Message
mysql-innodb-cluster/0   unknown   lost   0        10.0.9.169             agent lost, see 'juju show-status-log mysql-innodb-cluster/0'
mysql-innodb-cluster/1   active    idle   1        10.0.9.94              Unit is ready: Mode: R/O, Cluster is NOT tolerant to any failures. 1 member is not active.
mysql-innodb-cluster/2*  active    idle   2        10.0.9.57              Unit is ready: Mode: R/W, Cluster is NOT tolerant to any failures. 1 member is not active.

^^^ the status turned into "NOT tolerant to any failures" but the workload status is active/green. This should be blocked or something.

3. remove the failed machine

$ juju remove-machine --force 0

[status]
Unit                     Workload  Agent  Machine  Public address  Ports  Message
mysql-innodb-cluster/1   waiting   idle   1        10.0.9.94              'cluster' incomplete
mysql-innodb-cluster/2*  active    idle   2        10.0.9.57              Unit is ready: Mode: R/W, Cluster is NOT tolerant to any failures. 1 member is not active.

^^^ "'cluster' incomplete" is also a weird status. In this step, all units should be blocked or something.

4. re-add a unit

$ juju add-unit mysql-innodb-cluster

[status]
Unit                     Workload  Agent  Machine  Public address  Ports  Message
mysql-innodb-cluster/1   active    idle   1        10.0.9.94              Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/2*  active    idle   2        10.0.9.57              Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/3   active    idle   3        10.0.9.143             Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.

^^^ all green as expected.
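
For anyone who wants to catch the step 2 condition from outside the charm in the meantime, here is a minimal sketch in Python. It assumes the parsed output of `juju status --format json` nests as applications, then units, then `workload-status` with `current` and `message` keys. The function name and the sample data are hypothetical and written by hand, not captured from a deployment.

```python
# Substring taken from the status messages shown in step 2.
NO_TOLERANCE_MARKER = "NOT tolerant to any failures"


def units_green_but_degraded(status, app="mysql-innodb-cluster"):
    """Return units whose workload is "active" although the message
    reports that the cluster has no fault tolerance left."""
    units = status.get("applications", {}).get(app, {}).get("units", {})
    flagged = []
    for name, unit in sorted(units.items()):
        workload = unit.get("workload-status", {})
        if (workload.get("current") == "active"
                and NO_TOLERANCE_MARKER in workload.get("message", "")):
            flagged.append(name)
    return flagged


# Hand-written sample mirroring the step 2 output (assumed JSON layout).
sample = {
    "applications": {
        "mysql-innodb-cluster": {
            "units": {
                "mysql-innodb-cluster/0": {
                    "workload-status": {
                        "current": "unknown",
                        "message": "agent lost",
                    }
                },
                "mysql-innodb-cluster/1": {
                    "workload-status": {
                        "current": "active",
                        "message": "Unit is ready: Mode: R/O, Cluster is "
                                   "NOT tolerant to any failures. "
                                   "1 member is not active.",
                    }
                },
                "mysql-innodb-cluster/2": {
                    "workload-status": {
                        "current": "active",
                        "message": "Unit is ready: Mode: R/W, Cluster is "
                                   "NOT tolerant to any failures. "
                                   "1 member is not active.",
                    }
                },
            }
        }
    }
}

print(units_green_but_degraded(sample))
```

In a real check the dictionary would come from `json.loads()` on the command output rather than from the literal above. Matching on the message text is fragile, which is part of why a proper workload status would be preferable.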

Alex Kavanagh (ajkavanagh) on 2022-11-22

Changed in charm-mysql-innodb-cluster:
status:	New → Triaged
importance:	Undecided → Wishlist
tags:	added: good-first-bug

Alex Kavanagh (ajkavanagh) wrote on 2022-11-22:

I don't disagree with the sentiment in the bug report. The active, blocked, error, unknown statuses are currently viewed (by the charm) from the perspective of "whether that instance is working correctly". The 'cluster-level' view status doesn't really exist, as it's a meta-state of the complete cluster.

We'd need to change the meaning of the unit status *when clustered* to mean something subtly different: a compound of the individual unit status and the overall cluster health, which is what the status message is trying to indicate - i.e. the charm is aware of the cluster status.

Ideally, there would be a 'degraded' status (or similar) that could be set to indicate a cluster-level status, AND the ability of the units to indicate their individual statuses. Sadly, this isn't currently available.

I'd probably be okay with changing the status to blocked in the event that the cluster is degraded, but I'm interested in other viewpoints.
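
To make the "compound" idea concrete, here is a rough sketch of how a per-unit health flag and a cluster-level state might be folded into one workload status. This is not the charm's actual code: the function name, its parameters, and the set of degraded states are all hypothetical. The state names are modelled on those MySQL Shell reports for a cluster and should be treated as assumptions.

```python
# Cluster-level states treated as "degraded" in this sketch (assumed names).
DEGRADED_STATES = {"OK_NO_TOLERANCE", "OK_NO_TOLERANCE_PARTIAL", "NO_QUORUM"}


def compound_workload_status(unit_ok, cluster_state, status_text):
    """Combine per-unit health and cluster health into (status, message).

    unit_ok:       True if this instance itself is working correctly.
    cluster_state: cluster-level state string (see DEGRADED_STATES).
    status_text:   human-readable cluster summary for the message.
    """
    # A broken unit is blocked regardless of what the cluster reports.
    if not unit_ok:
        return ("blocked", "MySQL is not running correctly on this unit")
    # A healthy unit in a degraded cluster also needs human attention.
    if cluster_state in DEGRADED_STATES:
        return ("blocked", "Cluster is degraded: " + status_text)
    return ("active", "Unit is ready: " + status_text)


print(compound_workload_status(
    True,
    "OK_NO_TOLERANCE_PARTIAL",
    "Cluster is NOT tolerant to any failures. 1 member is not active.",
))
```

The trade-off is the one described above: once "blocked" can come from either level, the status alone no longer says whether the unit or the cluster is at fault, so the message has to carry that distinction.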
