The error 'inappropriate relation-changed for <unit>: unit has not joined' during upgrade

Bug #1892294 reported by Hua Zhang
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Only the unit gnocchi/2 ended up in a failed state while upgrading the juju controllers and models to 2.8.1; all other units are in the active state.

$ juju status | grep -E "error|block|wait|exec|lost|down|fail"
gnocchi 4.2.5 error 2/3 gnocchi jujucharms 30 ubuntu
gnocchi/2 active failed 11/lxd/4 10.139.108.126 8041/tcp

However, the flag run-default-upgrade-charm does appear in the log, so upgrade-charm was triggered according to [1].

The customer said they only ran 'juju upgrade-juju' to upgrade the juju controllers and models, and did not run 'upgrade-charm'. I wasn't sure whether 'juju upgrade-juju' triggers the upgrade-charm hook, so I ran a test, which showed that it does not. What triggered upgrade-charm is still a mystery to me.

We put aside who triggered upgrade-charm and went on to find two possible clues from the log:

1. The error 'TypeError: expected str, bytes or os.PathLike object, not NoneType'

2. Many occurrences of the error 'inappropriate relation-changed for <unit>: unit has not joined'

2020-07-27 06:01:36 ERROR juju.worker.uniter agent.go:31 resolver loop error: preparing operation "run relation-changed (78; unit: mysql/2) hook": inappropriate "relation-changed" for "mysql/2": unit has not joined
2020-07-27 06:01:36 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: preparing operation "run relation-changed (78; unit: mysql/2) hook": inappropriate "relation-changed" for "mysql/2": unit has not joined
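
To gauge how widespread error 2) is, the uniter messages can be tallied per remote unit. The following is a minimal sketch and not part of the original investigation: the function name `count_not_joined` is hypothetical, and it assumes the log format of the two lines quoted above, read from standard input. Only the "resolver loop error" line is counted, because the dependency engine logs the same failure a second time.

```shell
#!/bin/sh
# Hypothetical helper: count "unit has not joined" errors per remote unit.
# Reads a Juju unit log on stdin; assumes the message format quoted above.
count_not_joined() {
    # Keep only the uniter's own line so each failure is counted once,
    # then extract the remote unit and hook name and tally them.
    grep 'resolver loop error' \
        | grep 'unit has not joined' \
        | sed -n 's/.*inappropriate "\([^"]*\)" for "\([^"]*\)".*/\2 \1/p' \
        | sort | uniq -c | awk '{print $2, $3, $1}'
}
```

It could be run on the affected unit as `count_not_joined < /var/log/juju/unit-gnocchi-2.log`, which prints one line per remote unit and hook with the number of failures.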

Error 1) above is gone after applying patch [2] by hand and running the following two commands, but gnocchi/2 is still in a failed state.

juju run --unit gnocchi/2 -- 'charms.reactive -p clear_flag run-default-upgrade-charm ; charms.reactive -p clear_flag ceph.create_pool.req.sent ; hooks/update-status'
juju run --unit gnocchi/2 -- 'hooks/shared-db-relation-changed'

For error 2) above, wallyworld told me that juju 2.8.2 [3] should fix the problem: it could be related to the change in how unit state is stored, and 2.8.2 includes the code changes necessary to address the hook execution bug.

My question: even if 2.8.2 fixes the problem, it was caused by upgrading the controllers and models to 2.8.1. How can we continue upgrading the controllers and models from 2.8.1 to 2.8.2 while gnocchi/2 is still in a failed state? Is there a manual step needed to fix systems that have already been upgraded?

We also checked the log of mysql/2:

2020-07-22 12:23:36 INFO juju-log Unit is ready
2020-07-22 12:23:53 ERROR juju.worker.dependency engine.go:671 "leadership-tracker" manifold worker returned unexpected error: error while mysql/2 waiting for mysql leadership release: error blocking on leadership release: lease manager stopped
2020-07-22 12:23:53 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: tls: use of closed connection
2020-07-22 12:23:53 ERROR juju.worker.uniter agent.go:31 resolver loop error: could not acquire lock: cancelled acquiring mutex
2020-07-22 12:23:53 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: watcher has been stopped (stopped)
2020-07-22 12:23:53 ERROR juju.worker.dependency engine.go:671 "migration-inactive-flag" manifold worker returned unexpected error: watcher has been stopped (stopped)
2020-07-22 12:23:53 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2020-07-22 12:23:53 ERROR juju.worker.uniter agent.go:34 updating agent status: connection is shut down
2020-07-22 12:23:57 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [f2b55b] "unit-mysql-2" cannot open api: unable to connect to API: dial tcp 10.139.108.184:17070: connect: connection refused
2020-07-22 12:24:01 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [f2b55b] "unit-mysql-2" cannot open api: unable to connect to API: dial tcp 10.139.108.147:17070: connect: connection refused

We also checked the relation between gnocchi/2 and mysql/2:

$ juju run -u gnocchi/2 -- relation-ids shared-db
shared-db:78
$ juju run -u gnocchi/2 -- relation-list -r shared-db:78
(The command returned no output.)

That's all the context, history, and logs I have for this case. If you need anything else, please let me know. Thanks.

[1] https://gitlab.global.garrservices.it/cloud/charms/gnocchi/blob/abffaa2578b701a72fece33c51cb2cf02d54916d/reactive/layer_openstack.py#L42
[2] https://review.opendev.org/#/c/716545/4/src/reactive/gnocchi_handlers.py
[3] https://discourse.juju.is/t/wip-juju-2-8-2-release-notes/3396

Hua Zhang (zhhuabj)
tags: added: sts
Changed in juju:
status: New → Triaged
Heather Lanigan (hmlanigan) wrote :

This may be a duplicate of https://bugs.launchpad.net/juju/+bug/1890828. I need more data to confirm please.

Can you confirm if the unit upgraded before or after the relation failures? It will be in the unit's log. I look for "running jujud".

The best recovery might be to remove the gnocchi/2 unit and add a new unit to replace it.

The errors from mysql/2 look to be a duplicate of: https://bugs.launchpad.net/juju/+bug/1891234

Heather Lanigan (hmlanigan) wrote :

A full log from the gnocchi/2 unit would be helpful.

Hua Zhang (zhhuabj) wrote :
Hua Zhang (zhhuabj) wrote :
Hua Zhang (zhhuabj) wrote :

Hi Heather,

The customer reported this case on Jul 27, 2020. I can confirm that the relation failures came after the unit upgrade. (As I said in the bug description, the customer didn't run upgrade-charm; they only ran 'juju upgrade-juju' to upgrade the controllers and models, but the flag run-default-upgrade-charm does appear in the log.) I uploaded unit-gnocchi-2.log and machine-11-lxd-4.log from gnocchi/2; the 'running jujud' output is as follows. Please let me know if you need more information.

$ grep -r 'running jujud' var/log/juju/machine-11-lxd-4.log
2019-04-17 16:19:31 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2019-05-13 01:48:30 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2019-12-09 05:59:45 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2019-12-09 07:16:46 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2019-12-09 08:57:53 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2019-12-09 11:03:34 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2019-12-11 14:05:23 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2019-12-14 11:21:48 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2020-03-10 14:37:31 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2020-04-26 14:22:33 INFO juju.cmd supercommand.go:57 running jujud [2.5.4 gc go1.11.6]
2020-05-04 05:59:00 INFO juju.cmd supercommand.go:57 running jujud [2.6.10 gc go1.11.13]
2020-05-04 06:51:48 INFO juju.cmd supercommand.go:83 running jujud [2.7.6 4da406fb326d7a1255f97a7391056641ee86715b gc go1.12.17]
2020-07-11 05:18:42 INFO juju.cmd supercommand.go:83 running jujud [2.7.6 4da406fb326d7a1255f97a7391056641ee86715b gc go1.12.17]
2020-07-22 12:30:38 INFO juju.cmd supercommand.go:91 running jujud [2.8.1 0 16439b3d1c528b7a0e019a16c2122ccfcf6aa41f gc go1.14.4]
2020-07-27 03:05:30 INFO juju.cmd supercommand.go:91 running jujud [2.8.1 0 16439b3d1c528b7a0e019a16c2122ccfcf6aa41f gc go1.14.4]
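
Most of these lines are agent restarts on the same version, so the upgrade points are easier to read once repeats are collapsed. The following is a minimal sketch, assuming the 'running jujud' line format shown above; the function name `jujud_version_changes` is hypothetical. It prints the timestamp and version only when the version differs from the previous line.

```shell
#!/bin/sh
# Hypothetical helper: show only the points where the jujud version changes.
# Reads 'running jujud' log lines on stdin (format as in the grep output above).
jujud_version_changes() {
    # sed keeps "<date> <time> <version>"; awk drops consecutive repeats.
    sed -n 's/^\([^ ]* [^ ]*\) .*running jujud \[\([^] ]*\).*/\1 \2/p' \
        | awk '$3 != prev { print } { prev = $3 }'
}
```

For example, `grep 'running jujud' var/log/juju/machine-11-lxd-4.log | jujud_version_changes` reduces the listing above to one line per upgrade. The last of those, 2.8.1 at 2020-07-22 12:30:38, precedes the relation-changed errors quoted in the bug description (2020-07-27 06:01:36).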

Hua Zhang (zhhuabj) wrote :

Hi Heather,

Is there any update on this case? We are seeing this in production. Thanks very much!

Heather Lanigan (hmlanigan) wrote :

In the case of 1890828, the relation errors occurred after the machine upgraded and before the unit upgraded.

Per the logs in #3 and #4, the relation errors in this case happened after the unit upgraded.

Unfortunately, the fix for 1890828 only removes the possibility of the problem occurring in the future; it does not resolve the current state. So we cannot verify, by upgrading again, that this issue is the same and thus fixed.

To resolve this case, I recommend adding a new gnocchi unit, to be a replacement, then removing gnocchi/2.
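
The replacement order described above can be sketched as a dry run. This is an illustration only: the helper name `print_unit_replacement` is hypothetical, it merely prints the commands rather than running them, and it assumes the new unit needs no special placement and that the operator confirms it is active before removing the old one.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands for replacing a unit.
# Nothing is executed; review the output and run the commands by hand.
print_unit_replacement() {
    app=$1        # application name, e.g. gnocchi
    old_unit=$2   # unit to retire, e.g. gnocchi/2
    echo "juju add-unit $app"
    echo "juju status $app    # wait until the new unit is active"
    echo "juju remove-unit $old_unit"
}

print_unit_replacement gnocchi gnocchi/2
```

Adding the replacement first keeps the application at three units throughout, so capacity does not drop while the failed unit is retired.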

Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Undecided → Low
tags: added: expirebugs-bot