upgrade-series prepare puts units into failed state if a subordinate does not support the target series

Bug #2008509 reported by Diko Parvanov
This bug affects 3 people
Affects                Status        Importance  Assigned to       Milestone
Canonical Juju         Fix Released  High        Yang Kelvin Liu
OpenStack Charm Guide  Triaged       High        Peter Matulis

Bug Description

Using a 2.9.38 client, controller and model. While upgrading focal to jammy with upgrade-series prepare, with an lldpd subordinate charm present, execution of the pre-series-upgrade hooks started but failed, because that charm doesn't support jammy. All subordinate units plus the principal units went into an endless loop with error status and couldn't be fixed/resolved.

Juju shouldn't trigger prepare in this case; pre-checks are necessary and should prevent the operator from doing it unless --force is specified.
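
A minimal sketch of the intended behaviour (the error wording is borrowed from the hacluster case in the comments below; the charm name is a placeholder):

juju upgrade-series 9 prepare jammy -y
# expected: ERROR charm "<subordinate>" does not support jammy, force not used
# no pre-series-upgrade hooks should run unless the operator explicitly passes --force:
juju upgrade-series 9 prepare jammy -y --force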

Now the cloud is in a blocked state and the charms can't complete:

nova-compute-kvm-sriov/6 blocked idle 9 10.11.2.184 Ready for do-release-upgrade and reboot. Set complete when finished.
  ceilometer-agent/0 blocked idle 10.11.2.184 Services not running that should be: memcached, ceilometer-agent-compute
  ovn-chassis-sriov/1 blocked failed 10.11.2.184 Ready for do-release-upgrade and reboot. Set complete when finished.

juju upgrade-series 9 complete
ERROR machine "9" can not complete, it is either not prepared or already completed

juju upgrade-series 9 prepare jammy -y
ERROR Upgrade series is currently being prepared for machine "9".

Diko Parvanov (dparv)
description: updated
Changed in juju:
importance: Undecided → High
milestone: none → 2.9.43
status: New → Triaged
tags: added: subordinate upgrade-charm
Revision history for this message
Trent Lloyd (lathiat) wrote :

I hit this issue in production with series-upgrade of a Yoga OpenStack cloud from focal->jammy. The issue for us was the hacluster charm. The 2.0.3 charm only supports focal while the 2.4 charm supports both focal+jammy. It still occurs on 2.9.42.

It's not clear from the original description, but the "upgrade-series prepare" does error with the following, after typing yes to confirm the upgrade:
"ERROR charm "hacluster" does not support jammy, force not used"

However the pre-series-upgrade hooks run in the background anyway, even though the juju client exits after that error. The hacluster unit then goes into the failed state.

In our debugging, db.machineUpgradeSeriesLocks is empty. You can re-run prepare with --force, which then creates a lock, however the units are still stuck in a failed state. It seems the failed run leaves things in a state that the subsequent transaction won't accept, perhaps because the hooks already ran.
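
For reference, this is roughly how we inspected the lock state (a sketch, run from a Mongo shell attached to the controller's "juju" database):

db.machineUpgradeSeriesLocks.find().pretty()
// returned no documents here, i.e. no upgrade-series lock existed for the machine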

= Workaround =
If you only attempted the "prepare" on a single unit, you can force-remove that unit, scale it back out, upgrade the hacluster charm and then proceed with a series upgrade. I was not able to find a way to get the broken unit out of the broken state.

juju remove-unit keystone/0 --force
juju add-unit keystone
juju upgrade-charm keystone --channel 2.4/stable

= Reproducer =
You can deploy a simple bundle with keystone and hacluster to reproduce the issue. I have attached the bundle as keystone-focal-yoga.yaml

juju add-model keystone1
juju deploy ./keystone-focal-yoga.yaml
juju upgrade-series 0 prepare jammy
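
What you should then see (summarising the error and status reported elsewhere in this bug; exact wording varies by Juju version):
# the prepare command above errors with something like
#   ERROR charm "hacluster" does not support jammy, force not used
# ...yet `juju status` then shows the hacluster subordinate's agent as "failed",
# because the pre-series-upgrade hooks were run anyway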

= Expectations =

- This is a critical issue that needs prioritising for a new 2.9.43 release.

- However, we also need to determine whether we can easily fix this situation once it has happened: people are very likely to get stuck, and removing and re-adding the broken unit is very error prone in practice and best avoided if possible.

Revision history for this message
Trent Lloyd (lathiat) wrote :
summary: - Juju should restrict upgrade-series prepare if a subordinate charm
- doesn't support the new series
+ upgrade-series prepare puts units into failed state if a subordinate
+ does not support the target series
Revision history for this message
Trent Lloyd (lathiat) wrote :

I have confirmed that this also happens on a 3.2-beta3 controller with a freshly deployed version of the above reproducer.

3.2/beta: 3.2-beta3 2023-04-24 (22984) 80MB -

Revision history for this message
Felipe Reyes (freyes) wrote :

Adding a task for the charm-guide since some guidance should be provided in the series upgrade page - https://docs.openstack.org/charm-guide/latest/admin/upgrades/series.html

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 2008509] Re: upgrade-series prepare puts units into failed state if a subordinate does not support the target series

On Tue, 2023-05-09 at 15:46 +0000, Felipe Reyes wrote:
> Adding a task for the charm-guide since some guidance should be provided
> in the series upgrade page - https://docs.openstack.org/charm-
> guide/latest/admin/upgrades/series.html

The guidance should be around identifying which charms are focal-only, which charms support both focal and jammy, and which tracks users are expected to be on when going from focal to jammy.
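
One quick way to check this is `juju info`, which shows a charm's channels and the series/bases they support (available from 2.9 onwards; the exact output layout differs between Juju versions, so treat this as a sketch):

juju info hacluster
# compare the series supported by the track you are on (e.g. 2.0.x, focal only)
# with the track you would move to (e.g. 2.4, focal and jammy)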

PS: be aware of bug https://bugs.launchpad.net/juju/+bug/2008248

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Felipe,
The bundle above doesn't work for me. Does it require an OpenStack cloud?

unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed Traceback (most recent call last):
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/ha-relation-changed", line 937, in <module>
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed main()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/ha-relation-changed", line 930, in main
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed hooks.execute(sys.argv)
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/hookenv.py", line 963, in execute
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed self._hooks[hook_name]()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1896, in wrapped_f
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed return restart_on_change_helper(
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/host.py", line 863, in restart_on_change_helper
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed r = lambda_f()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1897, in <lambda>
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed (lambda: f(*args, **kwargs)),
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/ha-relation-changed", line 603, in ha_changed
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed CONFIGS.write_all()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 325, in write_all
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed self.write(k)
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 313, in write
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed _out = self.render(config_file).encode('UTF-8')
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 273, in render
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed ctxt = ostmpl.context()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 107, in context
unit-keystone-0: 09:36:36 WAR...


Changed in charm-guide:
importance: Undecided → High
status: New → Triaged
Changed in charm-guide:
assignee: nobody → Peter Matulis (petermatulis)
Revision history for this message
Trent Lloyd (lathiat) wrote :

The test bundle is using the hacluster charm, which needs to set up a Virtual IP on the interface (using corosync/pacemaker).

To do this it requires the following options to be set, specific to the deployed environment. Apologies, I didn't think about that in my original description.

(1) vip_iface is set to the real interface (for AWS: eth0)
(2) It also requires that all 3 machines have an IP address in the same subnet, and an IP in that subnet is set as the config option "vip". Since AWS uses different subnets for each AZ, we must lock the application to a single region's availability zone.

I made it work on AWS by doing the following

(1) Change the keystone constraints to include "zones=us-east-1b" or any specific AZ (e.g. ap-southeast-2a). For my usage, I also set instance-type=t2.micro as a constraint for both mysql-innodb-cluster and keystone.

(2) Find the private VPC subnet for that specific AZ in your AWS console under VPC -> Subnets; for me it was:
172.31.32.0/20 (172.31.32.0 - 172.31.47.255; you can calculate the range with sipcalc, as sketched below)

(3) Modify 'vip' to be an IP in this subnet that is not in use, e.g. in my case I could use: 172.31.47.240
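
For step (2), a quick way to get the usable range (assuming sipcalc is installed; any subnet calculator works; output abridged):

sipcalc 172.31.32.0/20
# Usable range  - 172.31.32.1 - 172.31.47.254
# pick any unused address in that range for the "vip" option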

Then the model should deploy and reproduce the issue. Otherwise I'd suggest creating and publishing a test subordinate charm with 20.04 support only, and trying with that.

Note: This VIP won't actually work in an AWS environment, but that doesn't matter. We just have to convince the charm to deploy so that the juju issue can be reproduced. There is no need to access or use the deployed application.

If that doesn't work, I'd suggest publishing a charm that supports 20.04 only, but one without base-specific builds (otherwise you'll separately hit Bug #2008248).

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Diko, Trent,

From my testing, Juju does check the supported series of both the principal and subordinate charms before running the `pre-series-upgrade` hook.
So the only way to reproduce this is to run the `upgrade-series` command with `--force`.
Can you confirm whether or not you ran the command with --force?

Here are the steps I did: https://pastebin.ubuntu.com/p/K66Svt8KpT/

So if you unfortunately run into this situation (the charm's `pre-series-upgrade` hook failed) with --force,
we have to apply a workaround to fix it (Juju doesn't currently support aborting the process).

Changed in juju:
milestone: 2.9.43 → 2.9.44
Revision history for this message
Trent Lloyd (lathiat) wrote (last edit ):

In your steps (https://pastebin.ubuntu.com/p/K66Svt8KpT/), after you get this error (without --force):
ERROR series "jammy" not supported by charm "local:focal/lxd-profile-subordinate-0", supported series are: quantal, xenial, bionic, focal

Did you actually check 'juju status'?

My issue, which happens without --force, is that despite giving that error it ran the hooks anyway and then the units get into a bad state. Another 'prepare' without --force then fails.

A second prepare with --force also fails.

It reproduces 100% of the time with my updated bundle in AWS; I have tested it.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi, I am pretty sure those units are active after `juju upgrade-series 0 prepare jammy`,
because I was able to run `juju upgrade-series 0 prepare jammy --force`.
If the units were in an error state, you wouldn't be able to run this step.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Actually, I just tested it again to confirm it.

https://pastebin.ubuntu.com/p/NCXqp7bfzs/

Revision history for this message
Trent Lloyd (lathiat) wrote (last edit ):

Right, it seems for some reason this issue triggers in some cases and not in others.

It seems that, for whatever reason, the lxd-profile-subordinate charm doesn't trigger the issue, but other real charms, including nrpe and hacluster, do.

So it would be great if you can try my hacluster example in AWS again, using the details I provided in comment #7.

Then we can figure out why it only triggers with some charms and not others.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

Hi,
I'd appreciate it if you could provide the exact steps, with juju status output and logs, in a pastebin when you reproduce it,
just like what I did above.

Revision history for this message
Trent Lloyd (lathiat) wrote :

Kelvin,

I have again recreated it. I'd like to reiterate that this is 100% reproducible every time I do it with these charms, and that this also happened in a production environment, causing quite an impact.

This issue (among others) will cause problems for anyone upgrading an Ubuntu OpenStack cloud from Focal to Jammy, which is likely to become more and more common soon. This is a high-impact issue; please can you take some extra time to comprehensively try my reproducer? It's fairly simple and only takes about 30 minutes at most, and most of that time is waiting.

I understand in the first case you didn't have a matching VIP (that was an oversight on my part), but I have otherwise given very detailed reproduction instructions which you haven't attempted again.

To assist with that, I have modified my original reproducer script to automatically extract and set an appropriate VIP address from "juju subnets", to ensure it will deploy cleanly in anyone's AWS environment. You will need to use us-east-1a, or otherwise change the AZ in all of the commands and in the bundle file to another region's specific AZ. We need to use a specific AZ because the 'vip' config option has to match the VPC subnet of the AZ we deploy into.

I have attached the following:
- lp2008509-aws-reproduction-terminal.txt - complete terminal output of reproducing the issue with the below script
- juju-crashdump-414db7ba-6214-41f7-aed5-85a5edf8f03a.tar.xz - juju-crashdump of the environment after the failed upgrade-machine
- juju-backup-20230704-090223.tar.gz - juju controller backup after the failed upgrade-machine

We need two outcomes:
- First, we need a workaround for once someone gets into this situation: how to get out of it without removing the broken unit. Removing and replacing units is not trivial in many OpenStack deployments (it's very disruptive), and people seem almost certain to attempt this and hit the issue even after it's fixed if they haven't upgraded Juju.
- Second, we need a fix, including a backport to 2.9 (this reproduces on 2.9.43, 3.1.2 and 3.2.0 all the same).

# Revised reproducer script
# Requires 'jq' and 'python3' installed

juju bootstrap aws aws --bootstrap-constraints "instance-type=t2.micro arch=amd64 zones=us-east-1a"

juju add-model lp2008509

juju set-model-constraints instance-type=t2.micro arch=amd64 zones="us-east-1a"

# Pick the VPC subnet that belongs to us-east-1a, skipping entries whose
# provider-id contains "INFAN" (Juju's FAN overlay subnets)
VPC_SUBNET=$(juju subnets --format=json | jq -r '.subnets | to_entries[] | select(.value.zones[] == "us-east-1a" and (.value."provider-id" | contains("INFAN") | not)) | .key')

# Choose a random host address inside that subnet to use as the VIP
# (it only needs to be a valid address in the subnet, not a reachable VIP)
VPC_VIP=$(python3 -c 'import random, ipaddress, sys; print(str(random.choice(list(ipaddress.ip_network(sys.argv[1]).hosts()))))' $VPC_SUBNET)

echo VPC Subnet: ${VPC_SUBNET}, VPC VIP: ${VPC_VIP}

# Note that this file has had the 'vip' field removed, compared to the one originally uploaded. Be sure to remove it from yours.
juju deploy ./keystone-focal-yoga.yaml

juju config keystone vip=${VPC_VIP}

juju wait-for application keystone --query='name=="keystone" && (status=="active" || status=="idle")'

juju status

juju upgrade-machine 0 prepare ubuntu@22.04

Changed in juju:
milestone: 2.9.44 → 2.9.45
Revision history for this message
Erik Lönroth (erik-lonroth) wrote :

Seems I'm hitting this bug as well.

Errors from the machine log (lxd):

2023-07-12 15:30:52 ERROR juju.worker.uniter.operation runhook.go:208 error updating workload status before pre-series-upgrade hook: upgrade series status "prepare running"
2023-07-12 15:30:52 ERROR juju.worker.uniter agent.go:31 resolver loop error: executing operation "run pre-series-upgrade hook" for acme/2: upgrade series status "prepare running"
2023-07-12 15:30:52 ERROR juju.worker.dependency engine.go:695 "uniter" manifold worker returned unexpected error: executing operation "run pre-series-upgrade hook" for acme/2: upgrade series status "prepare running"

This is the status

Unit Workload Agent Machine Public address Ports Message
haproxy-dataplane-api/0* active idle 2 192.168.211.177 Ready
  acme/1* active idle 192.168.211.177 Ready
  haproxy-stick-tables-exporter/0* active idle 192.168.211.177 Service running.
haproxy-dataplane-api/1 active idle 5 192.168.211.46 Ready
  acme/2 active failed 192.168.211.46 Ready
  haproxy-stick-tables-exporter/1 active idle 192.168.211.46 Service running.
prometheus/0* active idle 4 192.168.211.82 9090/tcp,12321/tcp Ready

Notice it's the agent that enters "failed", and we can't get out of this error.

Changed in juju:
status: Triaged → In Progress
Revision history for this message
Trent Lloyd (lathiat) wrote :

Some follow-up items:
- This needs to be committed to main as well (not just 2.9) and to all other supported releases (3.x).

- This hopefully also fixes the other series-upgrade bug where it doesn't get the correct charm build, but that needs confirmation: https://bugs.launchpad.net/juju/+bug/2008248

- We really need a workaround to get out of this state once it happens. Currently we have no way other than removing the unit, which is not great in a production scenario; replacing a unit should be easy in theory but in practice is problematic for many reasons, so a workaround to get out of this state would be very helpful for support.

- Peter: the OpenStack and other documentation needs a warning, with emphasis, that you must upgrade to a fixed version :) I would say in general make it a hard requirement to install the latest point release of a supported Juju version, regardless of this bug.

Is it too late to get this into 2.9.44? The release status is not totally clear - it has some milestone bugs still tagged in progress currently.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

Hi Trent,
- the fix landed in 2.9, which means all 2.9+ releases will have it;
- #2008248 should be fixed as well (let me know if it still has problems);
- what you can do is purge the unit state in Mongo (see the sketch after this list for one way to reach the Mongo shell):
db.unitstates.update({_id: "<model-uuid>:u#<primary-unit>#charm"}, {$unset: {"uniter-state": ""}})
db.unitstates.update({_id: "<model-uuid>:u#<subordinate-unit>#charm"}, {$unset: {"uniter-state": ""}})
# then wait until those errors get cleared, or just restart the agents.
juju upgrade-series 0 prepare jammy --force
juju upgrade-series 0 complete

- yeah, probably 2.9.45
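
For completeness, one common way to reach the controller's Mongo shell to run those db.unitstates commands (a sketch using typical 2.9 defaults; the agent path, user tag and port here are assumptions and may differ on your controller):

juju ssh -m controller 0
# then, on the controller machine:
user=$(sudo awk '/^tag:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)
pass=$(sudo awk '/^statepassword:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)
mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin -u "$user" -p "$pass" localhost:37017/juju
# on controllers where MongoDB comes from the juju-db snap, use juju-db.mongo instead of mongo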

Changed in juju:
status: In Progress → Fix Committed
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Trent Lloyd (lathiat) wrote :

Seems fixed in 2.9.45, 3.1.6, 3.2.2 and will be in 3.3.0 (not yet released)

$ git tag --contains d67fdb888d622c1460bd191e858830fd7253ab43

juju-2.9.45
juju-3.1.6
juju-3.2.2
juju-3.2.3
juju-3.3-beta1
v2.9.45
v3.1.6
v3.2.3
v3.3-beta2
