upgrade-series prepare puts units into failed state if a subordinate does not support the target series

Bug #2008509 reported by Diko Parvanov
This bug affects 3 people
Affects                Status        Importance  Assigned to       Milestone
Canonical Juju         Fix Released  High        Yang Kelvin Liu
OpenStack Charm Guide  Triaged       High        Peter Matulis

Bug Description

Using a 2.9.38 client, controller and model. While upgrading focal to jammy with upgrade-series prepare, with an lldpd subordinate charm present, execution of the pre-series-upgrade hooks started but failed, because that charm doesn't support jammy. All subordinate units plus the principal units went into an endless loop with error status and couldn't be fixed/resolved.

Juju shouldn't trigger prepare in this case; pre-checks are necessary and should prevent the operator from doing it unless --force is specified.
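
A minimal sketch of the intended behaviour (the error wording is borrowed from the hacluster case in the comments below; the charm name is a placeholder):

juju upgrade-series 9 prepare jammy -y
# expected: ERROR charm "<subordinate>" does not support jammy, force not used
# no pre-series-upgrade hooks should run unless the operator explicitly passes --force:
juju upgrade-series 9 prepare jammy -y --force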

Now the cloud is in a blocked state and the charms can't complete:

nova-compute-kvm-sriov/6 blocked idle 9 10.11.2.184 Ready for do-release-upgrade and reboot. Set complete when finished.
  ceilometer-agent/0 blocked idle 10.11.2.184 Services not running that should be: memcached, ceilometer-agent-compute
  ovn-chassis-sriov/1 blocked failed 10.11.2.184 Ready for do-release-upgrade and reboot. Set complete when finished.

juju upgrade-series 9 complete
ERROR machine "9" can not complete, it is either not prepared or already completed

juju upgrade-series 9 prepare jammy -y
ERROR Upgrade series is currently being prepared for machine "9".

Diko Parvanov (dparv)
description: updated
Changed in juju:
importance: Undecided → High
milestone: none → 2.9.43
status: New → Triaged
tags: added: subordinate upgrade-charm
Revision history for this message
Trent Lloyd (lathiat) wrote :

I hit this issue in production with series-upgrade of a Yoga OpenStack cloud from focal->jammy. The issue for us was the hacluster charm. The 2.0.3 charm only supports focal while the 2.4 charm supports both focal+jammy. It still occurs on 2.9.42.

It's not clear from the original description, but the "upgrade-series prepare" does error with the following, after typing yes to confirm the upgrade:
"ERROR charm "hacluster" does not support jammy, force not used"

However the pre-series-upgrade hooks run in the background anyway, even though the juju client exits after that error. The hacluster unit then goes into the failed state.

In our debugging, db.machineUpgradeSeriesLocks is empty. You can re-run prepare with --force, which then creates a lock, however the units are still stuck in a failed state. It seems the failed run leaves things in a state that the subsequent transaction won't accept, perhaps because the hooks already ran.
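
For reference, this is roughly how we inspected the lock state (a sketch, run from a Mongo shell attached to the controller's "juju" database):

db.machineUpgradeSeriesLocks.find().pretty()
// returned no documents here, i.e. no upgrade-series lock existed for the machine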

= Workaround =
If you only attempted the "prepare" on a single unit, you can force-remove that unit, scale it back out, upgrade the hacluster charm and then proceed with a series upgrade. I was not able to find a way to get the broken unit out of the broken state.

juju remove-unit keystone/0 --force
juju add-unit keystone
juju upgrade-charm keystone --channel 2.4/stable

= Reproducer =
You can deploy a simple bundle with keystone and hacluster to reproduce the issue. I have attached the bundle as keystone-focal-yoga.yaml

juju add-model keystone1
juju deploy ./keystone-focal-yoga.yaml
juju upgrade-series 0 prepare jammy
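
What you should then see (summarising the error and status reported elsewhere in this bug; exact wording varies by Juju version):
# the prepare command above errors with something like
#   ERROR charm "hacluster" does not support jammy, force not used
# ...yet `juju status` then shows the hacluster subordinate's agent as "failed",
# because the pre-series-upgrade hooks were run anyway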

= Expectations =

- This is a critical issue that needs prioritising for a new 2.9.43 release.

- However, we also need to determine whether we can easily fix this situation once it has happened: people are very likely to get stuck, and removing and re-adding the broken unit is very error prone in practice and best avoided if possible.

Revision history for this message
Trent Lloyd (lathiat) wrote :
summary: - Juju should restrict upgrade-series prepare if a subordinate charm
- doesn't support the new series
+ upgrade-series prepare puts units into failed state if a subordinate
+ does not support the target series
Revision history for this message
Trent Lloyd (lathiat) wrote :

I have confirmed that this also happens on a 3.2-beta3 controller with a freshly deployed version of the above reproducer.

3.2/beta: 3.2-beta3 2023-04-24 (22984) 80MB -

Revision history for this message
Felipe Reyes (freyes) wrote :

Adding a task for the charm-guide since some guidance should be provided in the series upgrade page - https://docs.openstack.org/charm-guide/latest/admin/upgrades/series.html

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 2008509] Re: upgrade-series prepare puts units into failed state if a subordinate does not support the target series

On Tue, 2023-05-09 at 15:46 +0000, Felipe Reyes wrote:
> Adding a task for the charm-guide since some guidance should be provided
> in the series upgrade page - https://docs.openstack.org/charm-
> guide/latest/admin/upgrades/series.html

The guidance should be around identifying which charms are focal-only, which charms support both focal and jammy, and which tracks users are expected to be on when going from focal to jammy.
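
One quick way to check this is `juju info`, which shows a charm's channels and the series/bases they support (available from 2.9 onwards; the exact output layout differs between Juju versions, so treat this as a sketch):

juju info hacluster
# compare the series supported by the track you are on (e.g. 2.0.x, focal only)
# with the track you would move to (e.g. 2.4, focal and jammy)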

PS: be aware of bug https://bugs.launchpad.net/juju/+bug/2008248

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Felipe,
The bundle above doesn't work for me. Does it require an OpenStack cloud?

unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed Traceback (most recent call last):
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/ha-relation-changed", line 937, in <module>
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed main()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/ha-relation-changed", line 930, in main
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed hooks.execute(sys.argv)
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/hookenv.py", line 963, in execute
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed self._hooks[hook_name]()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1896, in wrapped_f
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed return restart_on_change_helper(
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/host.py", line 863, in restart_on_change_helper
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed r = lambda_f()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1897, in <lambda>
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed (lambda: f(*args, **kwargs)),
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/ha-relation-changed", line 603, in ha_changed
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed CONFIGS.write_all()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 325, in write_all
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed self.write(k)
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 313, in write
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed _out = self.render(config_file).encode('UTF-8')
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 273, in render
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed ctxt = ostmpl.context()
unit-keystone-0: 09:36:36 WARNING unit.keystone/0.ha-relation-changed File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/templating.py", line 107, in context
unit-keystone-0: 09:36:36 WAR...


Changed in charm-guide:
importance: Undecided → High
status: New → Triaged
Changed in charm-guide:
assignee: nobody → Peter Matulis (petermatulis)
Revision history for this message
Trent Lloyd (lathiat) wrote :

The test bundle is using the hacluster charm, which needs to set up a Virtual IP on the interface (using corosync/pacemaker).

To do this it requires the following options to be set, specific to the deployed environment. Apologies, I didn't think about that in my original description.

(1) vip_iface is set to the real interface (for AWS: eth0)
(2) It also requires that all 3 machines have an IP address in the same subnet, and an IP in that subnet is set as the config option "vip". Since AWS uses different subnets for each AZ, we must lock the application to a single region's availability zone.

I made it work on AWS by doing the following

(1) Change the keystone constraints to include "zones=us-east-1b" or any specific AZ (e.g. ap-southeast-2a). For my usage, I also set instance-type=t2.micro as a constraint for both mysql-innodb-cluster and keystone.

(2) Find the private VPC subnet for that specific AZ in your AWS console under VPC -> Subnets; for me it was:
172.31.32.0/20 (172.31.32.0 - 172.31.47.255; you can calculate the range with sipcalc, as sketched below)

(3) Modify 'vip' to be an IP in this subnet that is not in use, e.g. in my case I could use: 172.31.47.240
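
For step (2), a quick way to get the usable range (assuming sipcalc is installed; any subnet calculator works; output abridged):

sipcalc 172.31.32.0/20
# Usable range  - 172.31.32.1 - 172.31.47.254
# pick any unused address in that range for the "vip" option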

Then the model should deploy and reproduce the issue. Otherwise I'd suggest creating and publishing a test subordinate charm with 20.04 support only, and trying with that.

Note: This VIP won't actually work in an AWS environment, but that doesn't matter. We just have to convince the charm to deploy so that the juju issue can be reproduced. There is no need to access or use the deployed application.

If that doesn't work, I'd suggest publishing a charm that supports 20.04 only, but one without base-specific builds (otherwise you'll separately hit Bug #2008248).

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Diko, Trent,

From my testing, Juju does check the supported series of both the principal and subordinate charms before running the `pre-series-upgrade` hook.
So the only way to reproduce this is to run the `upgrade-series` command with `--force`.
Can you confirm whether or not you ran the command with --force?

Here are the steps I did: https://pastebin.ubuntu.com/p/K66Svt8KpT/

So if you unfortunately run into this situation (the charm's `pre-series-upgrade` hook failed) with --force,
we have to apply a workaround to fix it (Juju doesn't currently support aborting the process).

Changed in juju:
milestone: 2.9.43 → 2.9.44
Revision history for this message
Trent Lloyd (lathiat) wrote (last edit ):

In your steps (https://pastebin.ubuntu.com/p/K66Svt8KpT/), after you get this error (without --force):
ERROR series "jammy" not supported by charm "local:focal/lxd-profile-subordinate-0", supported series are: quantal, xenial, bionic, focal

Did you actually check 'juju status'?

My issue, which happens without --force, is that despite giving that error it ran the hooks anyway and then the units get into a bad state. Another 'prepare' without --force then fails.

A second prepare with --force also fails.

It reproduces 100% of the time with my updated bundle in AWS; I have tested it.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi, I am pretty sure those units are active after `juju upgrade-series 0 prepare jammy`,
because I was able to run `juju upgrade-series 0 prepare jammy --force`.
If the units were in an error state, you wouldn't be able to run this step.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Actually, I just tested it again to confirm it.

https://pastebin.ubuntu.com/p/NCXqp7bfzs/

Revision history for this message
Trent Lloyd (lathiat) wrote (last edit ):

Right, it seems for some reason this issue triggers in some cases and not in others.

It seems that, for whatever reason, the lxd-profile-subordinate charm doesn't trigger the issue, but other real charms, including nrpe and hacluster, do.

So it would be great if you can try my hacluster example in AWS again, using the details I provided in comment #7.

Then we can figure out why it only triggers with some charms and not others.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

Hi,
I'd appreciate it if you could provide the exact steps, with juju status output and logs, in a pastebin when you reproduce it,
just like what I did above.

Revision history for this message
Trent Lloyd (lathiat) wrote :

Kelvin,

I have again recreated it. I'd like to reiterate that this is 100% reproducible every time I do it with these charms, and that this also happened in a production environment, causing quite an impact.

This issue (among others) will cause problems for anyone upgrading an Ubuntu OpenStack cloud from Focal to Jammy, which is likely to become more and more common soon. This is a high-impact issue; please can you take some extra time to comprehensively try my reproducer? It's fairly simple and only takes about 30 minutes at most, and most of that time is waiting.

I understand in the first case you didn't have a matching VIP (that was an oversight on my part), but I have otherwise given very detailed reproduction instructions which you haven't attempted again.

To assist with that, I have modified my original reproducer script to automatically extract and set an appropriate VIP address from "juju subnets", to ensure it will deploy cleanly in anyone's AWS environment. You will need to use us-east-1a, or otherwise change the AZ in all of the commands and in the bundle file to another region's specific AZ. We need to use a specific AZ because the 'vip' config option has to match the VPC subnet of the AZ we deploy into.

I have attached the following:
- lp2008509-aws-reproduction-terminal.txt - complete terminal output of reproducing the issue with the below script
- juju-crashdump-414db7ba-6214-41f7-aed5-85a5edf8f03a.tar.xz - juju-crashdump of the environment after the failed upgrade-machine
- juju-backup-20230704-090223.tar.gz - juju controller backup after the failed upgrade-machine

We need two outcomes:
- First, we need a workaround for once someone gets into this situation: how to get out of it without removing the broken unit. Removing and replacing units is not trivial in many OpenStack deployments (it's very disruptive), and people seem almost certain to attempt this and hit the issue even after it's fixed if they haven't upgraded Juju.
- Second, we need a fix, including a backport to 2.9 (this reproduces on 2.9.43, 3.1.2 and 3.2.0 all the same).

# Revised reproducer script
# Requires 'jq' and 'python3' installed

juju bootstrap aws aws --bootstrap-constraints "instance-type=t2.micro arch=amd64 zones=us-east-1a"

juju add-model lp2008509

juju set-model-constraints instance-type=t2.micro arch=amd64 zones="us-east-1a"

# Pick the VPC subnet that belongs to us-east-1a, skipping entries whose
# provider-id contains "INFAN" (Juju's FAN overlay subnets)
VPC_SUBNET=$(juju subnets --format=json | jq -r '.subnets | to_entries[] | select(.value.zones[] == "us-east-1a" and (.value."provider-id" | contains("INFAN") | not)) | .key')

# Choose a random host address inside that subnet to use as the VIP
# (it only needs to be a valid address in the subnet, not a reachable VIP)
VPC_VIP=$(python3 -c 'import random, ipaddress, sys; print(str(random.choice(list(ipaddress.ip_network(sys.argv[1]).hosts()))))' $VPC_SUBNET)

echo VPC Subnet: ${VPC_SUBNET}, VPC VIP: ${VPC_VIP}

# Note that this file has had the 'vip' field removed, compared to the one originally uploaded. Be sure to remove it from yours.
juju deploy ./keystone-focal-yoga.yaml

juju config keystone vip=${VPC_VIP}

juju wait-for application keystone --query='name=="keystone" && (status=="active" || status=="idle")'

juju status

juju upgrade-machine 0 prepare ubuntu@22.04

Changed in juju:
milestone: 2.9.44 → 2.9.45
Revision history for this message
Erik Lönroth (erik-lonroth) wrote :

Seems I'm hitting this bug as well.

Errors from the machine log (lxd):

2023-07-12 15:30:52 ERROR juju.worker.uniter.operation runhook.go:208 error updating workload status before pre-series-upgrade hook: upgrade series status "prepare running"
2023-07-12 15:30:52 ERROR juju.worker.uniter agent.go:31 resolver loop error: executing operation "run pre-series-upgrade hook" for acme/2: upgrade series status "prepare running"
2023-07-12 15:30:52 ERROR juju.worker.dependency engine.go:695 "uniter" manifold worker returned unexpected error: executing operation "run pre-series-upgrade hook" for acme/2: upgrade series status "prepare running"

This is the status

Unit Workload Agent Machine Public address Ports Message
haproxy-dataplane-api/0* active idle 2 192.168.211.177 Ready
  acme/1* active idle 192.168.211.177 Ready
  haproxy-stick-tables-exporter/0* active idle 192.168.211.177 Service running.
haproxy-dataplane-api/1 active idle 5 192.168.211.46 Ready
  acme/2 active failed 192.168.211.46 Ready
  haproxy-stick-tables-exporter/1 active idle 192.168.211.46 Service running.
prometheus/0* active idle 4 192.168.211.82 9090/tcp,12321/tcp Ready

Notice it's the agent that enters "failed", and we can't get out of this error.

Changed in juju:
status: Triaged → In Progress
Revision history for this message
Trent Lloyd (lathiat) wrote :

Some follow-up items:
- This needs to be committed to main as well (not just 2.9) and to all other supported releases (3.x).

- This hopefully also fixes the other series-upgrade bug where it doesn't get the correct charm build, but that needs confirmation: https://bugs.launchpad.net/juju/+bug/2008248

- We really need a workaround to get out of this state once it happens. Currently we have no way other than removing the unit, which is not great in a production scenario; replacing a unit should be easy in theory but in practice is problematic for many reasons, so a workaround to get out of this state would be very helpful for support.

- Peter: the OpenStack and other documentation needs a warning, with emphasis, that you must upgrade to a fixed version :) I would say in general make it a hard requirement to install the latest point release of a supported Juju version, regardless of this bug.

Is it too late to get this into 2.9.44? The release status is not totally clear - it has some milestone bugs still tagged in progress currently.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

Hi Trent,
- the fix landed in 2.9, which means all 2.9+ releases will have it;
- #2008248 should be fixed as well (let me know if it still has problems);
- what you can do is purge the unit state in Mongo (see the sketch after this list for one way to reach the Mongo shell):
db.unitstates.update({_id: "<model-uuid>:u#<primary-unit>#charm"}, {$unset: {"uniter-state": ""}})
db.unitstates.update({_id: "<model-uuid>:u#<subordinate-unit>#charm"}, {$unset: {"uniter-state": ""}})
# then wait until those errors get cleared, or just restart the agents.
juju upgrade-series 0 prepare jammy --force
juju upgrade-series 0 complete

- yeah, probably 2.9.45
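
For completeness, one common way to reach the controller's Mongo shell to run those db.unitstates commands (a sketch using typical 2.9 defaults; the agent path, user tag and port here are assumptions and may differ on your controller):

juju ssh -m controller 0
# then, on the controller machine:
user=$(sudo awk '/^tag:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)
pass=$(sudo awk '/^statepassword:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)
mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin -u "$user" -p "$pass" localhost:37017/juju
# on controllers where MongoDB comes from the juju-db snap, use juju-db.mongo instead of mongo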

Changed in juju:
status: In Progress → Fix Committed
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Trent Lloyd (lathiat) wrote :

Seems fixed in 2.9.45, 3.1.6, 3.2.2 and will be in 3.3.0 (not yet released)

$ git tag --contains d67fdb888d622c1460bd191e858830fd7253ab43

juju-2.9.45
juju-3.1.6
juju-3.2.2
juju-3.2.3
juju-3.3-beta1
v2.9.45
v3.1.6
v3.2.3
v3.3-beta2
