Switching charmhub channels with default config for the snap version can lead to charm errors (was: Upgrading vault from charmstore stable to 1.7/stable fails)

Bug #2007587 reported by Diko Parvanov
This bug affects 3 people
Affects                Status         Importance  Assigned to     Milestone
OpenStack Charm Guide  Fix Released   Undecided   Robert Gildein  -
vault-charm            Fix Committed  High        Robert Gildein  -
vault-charm 1.5        In Progress    Undecided   Robert Gildein  -
vault-charm 1.6        In Progress    Undecided   Robert Gildein  -
vault-charm 1.7        In Progress    Undecided   Robert Gildein  -
vault-charm 1.8        In Progress    Undecided   Robert Gildein  -

Bug Description

Upgrading the vault charm from charmstore latest stable revision 54 to ch:vault 1.7/stable fails

The vault channel was configured as stable on the charmstore charm, which apparently resolves to tracking: 1.10/stable

Changing the charm to 1.7/stable also changes the tracking to 1.7/stable. Trying to downgrade manually (I don't think this is a good idea at all):

snap refresh --amend --channel=1.7/stable vault
error: cannot refresh "vault": snap "vault" has running apps (vault), pids: 212762

service vault stop followed by the refresh command works fine.
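
Putting the two together, the full manual sequence on one unit is (a sketch only; it assumes root access via sudo on the unit, and the downgrade caveat above still applies):

  juju ssh vault/0
  sudo service vault stop                                # stop vault so snapd will allow the refresh
  sudo snap refresh --amend --channel=1.7/stable vault   # the unit comes back sealed afterwards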

Traceback:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-vault-0/charm/reactive/vault_handlers.py", line 188, in snap_refresh
    snap.refresh('vault', channel=channel)
  File "/var/lib/juju/agents/unit-vault-0/charm/lib/charms/layer/snap.py", line 113, in refresh
    _refresh_store(snapname, **kw)
  File "/var/lib/juju/agents/unit-vault-0/charm/lib/charms/layer/snap.py", line 421, in _refresh_store
    out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['snap', 'refresh', '--amend', '--channel=1.7/stable', 'vault']' returned non-zero exit status 1.

testing:

juju deploy cs:vault
Located charm "vault" in charm-store, revision 54

juju config vault channel
stable

juju ssh vault/0 snap info vault | grep tracking
tracking: 1.10/stable

Tags: bseng-1077
Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 2007587] [NEW] Upgrading vault from charmstore stable to 1.7/stable fails

> The vault channel was configured as stable on the charmstore charm,
> which apparently goes to tracking: 1.10/stable

cs:vault tracks the snap 'stable' channel; vault at the moment has the track '1.10' set as the default
(instead of 'latest').

Diko Parvanov (dparv)
description: updated
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote : Re: Upgrading vault from charmstore stable to 1.7/stable fails

This is an awkward outcome of the way the charmhub "works" with tracks and OpenStack's original plan of keeping the 'latest/stable' track compatible with Juju 2.8, which was not track-aware and originally worked with the charmstore (the predecessor of the charmhub).

Thus, as @freyes mentions, the latest/stable version of the charm is a Juju 2.8 compatible "old" 21.10 charm. Essentially, people should never install latest/stable; in this case it caused an installation of vault 1.10, which the 21.10 charm was never designed to install (so it's dumb luck that it worked).

Note that this is only a problem on focal and older. On jammy (base 22.04) there is no charm available on latest/stable.

The bug, as @freyes points out, is that the latest/stable charm tracks the 'stable' snap channel for vault. What is very curious is that, IIRC, it also doesn't allow upgrades to the snap: it prevents it from upgrading/restarting and so locks the snap to the version it was installed with. Unless you are just testing vault, the config value should be set to an explicit track (which is done automatically with the charmhub tracks) so that unexpected upgrades don't occur.
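
For example, pinning the snap via the charm's 'channel' config option (the track shown is illustrative; use whichever track is actually installed):

  juju config vault channel=1.10/stable   # explicit track, so a refresh won't move the snap unexpectedly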

Anyway, we need to work out a strategy for what to do with the charmcraft tracks; I think we should set the default to something other than "latest".

Revision history for this message
Diko Parvanov (dparv) wrote :

latest/stable is what most people still running charms from the charmstore will be on, and they will all have to migrate to charmhub. We have *not* used latest/stable from charmhub, but latest from the charmstore.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I can understand the frustration; it's quite tricky to sort these things out.

> We have *not* used latest/stable from charmhub, but latest from charmstore.

If you've deployed cs:vault using juju 2.8 in the last 12 to 18 months then it *is* from the charmhub, just via a compatibility layer, and as such it will be from the latest/stable track. So if you got vault 1.10 with the latest/stable charm, it was probably a recent install and would have been on focal, as there is no latest/stable vault charm that supports jammy.

As previously noted, we will need to sort out the latest/stable channel as it doesn't support the current LTS.

Revision history for this message
Diko Parvanov (dparv) wrote :

And one more thing here: the upgrade-charm hook fails when going to 1.7/stable, and even from 1.7/stable to 1.8/stable, with:

unit-vault-0: 06:48:27 WARNING unit.vault/0.upgrade-charm File "/var/lib/juju/agents/unit-vault-0/charm/lib/charms/layer/snap.py", line 421, in _refresh_store
unit-vault-0: 06:48:27 WARNING unit.vault/0.upgrade-charm out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
unit-vault-0: 06:48:27 WARNING unit.vault/0.upgrade-charm File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
unit-vault-0: 06:48:27 WARNING unit.vault/0.upgrade-charm return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
unit-vault-0: 06:48:27 WARNING unit.vault/0.upgrade-charm File "/usr/lib/python3.10/subprocess.py", line 524, in run
unit-vault-0: 06:48:27 WARNING unit.vault/0.upgrade-charm raise CalledProcessError(retcode, process.args,
unit-vault-0: 06:48:27 WARNING unit.vault/0.upgrade-charm subprocess.CalledProcessError: Command '['snap', 'refresh', '--amend', '--channel=1.8/stable', 'vault']' returned non-zero exit status 1.
unit-vault-0: 06:48:27 ERROR juju.worker.uniter.operation hook "upgrade-charm" (via explicit, bespoke hook script) failed: exit status 1

So the charm needs to stop the vault service first to change the snap channel. That's a charm bug for sure.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

For future travellers:

Pre-charmhub, upgrade-charm never had the possibility of changing the snap channel; that was done via the config-changed hook.

For charmhub charms, however, a charm upgrade (refresh) can take two forms:

* an 'in channel' refresh where just the charm is upgraded, but no channel is switched.
* a 'change channel' refresh where the charm is changed AND the channel is switched.

The issue with the second form is that the defaults for the config items *may* change; in this case the default snap channel changes with the charm channel, and this causes a snap refresh.
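
Concretely, the two forms correspond to (illustrative commands; the channel value is an example):

  juju refresh vault                        # in-channel: new charm revision, same channel, defaults unchanged
  juju refresh vault --channel 1.8/stable   # change-channel: charm AND channel switch; defaults may change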

The issue here is that restarting vault will seal the unit, and if all the units are sealed (e.g. during a charm refresh that changes the channel while the default config is in use), then the whole vault application will become sealed, which would almost certainly cause a control plane outage for the applications that use vault (e.g. on OpenStack, OVN and the API charms).

What we may want to do is detect the change in the default config (snap version) but not actually do the refresh; instead, put something on the status line and enter the blocked state to indicate that the default version for the charm and the installed version differ. This won't happen with a user-configured version, as that won't change during the refresh.

Then, to resolve the situation, each unit could be refreshed and then unsealed progressively, preventing the outage.

summary: - Upgrading vault from charmstore stable to 1.7/stable fails
+ Switching charmhub channels with default config for the snap version can
+ lead to charm errors (was: Upgrading vault from charmstore stable to
+ 1.7/stable fails)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Triaged to high. The workaround should be (but I've not tested it) to set the config for the snap version explicitly prior to doing a switch; then the underlying snap won't be changed during the refresh. However, the charmhub channel then won't match the underlying snap version.
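
In concrete terms, the workaround would be something like the following (untested at the time of this comment; a verified version appears later in this thread):

  juju config vault channel=stable                            # make the default explicit so the new charm's default doesn't apply
  juju refresh vault --channel 1.7/stable --switch ch:vault   # the charm switches; the snap is left untouched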

Changed in vault-charm:
importance: Undecided → High
status: New → Triaged
Andrea Ieri (aieri)
tags: added: bseng-1077
Revision history for this message
Billy Olsen (billy-olsen) wrote :

There are plans in Juju to treat snap resources for machine operators as it does OCI resources for Kubernetes operators - that is, a 1:1 mapping between the charm revision and the resource revision. This feels like the most appropriate way to handle this from the charm.

The current problem is that there is just a broad disconnect between the snap and charm stores/revisions. It is far better to have the charm/snap pairing and deliver it that way. That's a future juju improvement.

The workaround should work, as Alex noted in comment #7, but it then has the problem that the payload and the charm differ. That's actually probably okay for the time being.

Changed in vault-charm:
assignee: nobody → Mert Kirpici (mertkirpici)
Revision history for this message
Mert Kirpici (mertkirpici) wrote (last edit ):

A couple of observations and ideas:

- First of all, I verified that the workaround in #7 works. Here is the procedure I followed to perform a successful upgrade:

  juju config vault channel=stable
  juju refresh vault --channel 1.7/stable --switch ch:vault # charm reports active idle, however snap refresh did not run. snap channel inconsistent with charm channel.
  juju run-action --wait vault/leader pause
  juju config vault --reset channel # snap refresh succeeds this time. no more inconsistency. vault service starts without resume action. unit reports sealed.
  # Unseal the vault. Unit reports active/idle.
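
For completeness, the unseal step looks like this (a sketch; it assumes the vault CLI is available on the unit and a plain-HTTP listener on the default port - adjust VAULT_ADDR for a TLS deployment):

  juju ssh vault/0
  export VAULT_ADDR='http://127.0.0.1:8200'
  vault operator unseal   # prompts for an unseal key; repeat until the key threshold is reached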

- As Diko mentioned in the report, the unhandled exception arises from the `snap refresh` operation returning a non-zero exit code due to the existence of a running vault process on the system, which I believe is the documented behaviour of snaps. This means that the "channel" config option itself is broken when changed against a running vault service, regardless of any charm upgrade scenario (verified that this _is_ the case).

- In an upgrade scenario, even if the charm detects this change, refuses to refresh the snap, and informs the operator to step in and refresh the snap manually (via an action, perhaps?), at the end of the day the vault will need to be unsealed again by the operator who triggered the change in the first place.

- In light of the above, I propose to fix this behaviour by actually restarting the vault service during the config-changed and upgrade-charm hooks if the config option changes, resulting in a sealed vault unit (see the sketch below). I believe that, if documented properly in the charm config, this is the cleanest way to go, since we cannot avoid re-sealing one way or the other, and this fix will not introduce inconsistency between snap and charm channels.
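
Per unit, the proposal amounts to something like the following (a sketch only; <configured-channel> stands for whatever the charm's 'channel' option resolves to):

  service vault stop                                          # avoid the "has running apps" refresh error
  snap refresh --amend --channel=<configured-channel> vault   # snapd restarts the service; it comes back sealed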

Changed in vault-charm:
status: Triaged → In Progress
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I tend to agree; if the snap refresh code determines that the snap needs to be refreshed (changed version), then it should stop the service and then refresh the snap. This will result in a sealed unit, but that is more correct behaviour than the charm going into an error state.

Other options, such as blocking, just mean that the operator will need to take another action, such as pausing the charm, which would then allow the unit to refresh the snap, which would then restart the service and seal the unit, making it a two-step process.

Therefore, we have two options in the snap refresh code:

1. Detect if the process is running, stop it, and then allow the refresh to continue.

2. Detect if the process is running, don't stop it, and go into a blocked state showing that the process is running and that the snap would be refreshed due to configuration differences. Then, when the unit is paused, the refresh and seal would happen.

Option 2 may be preferable, but is more complicated to communicate and understand, I think.
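
From the operator's side, option 2 would look roughly like this (a sketch, reusing the pause action the charm already provides):

  juju status vault                      # unit shows blocked: snap refresh pending
  juju run-action --wait vault/0 pause   # stopping vault lets the pending refresh proceed; the unit seals
  # unseal vault/0 before moving on to vault/1, so at least one unit stays unsealed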

Changed in vault-charm:
assignee: Mert Kirpici (mertkirpici) → Robert Gildein (rgildein)
Revision history for this message
Robert Gildein (rgildein) wrote :

Hi @ajkavanagh, I proposed a solution here [1], where I used the blocked
state and added an action to refresh the snap manually so there will be no outage.
Here [2] is how I tested it.

---
[1]: https://review.opendev.org/c/openstack/charm-vault/+/880858
[2]: https://pastebin.ubuntu.com/p/JrTGpdWQdP/

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Thanks for taking a look at this @rgildein. We had a fairly long discussion about what makes most sense for this charm, and it turns out there are a number of competing ideas about what should be done! However, no consensus was formed. Your patch linked from #11 is one of the options. Thus, I'm going to write up the various options, post to the discourse on charmhub.io, and link it from a couple of mailing lists to gather feedback and gauge the best course of action.

Revision history for this message
Robert Gildein (rgildein) wrote :

Thanks, Alex, for the information. The solution I provided seems reasonable to me, but
at the same time it may not be entirely desirable.
When you create this discussion, can you add me to it or add a link here? Thanks.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

So, a further update here. After some more discussions, it was decided that we should drop the action idea and shift to: "if the operator does a config-changed or charm upgrade (refresh), then they know what they are doing, and so it is okay to stop the service, do the snap refresh, and then restart the service in a sealed state."

Obviously, this will happen across all the units, so all of the vault application's units will be sealed at that point, requiring the operator to unseal the vault.

So, we should definitely ensure that there is a warning in the config.yaml of the charm, and also update the charm-guide to include some commentary around vault charm upgrades.
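
With that behaviour, a channel switch becomes a single step followed by unsealing (a sketch; the channel value is an example):

  juju refresh vault --channel 1.8/stable   # the snap is refreshed on every unit; all units restart sealed
  # then unseal each unit in turn, as in the verified procedure earlier in this thread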

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This should be fixed by: https://review.opendev.org/c/openstack/charm-vault/+/880858 - currently only on the latest/edge channel.

Changed in vault-charm:
status: In Progress → Fix Committed
Changed in charm-guide:
status: New → In Progress
assignee: nobody → Robert Gildein (rgildein)
Revision history for this message
Peter Matulis (petermatulis) wrote (last edit ):
Changed in charm-guide:
status: In Progress → Fix Released