vault leader fails on pre-series-upgrade hook, cannot connect to mysql

Bug #2007999 reported by Alexander Balderson
This bug affects 1 person
Affects: vault-charm
Status: Triaged
Importance: Undecided
Assigned to: Unassigned

Bug Description

While upgrading vault from focal to jammy, the vault leader unit went into an error state, preventing the pre-series-upgrade hook from finishing. The HA cluster unit and the mysql-innodb-cluster units all show that they have finished running the hook and are ready to go, but vault itself reports a failure because it couldn't connect to mysql:

2023-02-17 17:36:10 ERROR unit.vault/1.juju-log server.go:316 Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-vault-1/charm/reactive/vault_handlers.py", line 959, in publish_ca_info
    if client.is_sealed():
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/v1/__init__.py", line 268, in is_sealed
    return self.seal_status['sealed']
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/v1/__init__.py", line 260, in seal_status
    return self._adapter.get('/v1/sys/seal-status').json()
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 90, in get
    return self.request('get', url, **kwargs)
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 233, in request
    utils.raise_for_error(response.status_code, text, errors=errors)
  File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/utils.py", line 39, in raise_for_error
    raise exceptions.InternalServerError(message, errors=errors)
hvac.exceptions.InternalServerError: dial tcp 127.0.0.1:3306: connect: connection refused

It makes sense that it wouldn't be able to connect: since the mysql-router is ready for upgrade, it probably isn't forwarding traffic any longer.
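
As a rough illustration of the failure mode in the traceback (this is not the charm's actual code, nor a proposed fix): `publish_ca_info` calls `client.is_sealed()`, and the exception from that call propagates and fails the hook. A handler that tolerated the backend being down might wrap the check along these lines. `vault_ready` and `StubClient` are hypothetical names made up for this sketch; with the real hvac client, the exception to catch would presumably be `hvac.exceptions.InternalServerError`, as seen in the traceback.

```python
def vault_ready(client, backend_errors=(Exception,)):
    """Return True only if Vault answers and reports itself unsealed.

    `client` is anything exposing an hvac-style `is_sealed()` method.
    `backend_errors` is the exception type (or tuple of types) to treat
    as "storage backend unavailable". Both names are assumptions for
    this sketch, not part of the vault charm.
    """
    try:
        return not client.is_sealed()
    except backend_errors:
        # Vault responded with an error (e.g. it cannot reach MySQL):
        # report "not ready" instead of letting the hook fail.
        return False


class StubClient:
    """Hypothetical stand-in for hvac.Client, for illustration only."""

    def __init__(self, sealed=False, error=None):
        self._sealed = sealed
        self._error = error

    def is_sealed(self):
        if self._error is not None:
            raise self._error
        return self._sealed


if __name__ == "__main__":
    refused = ConnectionError(
        "dial tcp 127.0.0.1:3306: connect: connection refused"
    )
    print(vault_ready(StubClient(sealed=False)))
    print(vault_ready(StubClient(error=refused)))
```

Whether skipping the handler is the right behaviour during a series upgrade is a separate question; the sketch only shows where the exception could be caught.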

Alex Kavanagh (ajkavanagh) wrote :

Hi Alex, do you know whether the unit was paused prior to the series-upgrade? Also, does the log show which hook went into error, e.g. update-status, or was it the "pre-series-upgrade" hook?

Changed in vault-charm:
status: New → Incomplete
Bas de Bruijne (basdbruijne) wrote :

This is happening in the kubernetes upgrade testing, for which we use the following automation: https://github.com/charmed-kubernetes/jenkins/blob/b7f0d7537c2687305bf90f2ecee48bbf97ac350e/jobs/integration/utils.py. It looks like it does not pause the unit before the upgrade.

It happens in the post-series-upgrade hook. We see the following status log (from crashdump [0]):
---------------
21 Feb 2023 21:39:38Z juju-unit idle
21 Feb 2023 22:09:00Z workload active Unit is ready (active: true, mlock: enabled)
21 Feb 2023 22:13:05Z juju-unit executing running pre-series-upgrade hook
21 Feb 2023 22:13:07Z workload blocked Ready for do-release-upgrade and reboot. Set complete when finished.
21 Feb 2023 22:13:08Z juju-unit idle
21 Feb 2023 22:31:31Z juju-unit executing running post-series-upgrade hook
21 Feb 2023 22:33:37Z juju-unit error hook failed: "post-series-upgrade"
---------------

Looking at the history of kubernetes upgrade tests, we only seem to run into this problem when we upgrade the vault leader. If we upgrade a non-leader unit, the machine upgrades successfully. Does pausing the unit trigger a leader-elect?

[0] https://oil-jenkins.canonical.com/artifacts/e6d5336b-9863-4347-9ed2-67d23cad2230/generated/generated/kubernetes-maas/juju-crashdump-kubernetes-maas-2023-02-22-01.11.09.tar.gz

Alexander Balderson (asbalderson) wrote :

Bas beat me by 44 min :)

It is true that no pause is run. The automation runs
`juju upgrade-series ... prepare`,
then the upgrade itself (dist-upgrade via juju run),
and finally `juju upgrade-series complete`.

Should upgrade-series prepare pause the service, or is a pause actually required before the upgrade?
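
If a pause does turn out to be needed, the sequence might look roughly like the sketch below. It only builds the command lines and runs nothing. The function name, the machine id "1", the series argument, and the `juju run-action --wait <unit> pause` form (Juju 2.9 syntax) are assumptions for illustration, not taken from the automation; the release-upgrade step is passed in by the caller because its exact command is not given here. Whether an explicit resume is needed afterwards is not covered.

```python
import shlex


def series_upgrade_commands(machine, unit, series, upgrade_cmd, pause_first=True):
    """Build the ordered list of commands for one machine's series upgrade.

    All arguments are illustrative: `machine` is a Juju machine id,
    `unit` is the unit to pause, `series` the target series, and
    `upgrade_cmd` the caller-supplied release-upgrade command.
    """
    steps = []
    if pause_first:
        # Assumed Juju 2.9 action syntax; the charm is said to support pausing.
        steps.append(["juju", "run-action", "--wait", unit, "pause"])
    steps.append(["juju", "upgrade-series", machine, "prepare", series])
    steps.append(list(upgrade_cmd))
    steps.append(["juju", "upgrade-series", machine, "complete"])
    return steps


if __name__ == "__main__":
    # Placeholder upgrade step; the real automation does a dist-upgrade here.
    placeholder = ["echo", "release upgrade step goes here"]
    for cmd in series_upgrade_commands("1", "vault/1", "jammy", placeholder):
        print(shlex.join(cmd))
```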

I also confirmed with the k8s team that their regular upgrade testing does not involve vault in the model, so this is very similar to the issues we've bumped into with the o7k upgrade automation and services that aren't normally run by o7k.

Finally, the initial test I filed the bug from did fail during pre-series-upgrade, but most of the failures we see are during post-series-upgrade. In both cases, mysql-router is paused by the upgrade-series prepare call, so that's why we see the error.

Changed in vault-charm:
status: Incomplete → New
Alex Kavanagh (ajkavanagh) wrote :

Alex, Bas: I've triaged this, but I'm (currently) undecided on importance. The unit can be paused, but I've not yet had a chance to test (manually) what happens during series upgrade. The charm also doesn't upgrade successfully from 1.7/stable to 1.8/stable, unless the service is paused (I'm just confirming that now.)

Changed in vault-charm:
status: New → Triaged