ocn rev 105 Unable to authorize approle after unseal

Bug #1889654 reported by Alexander Balderson
This bug affects 1 person
Affects       Status         Importance   Assigned to   Milestone
vault-charm   Fix Released   Critical     David Ames

Bug Description

https://solutions.qa.canonical.com/qa/testRun/18df9190-0e39-4cd1-b32d-3229b1248a3f

After initializing all 3 vault units, the units go into a bad state: they are unable to authorize approle and unable to generate the authorization token.

2020-07-30 08:24:37 DEBUG juju-log Could not retrieve app_role_id
2020-07-30 08:24:37 DEBUG jujuc server.go:211 running hook tool "juju-log"
2020-07-30 08:24:37 WARNING juju-log InternalServerError: Unable to athorize approle. This may indicate failure to communicate with the database
2020-07-30 08:24:37 DEBUG jujuc server.go:211 running hook tool "juju-log"
2020-07-30 08:24:37 ERROR juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-0/charm/reactive/vault_handlers.py", line 717, in client_approle_authorized
    vault.get_local_client()
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/tenacity/__init__.py", line 329, in wrapped_f
    return self.call(f, *args, **kw)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/tenacity/__init__.py", line 409, in call
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/tenacity/__init__.py", line 368, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/tenacity/__init__.py", line 186, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/tenacity/__init__.py", line 412, in call
    result = fn(*args, **kwargs)
  File "/var/lib/juju/agents/unit-vault-0/charm/lib/charm/vault.py", line 250, in get_local_client
    raise VaultNotReady("Cannot initialise local client")
lib.charm.vault.VaultNotReady: Vault is not ready (Cannot initialise local client)

2020-07-30 08:24:37 DEBUG jujuc server.go:211 running hook tool "status-set"
2020-07-30 08:24:37 DEBUG jujuc server.go:211 running hook tool "relation-set"

summary: - ocn rev 105 Unable to athorize approle after unseal
+ ocn rev 105 Unable to authorize approle after unseal
tags: added: cdo-qa cdoqa-release-blocker foundations-engine
Changed in vault-charm:
importance: Undecided → Critical
assignee: nobody → Alex Kavanagh (ajkavanagh)
tags: added: cdo-release-blocker
removed: cdoqa-release-blocker
David Ames (thedac) wrote :

The following changes introduced a gate health check, client_approle_authorized, to handle database topology changes (rolling restarts, pause/resume, etc.). It checks that the local charm can authorize itself. A tenacity retry was also added.
https://review.opendev.org/#/c/740086/
https://review.opendev.org/#/c/739129/

These changes have caused delays during long update-status hook executions.

As it turns out, SQA uses a 5-minute TTL when creating their token:
juju run -u vault/leader 'export VAULT_TOKEN=<token> && export VAULT_ADDR=http://127.0.0.1:8200 && /snap/bin/vault token create --ttl=5m'

If an update-status hook is running ahead of the above juju run, the juju run times out.
If update-status runs after the juju run but before the authorize-charm action is run, we get the following "permission denied" error because the token TTL has been exceeded.

2020-08-02 06:18:05 ERROR juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-0/charm/actions/authorize-charm", line 185, in main
    action(args)
  File "/var/lib/juju/agents/unit-vault-0/charm/actions/authorize-charm", line 45, in authorize_charm_action
    role_id = vault.setup_charm_vault_access(action_config['token'])
  File "/var/lib/juju/agents/unit-vault-0/charm/lib/charm/vault.py", line 213, in setup_charm_vault_access
    enable_approle_auth(client)
  File "/var/lib/juju/agents/unit-vault-0/charm/lib/charm/vault.py", line 178, in enable_approle_auth
    if 'approle/' not in client.list_auth_backends():
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/hvac/v1/__init__.py", line 1738, in list_auth_backends
    return self._adapter.get('/v1/sys/auth').json()
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/hvac/adapters.py", line 90, in get
    return self.request('get', url, **kwargs)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/hvac/adapters.py", line 233, in request
    utils.raise_for_error(response.status_code, text, errors=errors)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/hvac/utils.py", line 33, in raise_for_error
    raise exceptions.Forbidden(message, errors=errors)
hvac.exceptions.Forbidden: permission denied

Root cause:
The gate client_approle_authorized, which checks for approle authorization, is called before the charm has been authorized. This causes tenacity retries and long update-status hook executions, which ultimately exceed the 5-minute token TTL.
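
To illustrate the timing problem, here is a minimal sketch of a retried check that can never succeed. It is not the charm's code: call_with_retries is a hypothetical stand-in for the tenacity decorator, VaultNotReady is redefined locally, and the attempt count and wait used in the note below are assumed values, not the charm's actual retry settings.

```python
class VaultNotReady(Exception):
    """Local stand-in for the charm's lib.charm.vault.VaultNotReady."""


def call_with_retries(fn, attempts, wait_s, sleep):
    """Call fn up to `attempts` times, sleeping wait_s between failures.

    Hypothetical stand-in for a tenacity retry with reraise enabled:
    the last VaultNotReady is re-raised once the attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except VaultNotReady as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(wait_s)
    raise last_exc


def unauthorized_client():
    """Mimic get_local_client() before authorize-charm has been run."""
    raise VaultNotReady("Cannot initialise local client")


def time_blocked_by_gate(attempts, wait_s):
    """Return the seconds spent waiting when the gate cannot succeed."""
    waited = []  # records each requested sleep instead of really sleeping
    try:
        call_with_retries(unauthorized_client, attempts, wait_s, waited.append)
    except VaultNotReady:
        pass
    return sum(waited)
```

With assumed values of 10 attempts and a 30 second wait, time_blocked_by_gate(10, 30) returns 270, so a single hook execution would already consume most of a 300 second token TTL before the action gets a chance to run.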

TRIAGE:
Add a check in client_approle_authorized for leader setting of local-charm-access-id which is set during the authorize-charm action.
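
A rough sketch of what such a check could look like follows. This is not the merged fix: leader_get and get_local_client are passed in as parameters (an assumption made so the sketch runs without Juju or Vault), and VaultNotReady is redefined locally as a stand-in for the charm's exception.

```python
class VaultNotReady(Exception):
    """Local stand-in for the charm's lib.charm.vault.VaultNotReady."""


def client_approle_authorized(leader_get, get_local_client):
    """Return True only if the local charm can authorize itself.

    leader_get: callable taking a leader-settings key and returning its
        value or None (hypothetical injection point for this sketch).
    get_local_client: callable that raises VaultNotReady on failure.
    """
    # local-charm-access-id is set during the authorize-charm action.
    # If it is missing, the charm has not been authorized yet, so skip
    # the retried client call entirely instead of waiting on it.
    if not leader_get("local-charm-access-id"):
        return False
    try:
        get_local_client()
        return True
    except VaultNotReady:
        return False
```

The point of the early return is that an unauthorized charm answers immediately, so update-status no longer sits in retries while the short-lived token expires.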

Changed in vault-charm:
status: New → Triaged
assignee: Alex Kavanagh (ajkavanagh) → David Ames (thedac)
milestone: none → 20.08
David Ames (thedac) wrote :

My bug fix: https://review.opendev.org/#/c/744536/

Alex and I had a further discussion about the vault charm that I will summarize here. The changes mentioned in comment #1 were necessary because the charm runs state-changing functions during update-status hook executions. Another approach might be to revert the original changes and stop the charm from executing these functions during an update-status hook. This might also be a good 20.10 goal for the charm.
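
As a sketch of that alternative only, the guard below skips a state-changing function when the current hook is update-status. It assumes the hook name is read from the JUJU_HOOK_NAME environment variable that Juju sets for hook executions; run_unless_update_status is a hypothetical helper, not something in the charm.

```python
import os


def is_update_status_hook(environ=None):
    """Return True if the current hook is update-status.

    Assumes Juju exposes the hook name in JUJU_HOOK_NAME; environ can be
    passed explicitly so the check is testable outside a hook context.
    """
    environ = os.environ if environ is None else environ
    return environ.get("JUJU_HOOK_NAME") == "update-status"


def run_unless_update_status(fn, environ=None):
    """Run the state-changing callable fn unless in update-status.

    Returns fn's result, or None when the call was skipped.
    """
    if is_update_status_hook(environ):
        return None
    return fn()
```

In the charm itself this would more likely be expressed as a precondition on the relevant reactive handlers; the helper above only shows the idea.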

Changed in vault-charm:
status: Triaged → In Progress
David Ames (thedac) wrote :

This [0] has landed and needs to be verified by SQA.

[0] https://review.opendev.org/#/c/744536/

Changed in vault-charm:
status: In Progress → Fix Committed
Changed in vault-charm:
status: Fix Committed → Fix Released