1 Kubernetes CP stays blocked waiting for auth-webhook tokens when connected to keystone

Bug #1978973 reported by Alexander Balderson
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: High
Assigned to: George Kraft
Milestone: 1.24+ck1

Bug Description

When deploying k8s 1.24 stable with keystone latest/stable, one unit stays blocked waiting for auth-webhook tokens.

There are many errors in cdk.master.auth-webhook.log and in the juju log on the unit (kubernetes-control-plane_0), where it tries to refresh secrets and gets an Unauthorized response.

[2022-06-15 04:12:23 +0000] [102814] [INFO] Refreshing secrets
[2022-06-15 04:12:23 +0000] [102814] [WARNING] Unable to load secrets (1): error: You must be logged in to the server (Unauthorized)

We do see this pass in the lab about half the time, though. I'm wondering if there is a race where one k8s-CP unit sets up the tokens and the other unit doesn't get its token before auth is required.

Testrun can be found at:
https://solutions.qa.canonical.com/testruns/testRun/5debbdb7-9e09-4fa8-8abe-5db4c373b969
crashdump at:
https://oil-jenkins.canonical.com/artifacts/5debbdb7-9e09-4fa8-8abe-5db4c373b969/generated/generated/kubernetes-aws/juju-crashdump-kubernetes-aws-2022-06-15-04.12.04.tar.gz
bundle at:
https://oil-jenkins.canonical.com/artifacts/5debbdb7-9e09-4fa8-8abe-5db4c373b969/generated/generated/lma-aws/bundle.yaml

All occurrences of this bug can be found at:
https://solutions.qa.canonical.com/bugs/bugs/bug/1978973

crashdump is also attached

Alexander Balderson (asbalderson) wrote :
description: updated
George Kraft (cynerva) wrote :

The attached crashdump is sadly missing /var/log/syslog on the affected kubernetes-control-plane unit, but I think I see the issue in this test run: https://solutions.qa.canonical.com/testruns/testRun/0d079a03-d84e-4e52-be00-c49d13fd18d9

From kube-apiserver logs:

Jun 14 23:07:09 ip-172-31-42-177 kube-apiserver.daemon[129711]: E0614 23:07:09.224541 129711 webhook.go:154] Failed to make webhook authenticator request: Post "https://172.31.42.177:5000/v1beta1?timeout=30s": x509: certificate signed by unknown authority
Jun 14 23:07:09 ip-172-31-42-177 kube-apiserver.daemon[129711]: E0614 23:07:09.224585 129711 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Post \"https://172.31.42.177:5000/v1beta1?timeout=30s\": x509: certificate signed by unknown authority]"
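
To check other crashdumps for the same symptom, the kube-apiserver log can be scanned for this error. The snippet below is a rough sketch and not part of the charm: the function name and the regular expression are illustrative assumptions based only on the two log lines above.

```python
import re
from typing import Iterable, List

# Matches the webhook URL in kube-apiserver errors of the form shown above.
# The optional backslashes cover the escaped quotes in the second log line.
_UNKNOWN_CA = re.compile(
    r'Post \\?"(?P<url>[^"\\]+)\\?": '
    r"x509: certificate signed by unknown authority"
)


def find_webhook_tls_failures(lines: Iterable[str]) -> List[str]:
    """Return the webhook URLs that failed TLS verification, in log order.

    Hypothetical helper: it only recognises the exact error text quoted in
    this report and makes no claim about other kube-apiserver log formats.
    """
    urls = []
    for line in lines:
        match = _UNKNOWN_CA.search(line)
        if match:
            urls.append(match.group("url"))
    return urls
```

If this returns the auth-webhook address (port 5000 in the logs above), the API server does not trust the certificate the webhook is currently serving.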

Looking at foundation.log it seems like vault is initially brought up with an auto-generated root CA certificate, but then a CSR is uploaded, which changes the CA cert. Certificates get reissued, which causes most of the Charmed Kubernetes services to get restarted as they should, but the auth-webhook service never gets restarted and it continues to use the original certificate.

I've seen this before in https://bugs.launchpad.net/bugs/1956482 and I propose the same solution: the certs_changed handler[1] needs to be updated to also restart the cdk.master.auth-webhook service.

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/fb4460092b20e1151ee30672f3bdd3e4366717ed/reactive/kubernetes_control_plane.py#L1351
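
A minimal sketch of the idea behind the proposed change follows. It is not the charm's actual handler: the function names and the `existing_services` parameter are hypothetical, and only the cdk.master.auth-webhook service name comes from this report.

```python
import subprocess
from typing import Callable, Iterable, List

# Service name taken from this bug report.
AUTH_WEBHOOK_SERVICE = "cdk.master.auth-webhook"


def systemctl_restart(service: str) -> None:
    """Restart one systemd unit; raises CalledProcessError on failure."""
    subprocess.check_call(["systemctl", "restart", service])


def restart_after_cert_change(
    existing_services: Iterable[str],
    restart: Callable[[str], None] = systemctl_restart,
) -> List[str]:
    """Restart every service that holds on to the old certificate.

    `existing_services` stands in for whatever the handler already
    restarts (an assumption, not the charm's real list). The auth-webhook
    service is appended so it picks up the reissued certificate too.
    """
    services = list(existing_services)
    if AUTH_WEBHOOK_SERVICE not in services:
        services.append(AUTH_WEBHOOK_SERVICE)
    for service in services:
        restart(service)
    return services
```

The `restart` callback is injectable only so the logic can be exercised without touching systemd; a real handler would presumably use the charm's own service helpers.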

Changed in charm-kubernetes-master:
importance: Undecided → High
assignee: nobody → George Kraft (cynerva)
status: New → In Progress
milestone: none → 1.24+ck1
George Kraft (cynerva) wrote :

I can think of two potential workarounds for this.

=== Workaround 1: Don't auto-generate the root CA cert ===

The issue occurs when the vault CA certificate is created, but then later changes. If vault was configured with auto-generate-root-ca-cert=false, then I think the CA would never change, thereby preventing the issue.

=== Workaround 2: Manually restart the auth-webhook service ===

For clusters that are already in the failing state, restarting the cdk.master.auth-webhook service manually should recover it:

juju run --application kubernetes-control-plane -- systemctl restart cdk.master.auth-webhook

George Kraft (cynerva) wrote :
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
tags: added: backport-needed
Adam Dyess (addyess)
tags: removed: backport-needed
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released