Kubernetes CP stays blocked waiting for auth-webhook tokens when connected to keystone
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Kubernetes Control Plane Charm | Fix Released | High | George Kraft |
Bug Description
Deploying k8s 1.24 stable with keystone latest/stable, one unit stays blocked waiting for the auth-webhook token.
There are a number of errors in the cdk.master logs:
[2022-06-15 04:12:23 +0000] [102814] [INFO] Refreshing secrets
[2022-06-15 04:12:23 +0000] [102814] [WARNING] Unable to load secrets (1): error: You must be logged in to the server (Unauthorized)
We do see this pass in the lab, though, about half the time. I'm wondering if there is a race where one k8s control plane unit sets up the tokens and the other unit doesn't get its token in time before auth is required.
Testrun can be found at:
https:/
crashdump at:
https:/
bundle at:
https:/
All occurrences of this bug can be found at:
https:/
crashdump is also attached
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
tags: added: backport-needed
tags: removed: backport-needed
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
The attached crashdump is sadly missing /var/log/syslog on the affected kubernetes-control-plane unit, but I think I see the issue in this test run: https://solutions.qa.canonical.com/testruns/testRun/0d079a03-d84e-4e52-be00-c49d13fd18d9
From kube-apiserver logs:
Jun 14 23:07:09 ip-172-31-42-177 kube-apiserver.daemon[129711]: E0614 23:07:09.224541  129711 webhook.go:154] Failed to make webhook authenticator request: Post "https://172.31.42.177:5000/v1beta1?timeout=30s": x509: certificate signed by unknown authority
Jun 14 23:07:09 ip-172-31-42-177 kube-apiserver.daemon[129711]: E0614 23:07:09.224585  129711 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Post \"https://172.31.42.177:5000/v1beta1?timeout=30s\": x509: certificate signed by unknown authority]"
Looking at foundation.log, it seems Vault is initially brought up with an auto-generated root CA certificate, but then a CSR is uploaded, which changes the CA cert. Certificates get reissued, which causes most of the Charmed Kubernetes services to get restarted as expected, but the auth-webhook service never gets restarted and it continues to use the original certificate.
I've seen this before in https://bugs.launchpad.net/bugs/1956482 and I propose the same solution: the certs_changed handler [1] needs to be updated to also restart the cdk.master.auth-webhook service.
[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/fb4460092b20e1151ee30672f3bdd3e4366717ed/reactive/kubernetes_control_plane.py#L1351
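The shape of the proposed fix can be sketched in Python. This is only an illustration, not the charm's actual handler: the service list, the `service_restart` stand-in, and the `certs_changed` body here are assumptions made for the sketch; the real handler lives in reactive/kubernetes_control_plane.py and restarts the control-plane daemons through charmhelpers.

```python
# Hypothetical sketch of the proposed certs_changed fix.
# service_restart is a stand-in for charmhelpers.core.host.service_restart
# so the snippet runs anywhere; here it just records what was restarted.
restarted = []

def service_restart(name):
    restarted.append(name)

# Illustrative list of services the handler already restarts when
# certificates are reissued (the real list is in the charm source).
CONTROL_PLANE_SERVICES = [
    "snap.kube-apiserver.daemon",
    "snap.kube-controller-manager.daemon",
    "snap.kube-scheduler.daemon",
]

# The service the bug report says is missed: it serves webhook authn
# over TLS, so it must be restarted to pick up the reissued CA cert.
AUTH_WEBHOOK_SERVICE = "cdk.master.auth-webhook"

def certs_changed():
    for svc in CONTROL_PLANE_SERVICES:
        service_restart(svc)
    # Proposed addition: restart auth-webhook alongside the daemons.
    service_restart(AUTH_WEBHOOK_SERVICE)

certs_changed()
```

Without the final `service_restart` call, auth-webhook keeps serving the pre-reissue certificate, which matches the "x509: certificate signed by unknown authority" errors seen in the kube-apiserver logs above.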