Just seen this on a K8s Jammy AWS run where the Kubernetes control-plane/scheduler nodes fail to get a correct certificate from the vault charm, leaving the scheduler unable to query resources from the API server. As a result, all pods are stuck in Pending.
Relevant logs from the scheduler's journal:
Mar 04 06:53:14 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:14.739507 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:21 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:21.255804 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.CSIStorageCapacity: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csistoragecapacities?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:21 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:21.255844 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.CSIStorageCapacity: failed to list *v1.CSIStorageCapacity: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csistoragecapacities?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:22.668905 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.StatefulSet: Get "https://127.0.0.1:6443/apis/apps/v1/statefulsets?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:22.668943 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.StatefulSet: failed to list *v1.StatefulSet: Get "https://127.0.0.1:6443/apis/apps/v1/statefulsets?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:22.833359 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.PersistentVolumeClaim: Get "https://127.0.0.1:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:22 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:22.833400 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Get "https://127.0.0.1:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:27.195447 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:27.195484 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:27.823391 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods?fieldSelector=status.phase%21%3DSucceeded%2Cstatus.phase%21%3DFailed&limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:27 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:27.823430 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods?fieldSelector=status.phase%21%3DSucceeded%2Cstatus.phase%21%3DFailed&limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:33 ip-172-31-33-162 kube-scheduler.daemon[138112]: W0304 06:53:33.659088 138112 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.ReplicationController: Get "https://127.0.0.1:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Mar 04 06:53:33 ip-172-31-33-162 kube-scheduler.daemon[138112]: E0304 06:53:33.659124 138112 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: Get "https://127.0.0.1:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Test run: https://solutions.qa.canonical.com/v2/testruns/e4753baa-9a6b-4b3f-ae31-5d6fd3c57064/
Artifacts: https://oil-jenkins.canonical.com/artifacts/e4753baa-9a6b-4b3f-ae31-5d6fd3c57064/index.html
Crashdump: https://oil-jenkins.canonical.com/artifacts/e4753baa-9a6b-4b3f-ae31-5d6fd3c57064/generated/generated/kubernetes-aws/juju-crashdump-kubernetes-aws-2023-03-04-06.52.23.tar.gz
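For reference, the mismatch is easy to confirm on an affected unit. Below is a minimal sketch (assuming the charm-managed CA sits at /root/cdk/ca.crt, as on Charmed Kubernetes control-plane units, and that kube-apiserver listens on 127.0.0.1:6443 as in the logs above) that attempts a TLS handshake with kube-apiserver using that CA; on a unit in this state it fails with the same x509 verification error:

import socket
import ssl

# Assumptions: charm-managed CA at /root/cdk/ca.crt, apiserver on 127.0.0.1:6443.
CA_PATH = "/root/cdk/ca.crt"
HOST, PORT = "127.0.0.1", 6443

ctx = ssl.create_default_context(cafile=CA_PATH)
ctx.check_hostname = False  # only the trust chain matters here, not the SAN

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock):
            print("handshake OK: serving certificate chains to", CA_PATH)
except ssl.SSLCertVerificationError as err:
    print("serving certificate does not chain to this CA:", err)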
This is a race condition between build_kubeconfig, start_control_plane, and configure_apiserver.
In build_kubeconfig, a new client kubeconfig was written[1] with the new CA. Later in build_kubeconfig, it tried to fetch kube-scheduler's token from a secret[2]. Fetching the secret failed:
2023-03-04 02:53:50 INFO unit.kubernetes-control-plane/0.juju-log server.go:316 certificates:55: Executing ['kubectl', '--kubeconfig=/root/.kube/config', 'get', 'secrets', '-n', 'kube-system', '--field-selector', 'type=juju.is/token-auth', '-o', 'json']
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 E0304 02:53:50.359454  135532 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": x509: certificate signed by unknown authority
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 E0304 02:53:50.365873  135532 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": x509: certificate signed by unknown authority
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 E0304 02:53:50.369305  135532 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": x509: certificate signed by unknown authority
2023-03-04 02:53:50 WARNING unit.kubernetes-control-plane/0.certificates-relation-changed logger.go:60 Unable to connect to the server: x509: certificate signed by unknown authority
This is because the client kubeconfig had the new CA, but kube-apiserver had not been restarted yet, so it was still serving with a server certificate from the old CA. Since build_kubeconfig could not obtain the secret, it skipped writing a new kubeconfig for kube-scheduler.
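Schematically, the ordering inside build_kubeconfig looks roughly like this (an illustrative paraphrase of the code at [1] and [2], not the charm's actual implementation; write_kubeconfig is a made-up stand-in for the charm's kubeconfig-writing helper, and the scheduler kubeconfig path is an assumption):

import subprocess

def write_kubeconfig(path, ca, token=None):
    # Made-up stand-in for the charm's kubeconfig-writing helper.
    ...

def build_kubeconfig(new_ca):
    # 1. The root client kubeconfig is rewritten with the NEW CA first.
    write_kubeconfig("/root/.kube/config", ca=new_ca)

    # 2. The scheduler's token is then fetched through that kubeconfig
    #    (the same kubectl call seen failing in the log above). Since
    #    kube-apiserver is still serving a certificate from the OLD CA,
    #    the call fails TLS verification and no token comes back.
    try:
        output = subprocess.check_output([
            "kubectl", "--kubeconfig=/root/.kube/config",
            "get", "secrets", "-n", "kube-system",
            "--field-selector", "type=juju.is/token-auth",
            "-o", "json",
        ])
    except subprocess.CalledProcessError:
        return  # 3. The scheduler's kubeconfig is silently left stale.

    # 4. Only reached once the apiserver serves a cert the new CA signed.
    write_kubeconfig("/root/cdk/kubescheduler/config", ca=new_ca, token=...)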
During start_control_plane, the charm restarted kube-scheduler to pick up the new CA. However, since no new kubeconfig had been written for kube-scheduler, it started with the old kubeconfig instead, still using the old CA.
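The stale kubeconfig should be visible on disk. As a quick diagnostic (a sketch; both paths are assumptions based on Charmed Kubernetes defaults and may differ by release), compare the CA embedded in the scheduler's kubeconfig against the CA the charm just received:

import base64
import pathlib
import re

# Assumed paths on a Charmed Kubernetes control-plane unit.
KUBECONFIG = "/root/cdk/kubescheduler/config"
CA_PATH = "/root/cdk/ca.crt"

raw = pathlib.Path(KUBECONFIG).read_text()
match = re.search(r"certificate-authority-data:\s*(\S+)", raw)
embedded_ca = base64.b64decode(match.group(1)).decode() if match else ""
on_disk_ca = pathlib.Path(CA_PATH).read_text()

# On an affected unit this prints False: the kubeconfig still carries the
# old CA while the on-disk file already holds the new one.
print("kubeconfig CA matches on-disk CA:", embedded_ca.strip() == on_disk_ca.strip())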
Later, configure_apiserver ran and restarted kube-apiserver with the new server certificate. This restored the charm's ability to fetch secrets, but the damage had already been done: kube-scheduler was never restarted again, so it kept running against the old CA.
[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/d9f276f1e54c22f3f5d739c82f1a3b5894d140c7/reactive/kubernetes_control_plane.py#L2151-L2157
[2]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/d9f276f1e54c22f3f5d739c82f1a3b5894d140c7/reactive/kubernetes_control_plane.py#L2198-L2206