test_clusterip_service_endpoint fails, kube-proxy stuck with stale CA

Bug #1999676 reported by Alexander Balderson
Affects: Charmed Kubernetes Testing | Status: Triaged | Importance: High | Assigned to: Unassigned

Bug Description

While testing Kubernetes 1.26 on AWS with Jammy and Calico, the test_clusterip_service_endpoint test failed to connect to the hello-world pod; the connection timed out.

Looking at the status of the hello-world pod, k8s-cp/0 seems to be the unit having problems launching the pod.

The controller manager starts creating the pod:
Dec 13 20:40:49 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:49.442691 126430 replica_set.go:571] "Too few replicas" replicaSet="default/hello-world-7b8bb957c6" need=1 creating=1

and then the scheduler binds the pod to the node after a few more POSTs/PUTs:
Dec 13 20:40:49 ip-172-31-34-74 kube-scheduler.daemon[122674]: I1213 20:40:49.459258 122674 schedule_one.go:252] "Successfully bound pod to node" pod="default/hello-world-7b8bb957c6-2gll9" node="ip-172-31-36-18.ec2.internal" evaluatedNodes=5 feasibleNodes=3

and then the controller manager comes back with:
Dec 13 20:40:49 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:49.463994 126430 deployment_controller.go:491] "Error syncing deployment" deployment="default/hello-world" err="Operation cannot be fulfilled on deployments.apps \"hello-world\": the object has been modified; please apply your changes to the latest version and try again"

After a few more successful POSTs and GETs, the controller comes back and deletes the pod:
Dec 13 20:40:53 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:53.844319 126430 replica_set.go:607] "Too many replicas" replicaSet="default/hello-world-7b8bb957c6" need=0 deleting=1
Dec 13 20:40:53 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:53.844361 126430 replica_set.go:235] "Found related ReplicaSets" replicaSet="default/hello-world-7b8bb957c6" relatedReplicaSets="[default/hello-world-7b8bb957c6 default/hello-world-8d968cd7b]"
Dec 13 20:40:53 ip-172-31-34-74 kube-controller-manager.daemon[126430]: I1213 20:40:53.844429 126430 controller_utils.go:592] "Deleting pod" controller="hello-world-7b8bb957c6" pod="default/hello-world-7b8bb957c6-2gll9"

I'm not quite sure what caused this error; it could have been a one-off.

testrun at:
https://solutions.qa.canonical.com/v2/testruns/7ce07815-6d4e-4866-af87-40c190554bd2/
crashdump at:
https://oil-jenkins.canonical.com/artifacts/7ce07815-6d4e-4866-af87-40c190554bd2/generated/generated/kubernetes-aws/juju-crashdump-kubernetes-aws-2022-12-13-20.43.43.tar.gz
bundle at:
https://oil-jenkins.canonical.com/artifacts/7ce07815-6d4e-4866-af87-40c190554bd2/generated/generated/kubernetes-aws/bundle.yaml

George Kraft (cynerva) wrote:

The kube-proxy service on kubernetes-control-plane/1 seems to have gotten stuck in a bad state with x509 errors:

Dec 13 20:44:56 ip-172-31-33-85 kube-proxy.daemon[135314]: W1213 20:44:56.142414 135314 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Service: Get "https://172.31.33.85:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": x509: certificate signed by unknown authority
Dec 13 20:44:56 ip-172-31-33-85 kube-proxy.daemon[135314]: E1213 20:44:56.142478 135314 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.31.33.85:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": x509: certificate signed by unknown authority

172.31.33.85 is also kubernetes-control-plane/1, so kube-apiserver and kube-proxy on the same host disagree about the CA.
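One way to confirm the mismatch on the affected unit is to compare the CA embedded in kube-proxy's kubeconfig against the CA on disk. A rough Python sketch; the file paths are assumptions about where the charm writes these files, not values taken from the crashdump:

import base64
import yaml

# Assumed Charmed Kubernetes paths; adjust to wherever the charm actually writes them.
KUBEPROXY_KUBECONFIG = "/root/cdk/kubeproxyconfig"
CA_ON_DISK = "/root/cdk/ca.crt"

with open(KUBEPROXY_KUBECONFIG) as f:
    kubeconfig = yaml.safe_load(f)

# kube-proxy trusts whatever CA is embedded in its kubeconfig
# (assuming certificate-authority-data rather than a file reference).
embedded_ca = base64.b64decode(
    kubeconfig["clusters"][0]["cluster"]["certificate-authority-data"]
)

with open(CA_ON_DISK, "rb") as f:
    disk_ca = f.read()

# If these differ, kube-proxy is still trusting the pre-reissue CA and will
# reject the apiserver's new serving certificate with "x509: certificate
# signed by unknown authority".
print("CA matches:", embedded_ca.strip() == disk_ca.strip())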

This is a race condition. Timeline:

20:22:22 foundation.log: vault reissue-certificates action finishes
20:28:30 kcp1: certificates-relation-changed hook starts
20:28:46 kcp1: build_kubeconfig handler writes new kubeconfig for kube-proxy
20:28:53 kcp1: store_ca handler writes new CA to disk
20:29:05 kcp1: start_control_plane handler restarts kube-proxy
20:31:47 kcp1: certificates-relation-changed hook finishes

The kubeconfig written by build_kubeconfig had the old CA, because store_ca had not run yet. So when kube-proxy got restarted, it had the old CA in its kubeconfig. No further restarts of kube-proxy occurred, so it was stuck with the old CA config forever.
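To make the ordering concrete, here is a small, purely illustrative Python sketch of the hook sequence above; the handler names follow the timeline, but the bodies are hypothetical stand-ins, not the actual charm code:

# State on the unit: the CA on disk and what the rendered kubeconfig embeds.
ca_on_disk = "OLD-CA"
kubeconfig_ca = None
kube_proxy_trusted_ca = None

def build_kubeconfig():
    # Renders kube-proxy's kubeconfig from whatever CA is on disk right now.
    global kubeconfig_ca
    kubeconfig_ca = ca_on_disk

def store_ca():
    # Writes the newly issued CA to disk.
    global ca_on_disk
    ca_on_disk = "NEW-CA"

def start_control_plane():
    # Restarting kube-proxy makes it trust the CA embedded in its kubeconfig.
    global kube_proxy_trusted_ca
    kube_proxy_trusted_ca = kubeconfig_ca

# Order observed in the hook (20:28:46, 20:28:53, 20:29:05):
build_kubeconfig()
store_ca()
start_control_plane()

print(kube_proxy_trusted_ca)  # "OLD-CA" -> x509: certificate signed by unknown authority

Re-rendering the kubeconfig after store_ca runs, or restarting kube-proxy again once the new CA lands on disk, would avoid handing kube-proxy a kubeconfig built from the stale CA.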

Changed in charmed-kubernetes-testing:
importance: Undecided → High
status: New → Triaged
George Kraft (cynerva) wrote:

About the "Error syncing deployment" log message from kube-controller-manager, I think that's a fairly common transient error and the pod does appear to have come up fine in the end. With kube-proxy not working properly on kubernetes-control-plane/1, though, it could not forward traffic from the service IP to the pod.
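For context (not from the bug logs), the "object has been modified" message is Kubernetes' optimistic-concurrency conflict: the update carried a stale resourceVersion, and clients recover by re-reading the object and retrying, which is why it is normally transient. A rough sketch of that retry pattern with the Python kubernetes client, using hypothetical values:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()  # or load_incluster_config() inside a pod
apps = client.AppsV1Api()

for _ in range(5):
    # Read the latest version of the Deployment (carries its resourceVersion).
    dep = apps.read_namespaced_deployment("hello-world", "default")
    dep.spec.replicas = 1
    try:
        # Rejected with HTTP 409 if the resourceVersion is already stale.
        apps.replace_namespaced_deployment("hello-world", "default", dep)
        break
    except ApiException as e:
        if e.status != 409:
            raise  # anything other than a conflict is a real error
        # Conflict: someone else updated the object first; loop and retry
        # with a fresh read.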

summary:
- test_clusterip_service_endpoint fails, hello-world fails to start because the object has been modified
+ test_clusterip_service_endpoint fails, kube-proxy stuck with stale CA