Unit needs restart after certificate change

Bug #1903077 reported by Martin Kalcok
Affects: Kubernetes Control Plane Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

The kubernetes-master unit needs to be restarted after a PKI change.

Even though the unit runs the 'certificates-relation-changed' hook when the certificate changes, and it eventually settles in the active/idle state, it won't function properly. New pods will be stuck in the 'Pending' state until the user manually restarts the unit (using actions).

Steps to reproduce (a rough command sequence is sketched after the list):

* Deploy the kubernetes-core bundle
* Wait for it to settle
* Remove the easyrsa application
* Deploy the bundle again (this redeploys easyrsa and generates a new PKI)
* Deploy some Kubernetes pods
* Observe the pods being stuck in the 'Pending' state
* Run the kubernetes-worker 'restart' action
* Observe the pods getting deployed properly
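For reference, a rough command sequence for the steps above, assuming a fresh Juju 2.x model; the nginx deployment and the kubernetes-worker/0 unit name are only examples:

juju deploy kubernetes-core
juju status                                        # wait until all units are active/idle
juju remove-application easyrsa
juju deploy kubernetes-core                        # re-adds easyrsa, which generates a new PKI
kubectl create deployment nginx --image=nginx      # deploy some pods
kubectl get pods                                   # pods stay stuck in 'Pending'
juju run-action kubernetes-worker/0 restart --wait
kubectl get pods                                   # pods now get scheduled and run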

George Kraft (cynerva) wrote:

Thanks for the report and reproduction steps. I can reproduce this, although it appears to be a race condition, so it may not reproduce every time.

In my case, on kubernetes-master, both kube-controller-manager and kube-scheduler were failing to reach kube-apiserver due to "x509: certificate signed by unknown authority". This occurred because build_kubeconfig[1] ran before store_ca[2] and ca_written[3]. So while the charm did detect the change and restart services, it did so using kubeconfigs that were rendered with the old CA. On the next hook, it re-ran build_kubeconfig and rendered new kubeconfigs with the correct CA, but did not restart services.

To fix this, the charm's handling of the tls_client.ca.written flag will need to be adjusted to ensure new kubeconfigs are rendered before the services are restarted (a rough sketch follows the links below).

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/1467e9ba8332c2959dd8f908aa29cee18f90e540/reactive/kubernetes_master.py#L1912
[2]: https://github.com/charmed-kubernetes/layer-tls-client/blob/9bfaafcd15ecdbfb435fd35c28057372f7d62e2b/reactive/tls_client.py#L19
[3]: https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/1467e9ba8332c2959dd8f908aa29cee18f90e540/reactive/kubernetes_master.py#L1159
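For illustration only, a minimal sketch of that ordering fix using the charms.reactive flag API. Apart from tls_client.ca.written, the flag names, the restart handler, and the snap service names are assumptions, not the charm's actual code:

from charmhelpers.core.host import service_restart
from charms.reactive import when, set_flag, clear_flag

# Assumed systemd unit names for the snap-based control plane services.
CONTROL_PLANE_SERVICES = [
    'snap.kube-apiserver.daemon',
    'snap.kube-controller-manager.daemon',
    'snap.kube-scheduler.daemon',
]

@when('tls_client.ca.written')
def handle_new_ca():
    # A new CA landed on disk: invalidate the rendered kubeconfigs first so
    # that build_kubeconfig re-renders them against the new CA.
    clear_flag('kubernetes-master.kubeconfig.rendered')  # hypothetical flag
    set_flag('kubernetes-master.ca.changed')             # hypothetical flag
    clear_flag('tls_client.ca.written')

@when('kubernetes-master.ca.changed', 'kubernetes-master.kubeconfig.rendered')
def restart_with_new_ca():
    # Restart only after the kubeconfigs have been re-rendered, so the
    # services never come back up pointing at the old CA.
    for svc in CONTROL_PLANE_SERVICES:
        service_restart(svc)
    clear_flag('kubernetes-master.ca.changed')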

Changed in charm-kubernetes-master:
importance: Undecided → Medium
status: New → Triaged
Xav Paice (xavpaice) wrote:

I just found a case where the certificates in use had expired. I replaced them using Vault actions:

juju run-action --wait vault/leader reissue-certificates

The certificate files were updated, but some processes were left running with the old certificates; as a result, kube-controller-manager was falling over and I couldn't use kubectl.

I eventually found a bunch of processes like this:

/var/lib/juju/agents/unit-kubernetes-master-0/.venv/bin/python3 /var/lib/juju/agents/unit-kubernetes-master-0/charm/../.venv/bin/gunicorn --bind 1.2.4.5:5000 --capture-output --certfile /root/cdk/server.crt --disable-redirect-access-to-syslog --error-logfile auth-webhook.log --keyfile /root/cdk/server.key --log-level debug --pid auth-webhook.pid --workers 13 --worker-class aiohttp.worker.GunicornWebWorker auth-webhook:app

That turned out to be cdk.master.auth-webhook.service, which also needed a restart after the certificates had changed.
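In that situation, assuming systemd is managing the unit (as it does on Charmed Kubernetes nodes), the stale webhook workers can be cycled with:

sudo systemctl restart cdk.master.auth-webhook.service

so that gunicorn re-reads the renewed /root/cdk/server.crt and /root/cdk/server.key.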
