Unit needs restart after certificate change

Bug #1903077 reported by Martin Kalcok
Affects: Kubernetes Control Plane Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

The kubernetes-master unit needs to be restarted after a PKI change.

Even though the unit runs the 'certificates-relation-changed' hook when the certificate changes, and it eventually settles in the active/idle state, it won't function properly. New pods will be stuck in the 'Pending' state until the user manually restarts the unit (using actions).

Steps to reproduce (a rough command sequence is sketched after the list):

* Deploy the kubernetes-core bundle
* Wait for it to settle
* Remove the easyrsa application
* Deploy the bundle again (this redeploys easyrsa and generates a new PKI)
* Deploy some Kubernetes pods
* Observe the pods being stuck in the 'Pending' state
* Run the kubernetes-worker 'restart' action
* Observe the pods getting deployed properly
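For reference, a rough command sequence for the steps above, assuming a fresh Juju 2.x model; the nginx deployment and the kubernetes-worker/0 unit name are only examples:

juju deploy kubernetes-core
juju status                                        # wait until all units are active/idle
juju remove-application easyrsa
juju deploy kubernetes-core                        # re-adds easyrsa, which generates a new PKI
kubectl create deployment nginx --image=nginx      # deploy some pods
kubectl get pods                                   # pods stay stuck in 'Pending'
juju run-action kubernetes-worker/0 restart --wait
kubectl get pods                                   # pods now get scheduled and run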

George Kraft (cynerva) wrote:

Thanks for the report and reproduction steps. I can reproduce this, although it appears to be a race condition, so it may not reproduce every time.

In my case, on kubernetes-master, both kube-controller-manager and kube-scheduler were failing to reach kube-apiserver due to "x509: certificate signed by unknown authority". This occurred because build_kubeconfig[1] ran before store_ca[2] and ca_written[3]. So while the charm did detect the change and restart services, it did so using kubeconfigs that were rendered with the old CA. On the next hook, it re-ran build_kubeconfig and rendered new kubeconfigs with the correct CA, but did not restart services.

To fix this, the charm's handling of the tls_client.ca.written flag will need to be adjusted to ensure new kubeconfigs are rendered before the services are restarted (a rough sketch follows the links below).

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/1467e9ba8332c2959dd8f908aa29cee18f90e540/reactive/kubernetes_master.py#L1912
[2]: https://github.com/charmed-kubernetes/layer-tls-client/blob/9bfaafcd15ecdbfb435fd35c28057372f7d62e2b/reactive/tls_client.py#L19
[3]: https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/1467e9ba8332c2959dd8f908aa29cee18f90e540/reactive/kubernetes_master.py#L1159
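For illustration only, a minimal sketch of that ordering fix using the charms.reactive flag API. Apart from tls_client.ca.written, the flag names, the restart handler, and the snap service names are assumptions, not the charm's actual code:

from charmhelpers.core.host import service_restart
from charms.reactive import when, set_flag, clear_flag

# Assumed systemd unit names for the snap-based control plane services.
CONTROL_PLANE_SERVICES = [
    'snap.kube-apiserver.daemon',
    'snap.kube-controller-manager.daemon',
    'snap.kube-scheduler.daemon',
]

@when('tls_client.ca.written')
def handle_new_ca():
    # A new CA landed on disk: invalidate the rendered kubeconfigs first so
    # that build_kubeconfig re-renders them against the new CA.
    clear_flag('kubernetes-master.kubeconfig.rendered')  # hypothetical flag
    set_flag('kubernetes-master.ca.changed')             # hypothetical flag
    clear_flag('tls_client.ca.written')

@when('kubernetes-master.ca.changed', 'kubernetes-master.kubeconfig.rendered')
def restart_with_new_ca():
    # Restart only after the kubeconfigs have been re-rendered, so the
    # services never come back up pointing at the old CA.
    for svc in CONTROL_PLANE_SERVICES:
        service_restart(svc)
    clear_flag('kubernetes-master.ca.changed')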

Changed in charm-kubernetes-master:
importance: Undecided → Medium
status: New → Triaged
Xav Paice (xavpaice) wrote:

I just found a case where the certificates in use had expired. I replaced them using Vault actions:

juju run-action --wait vault/leader reissue-certificates

The certificate files were updated, but some processes were left running with the old certificates; as a result, kube-controller-manager was falling over and I couldn't use kubectl.

I eventually found a bunch of processes like this:

/var/lib/juju/agents/unit-kubernetes-master-0/.venv/bin/python3 /var/lib/juju/agents/unit-kubernetes-master-0/charm/../.venv/bin/gunicorn --bind 1.2.4.5:5000 --capture-output --certfile /root/cdk/server.crt --disable-redirect-access-to-syslog --error-logfile auth-webhook.log --keyfile /root/cdk/server.key --log-level debug --pid auth-webhook.pid --workers 13 --worker-class aiohttp.worker.GunicornWebWorker auth-webhook:app

That turned out to be cdk.master.auth-webhook.service, which also needed a restart after the certificates had changed.
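In that situation, assuming systemd is managing the unit (as it does on Charmed Kubernetes nodes), the stale webhook workers can be cycled with:

sudo systemctl restart cdk.master.auth-webhook.service

so that gunicorn re-reads the renewed /root/cdk/server.crt and /root/cdk/server.key.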
