Kubernetes Control Plane Charm

Control plane crashloop due to cert regeneration caused by inconsistent SANs

Bug #2022859 reported by Arun Neelicattu on 2023-06-04

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Kubernetes API Load Balancer	New	Undecided	Unassigned
	Kubernetes Control Plane Charm	New	Undecided	Unassigned

Bug Description

Observed Behaviour
------------------
1. kubernetes-control-plane units never get to active/idle state.
2. kubernetes-control-plane units continuously reacts to client relation / config change events causing restarts.
3. easyrsa units detect client relation changes triggering certification revocation and generates new certificates with different SANs.

Would be great if someone has a workaround for this that we can use.

Probable Root Cause
-------------------
When multiple DNS records exists for the host, python's socket.getfqdn() calls will provide inconsistent results between calls due to https://github.com/python/cpython/issues/49254.

For example the following commands were consecutively executed on one of the control plane machines.

root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
10-XXX-XXX-XXX.example.net
root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
juju-bf6a17-0.example.net

This causes the SAN list generated for the certificate request to be different every time. https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7258630cf0a5560a665ed1e4770cc8f2f52013c4/reactive/kubernetes_control_plane.py#L1570C7-L1588

This then triggers a certificate change and the cycle continues.

Proposed Fix
------------
A potential fix here could be to replace the call to `socket.getfqdn()` to the patched method in the cpython upstream issue. For convenience, I have put the code into a gist at https://gist.github.com/abn/c4165a6d288e5f7137bdec5a4db199d1.

Alternatively, you can simply replace the call with the following.

socket.getaddrinfo(socket.gethostname(), None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME)

Ideally the fix can be replicated and/or reused across all the charms that request for certs.

Workaround
----------
Should adjust the command to your specific case.

juju exec --all -- bash -c 'sudo sed -i s/"127.0.0.1 localhost"/"127.0.0.1 $(hostname -f) localhost"/ /etc/hosts'

Since the hosts file has precedence, this seems to have at least mitigated the issue for now.

See original description

Arun Neelicattu (arun-neelicattu) on 2023-06-04

description:

updated

Revision history for this message

Arun Neelicattu (arun-neelicattu) wrote on 2023-06-04:

Possibly related bug: https://bugs.launchpad.net/maas/+bug/2012801

Arun Neelicattu (arun-neelicattu) on 2023-06-04

description:

updated

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.