kubectl snap creates thousands of hanging systemd scope units
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Kubernetes Control Plane Charm | Incomplete | Undecided | Unassigned |
snapd | New | Undecided | Unassigned |
Bug Description
We found over 8.5k snap kubectl systemd scope units on each of our Kubernetes master nodes. As a result, the systemd and /sbin/init processes spike to 100% CPU usage, hosing the entire cluster.
$ sudo systemctl list-units --type scope | grep snap | wc -l
8643
Typical entries look like these:
snap.
snap.
snap.
snap.
snap.
snap.
snap.
snap.
Please note that all of them are in status active/running.
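To see which snap is responsible for most of the scopes, the listing can be grouped by name prefix. This is only a rough sketch: the helper name `count_scopes_by_prefix` is made up for illustration, and it assumes that the second dot-separated field of a `snap.*.scope` unit name identifies the snap, which the truncated entries above do not confirm.

```shell
#!/bin/sh
# Sketch: group snap scope units by their name prefix.
# Assumption: unit names look like "snap.<something>.<...>.scope" and the
# second dot-separated field is enough to tell the snaps apart.

# Reads `systemctl list-units` lines on stdin and prints "<count> <prefix>",
# highest count first. Lines whose first column is not a snap scope are ignored.
count_scopes_by_prefix() {
    awk '$1 ~ /^snap\..*\.scope$/ {
             split($1, p, ".")          # p[1] = "snap", p[2] = next field
             n[p[1] "." p[2]]++
         }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Intended use on a live system (not executed here):
#   systemctl list-units --type=scope --no-legend --plain | count_scopes_by_prefix
```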
After manually stopping them using this one-liner:
sudo systemctl list-units --type scope | grep kubectl | awk '{print $1}' | xargs sudo systemctl stop
the number goes down to the expected value:
sudo systemctl list-units --type scope | grep snap | wc -l
11
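With thousands of units, the one-liner above can be made a little more defensive. The following is a minimal sketch, not a fix for the underlying leak: the function names are hypothetical, matching on the string "kubectl" is carried over from the one-liner, and the batch size of 100 is an arbitrary choice. `xargs -r` is a GNU extension, which is available on Ubuntu.

```shell
#!/bin/sh
# Sketch: stop leaked scope units in batches.
# Assumption: every leaked unit has the given pattern (e.g. "kubectl") in its name.

# Reads `systemctl list-units` lines on stdin and prints the unit names
# (first column) that end in ".scope" and contain the pattern as a plain string.
filter_scopes() {
    awk -v pat="$1" '$1 ~ /\.scope$/ && index($1, pat) { print $1 }'
}

# Lists matching scope units on the live system. --no-legend and --plain
# drop the header, footer and status markers so only unit rows remain.
list_leaked_scopes() {
    systemctl list-units --type=scope --no-legend --plain | filter_scopes "$1"
}

# Stops matching units, at most 100 per systemctl call.
# -r makes xargs skip the call entirely when nothing matches.
stop_leaked_scopes() {
    list_leaked_scopes "$1" | xargs -r -n 100 sudo systemctl stop
}

# Intended use (not executed here):
#   list_leaked_scopes kubectl | wc -l
#   stop_leaked_scopes kubectl
```

Filtering on the first column only avoids accidentally matching the pattern in a unit description, which a plain `grep` on the whole line could do.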
And the system becomes much snappier again. The increased load caused by this issue leads to transient failures in communication between API servers and kubelets, resulting in errors similar to this: [0]. We then end up restarting the kubelets, which is the only way to restore connectivity between kubelets and API servers.
Additionally, we see a lot of similarities with this bug [1] reported for etcdctl. Both kubectl and the etcdctl from that bug run as snaps and leave behind thousands of systemd scope units, slowing down the system.
Versions:
kubernetes-master charm: 1.18.15 charm revision: 895
Ubuntu: 18.04.5 LTS
kubectl snap: 1.18.15 1.18/stable
$ snap --version
snap 2.50
snapd 2.50
series 16
ubuntu 18.04
kernel 5.4.0-1046-azure
[0] https:/
[1] https:/
Changed in charm-kubernetes-master:
status: Incomplete → New
We just found the issue on another Kubernetes cluster. This time there were 13299 leaked systemd scopes on one of the masters, making the system even more unresponsive. The other two masters were at ~8k and were slightly more responsive, so there appears to be a correlation between the number of leaked scopes and general system responsiveness.