2021-05-18 14:30:19 |
Przemyslaw Lal |
bug |
|
|
added bug |
2021-05-18 14:30:19 |
Przemyslaw Lal |
attachment added |
|
Full list of leftover systemd scopes https://bugs.launchpad.net/bugs/1928806/+attachment/5498496/+files/scope_bug.txt |
|
2021-05-18 14:31:44 |
Przemyslaw Lal |
description |
We found over 8.5k leftover systemd scope units for the kubectl snap on each of our Kubernetes master nodes. They drive the systemd and /sbin/init processes to 100% CPU usage spikes, hosing the entire cluster.
$ sudo systemctl list-units --type scope | grep snap | wc -l
8643
Typical entries looks like this:
snap.kubectl.kubectl.001b8133-30f5-42c3-90e4-39da4229cbef.scope loaded active running snap.kubectl.kubectl.001b8133-30f5-42c3-90e4-39da4229cbef.scope
snap.kubectl.kubectl.001e4ffa-17c6-49ea-8fd2-1d7d07fb0e2f.scope loaded active running snap.kubectl.kubectl.001e4ffa-17c6-49ea-8fd2-1d7d07fb0e2f.scope
snap.kubectl.kubectl.0022dadb-f7b5-456f-ac51-98cfcd128035.scope loaded active running snap.kubectl.kubectl.0022dadb-f7b5-456f-ac51-98cfcd128035.scope
snap.kubectl.kubectl.0026e64c-8794-47d9-b58c-a05759814dbd.scope loaded active running snap.kubectl.kubectl.0026e64c-8794-47d9-b58c-a05759814dbd.scope
snap.kubectl.kubectl.002d028f-b30c-4ce7-b6b9-4c72ef74f308.scope loaded active running snap.kubectl.kubectl.002d028f-b30c-4ce7-b6b9-4c72ef74f308.scope
snap.kubectl.kubectl.003584be-b585-48b9-8385-fca6a642433a.scope loaded active running snap.kubectl.kubectl.003584be-b585-48b9-8385-fca6a642433a.scope
snap.kubectl.kubectl.0042ab28-ac94-476b-877d-9cde2d5e2312.scope loaded active running snap.kubectl.kubectl.0042ab28-ac94-476b-877d-9cde2d5e2312.scope
snap.kubectl.kubectl.00537acd-ce13-4a9c-a70d-059d7d1088dc.scope loaded active running snap.kubectl.kubectl.00537acd-ce13-4a9c-a70d-059d7d1088dc.scope
Please note that all of them are in the active/running state.
After manually stopping them with this one-liner:
$ sudo systemctl list-units --type scope | grep kubectl | awk '{print $1}' | xargs sudo systemctl stop
the number goes down to the expected value:
$ sudo systemctl list-units --type scope | grep snap | wc -l
11
and the system becomes much snappier again. The increased load caused by this issue leads to transient communication failures between the API servers and the kubelets, resulting in errors similar to [0]. We then end up restarting the kubelets, which is the only way to restore connectivity between them and the API servers.
Additionally, we see a lot of similarities with bug [1], reported for etcdctl. Both kubectl and the etcdctl from that bug run as snaps and leave behind thousands of systemd scope units, slowing down the system.
Versions:
kubernetes-master charm: 1.18.15 charm revision: 895
Ubuntu: 18.04.5 LTS
kubectl snap: 1.18.15 1.18/stable
$ snap --version
snap 2.50
snapd 2.50
series 16
ubuntu 18.04
kernel 5.4.0-1046-azure
[0] https://github.com/kubernetes/kubernetes/issues/87615
[1] https://bugs.launchpad.net/charm-etcd/+bug/1926185 |
We found over 8.5k leftover systemd scope units for the kubectl snap on each of our Kubernetes master nodes. They drive the systemd and /sbin/init processes to 100% CPU usage spikes, hosing the entire cluster.
$ sudo systemctl list-units --type scope | grep snap | wc -l
8643
Typical entries look like these:
snap.kubectl.kubectl.001b8133-30f5-42c3-90e4-39da4229cbef.scope loaded active running snap.kubectl.kubectl.001b8133-30f5-42c3-90e4-39da4229cbef.scope
snap.kubectl.kubectl.001e4ffa-17c6-49ea-8fd2-1d7d07fb0e2f.scope loaded active running snap.kubectl.kubectl.001e4ffa-17c6-49ea-8fd2-1d7d07fb0e2f.scope
snap.kubectl.kubectl.0022dadb-f7b5-456f-ac51-98cfcd128035.scope loaded active running snap.kubectl.kubectl.0022dadb-f7b5-456f-ac51-98cfcd128035.scope
snap.kubectl.kubectl.0026e64c-8794-47d9-b58c-a05759814dbd.scope loaded active running snap.kubectl.kubectl.0026e64c-8794-47d9-b58c-a05759814dbd.scope
snap.kubectl.kubectl.002d028f-b30c-4ce7-b6b9-4c72ef74f308.scope loaded active running snap.kubectl.kubectl.002d028f-b30c-4ce7-b6b9-4c72ef74f308.scope
snap.kubectl.kubectl.003584be-b585-48b9-8385-fca6a642433a.scope loaded active running snap.kubectl.kubectl.003584be-b585-48b9-8385-fca6a642433a.scope
snap.kubectl.kubectl.0042ab28-ac94-476b-877d-9cde2d5e2312.scope loaded active running snap.kubectl.kubectl.0042ab28-ac94-476b-877d-9cde2d5e2312.scope
snap.kubectl.kubectl.00537acd-ce13-4a9c-a70d-059d7d1088dc.scope loaded active running snap.kubectl.kubectl.00537acd-ce13-4a9c-a70d-059d7d1088dc.scope
Please note that all of them are in the active/running state.
After manually stopping them with this one-liner:
$ sudo systemctl list-units --type scope | grep kubectl | awk '{print $1}' | xargs sudo systemctl stop
the number goes down to the expected value:
$ sudo systemctl list-units --type scope | grep snap | wc -l
11
and the system becomes much snappier again. The increased load caused by this issue leads to transient communication failures between the API servers and the kubelets, resulting in errors similar to [0]. We then end up restarting the kubelets, which is the only way to restore connectivity between them and the API servers.
Additionally, we see a lot of similarities with bug [1], reported for etcdctl. Both kubectl and the etcdctl from that bug run as snaps and leave behind thousands of systemd scope units, slowing down the system.
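The counting and cleanup steps above could be written as two small shell helpers. This is only a sketch and not part of the original report: the function names filter_kubectl_scopes and count_scopes_per_snap are made up for illustration, and the code assumes the column layout shown in the entries above (unit name, load state, active state, sub state). Matching on the full unit-name pattern, rather than grepping for "kubectl" anywhere on the line, should reduce the risk of stopping an unrelated unit.

```shell
#!/bin/sh
# Hypothetical helpers; both read `systemctl list-units` output on stdin.

# Print the names of kubectl snap scope units that are active/running.
filter_kubectl_scopes() {
    awk '$1 ~ /^snap\.kubectl\.kubectl\..*\.scope$/ && $3 == "active" && $4 == "running" { print $1 }'
}

# Print "<count> <snap>" for every snap that owns scope units, largest first.
count_scopes_per_snap() {
    awk '$1 ~ /^snap\..*\.scope$/ { split($1, part, "[.]"); count[part[2]]++ }
         END { for (snap in count) print count[snap], snap }' | sort -rn
}

# Possible usage (requires systemd, so it is left commented out here):
#   systemctl list-units --type=scope --no-legend --plain | count_scopes_per_snap
#   systemctl list-units --type=scope --no-legend --plain | filter_kubectl_scopes | xargs -r sudo systemctl stop
```

The --no-legend and --plain options drop the header, footer and status markers, so the unit name stays in the first column. The -r option of GNU xargs skips the stop command when nothing matches.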
Versions:
kubernetes-master charm: 1.18.15 charm revision: 895
Ubuntu: 18.04.5 LTS
kubectl snap: 1.18.15 1.18/stable
$ snap --version
snap 2.50
snapd 2.50
series 16
ubuntu 18.04
kernel 5.4.0-1046-azure
[0] https://github.com/kubernetes/kubernetes/issues/87615
[1] https://bugs.launchpad.net/charm-etcd/+bug/1926185 |
|
2021-05-18 15:08:14 |
Garrett Neugent |
bug |
|
|
added subscriber Canonical IS BootStack |
2021-05-19 09:06:43 |
Calvin Hartwell |
bug |
|
|
added subscriber Canonical Field High |
2021-05-19 09:06:52 |
Calvin Hartwell |
removed subscriber Canonical Field High |
|
|
|
2021-05-19 09:07:18 |
Calvin Hartwell |
bug |
|
|
added subscriber Canonical Field High |
2021-05-19 14:46:48 |
George Kraft |
charm-kubernetes-master: status |
New |
Incomplete |
|
2021-05-20 15:16:40 |
George Kraft |
charm-kubernetes-master: status |
Incomplete |
New |
|
2021-05-20 21:28:17 |
George Kraft |
charm-kubernetes-master: status |
New |
Incomplete |
|
2021-05-21 13:22:16 |
George Kraft |
charm-kubernetes-master: status |
Incomplete |
New |
|
2021-05-24 14:22:27 |
George Kraft |
charm-kubernetes-master: status |
New |
Incomplete |
|
2021-05-24 14:22:39 |
George Kraft |
bug task added |
|
snapd |
|
2021-07-22 20:23:10 |
Ian Johnson |
attachment added |
|
zoom hook sysfs tarball https://bugs.launchpad.net/charm-kubernetes-master/+bug/1928806/+attachment/5512966/+files/zoom-hook-stuck-cgroup.tgz |
|
2021-07-23 11:40:43 |
Ian Johnson |
marked as duplicate |
|
1934147 |
|
2021-07-26 01:53:33 |
Dominique Poulain |
bug |
|
|
added subscriber Dominique Poulain |
2022-01-03 21:43:55 |
Michael Iatrou |
bug |
|
|
added subscriber Michael Iatrou |