undercloud (and overcloud nodes) in master became unresponsive after a couple of weeks
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo | Fix Released | Critical | Michele Baldessari | | |
Bug Description
I had left a deployment from a few weeks ago untouched for a while; when I came back to it today it was totally unusable. Since I could not run any commands, I took a crash dump via virsh (of the UC only; the OC nodes seemed to be in a similar state, although I did not check them in detail):
virsh dump --memory-only --file /tmp/undercloud
I loaded up the vmcore in the crash utility [1]
crash kernel/
And could conclude the following (UC has 16GB of RAM):
A) Load was sky-high and there was virtually no free memory
KERNEL: kernel/
DUMPFILE: undercloud-
CPUS: 4
DATE: Tue Apr 13 05:32:44 2021
UPTIME: 41 days, 15:28:18
LOAD AVERAGE: 31.78, 31.95, 32.38
TASKS: 4242
NODENAME: undercloud-
RELEASE: 4.18.0-
VERSION: #1 SMP Mon Jan 18 17:05:51 UTC 2021
MACHINE: x86_64 (2194 Mhz)
MEMORY: 16 GB
PANIC: ""
crash> kmem -i
TOTAL MEM 4052899 15.5 GB ----
FREE 35350 138.1 MB 0% of TOTAL MEM
USED 4017549 15.3 GB 99% of TOTAL MEM
SHARED 203722 795.8 MB 5% of TOTAL MEM
BUFFERS 0 0 0% of TOTAL MEM
CACHED 533131 2 GB 13% of TOTAL MEM
SLAB 1360379 5.2 GB 33% of TOTAL MEM
TOTAL HUGE 0 0 ----
HUGE FREE 0 0 0% of TOTAL HUGE
TOTAL SWAP 0 0 ----
SWAP USED 0 0 0% of TOTAL SWAP
SWAP FREE 0 0 0% of TOTAL SWAP
COMMIT LIMIT 2026449 7.7 GB ----
COMMITTED 32410872 123.6 GB 1599% of TOTAL LIMIT
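The `kmem -i` figures above pair page counts with human-readable sizes. As a rough cross-check, assuming 4 KiB pages and the kernel default `vm.overcommit_ratio` of 50 (neither is stated in this report), the commit limit with no swap is half of total memory, and the committed figure works out to roughly 16 times that limit. The function names are illustrative only.

```python
PAGE_SIZE = 4096  # bytes; assumption: standard x86_64 page size


def pages_to_gib(pages: int) -> float:
    """Convert a page count to GiB."""
    return pages * PAGE_SIZE / 2**30


def commit_limit_pages(total_pages: int, swap_pages: int = 0,
                       overcommit_ratio: int = 50) -> int:
    """Commit limit as a share of RAM plus swap (huge pages ignored; none here)."""
    return total_pages * overcommit_ratio // 100 + swap_pages


# Page counts copied from the kmem -i output above.
TOTAL_MEM = 4052899
COMMIT_LIMIT = 2026449
COMMITTED = 32410872

if __name__ == "__main__":
    print(f"total memory:  {pages_to_gib(TOTAL_MEM):.1f} GiB")
    print(f"commit limit:  {commit_limit_pages(TOTAL_MEM)} pages")
    print(f"committed:     {pages_to_gib(COMMITTED):.1f} GiB")
    print(f"committed/limit: {int(100 * COMMITTED / COMMIT_LIMIT)}%")
```

Under those assumptions the computed limit matches the reported COMMIT LIMIT exactly, which suggests the overcommit settings were left at their defaults.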
B) Most memory was used up by an incredibly large number of podman processes
crash> ps -u -G|tail -n +2|cut -b2- | sort -n -k8 | awk '{print $8/1048576" "$9}' | awk '{ arr[$2]+=$1 } END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' | sort -n -k2|tail -n10
iscsid 0.0118484
bash 0.0243454
sshd 0.063778
httpd 0.0780067
run-parts 0.0805359
logger 0.141033
podman 0.202381
crond 1.26209
(ontainer) 3.1892
(podman) 17.2151
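The shell pipeline above sums the eighth column of `ps -u -G` per command name and divides by 1048576. A rough Python equivalent follows, assuming the usual crash `ps` column order (PID, PPID, CPU, TASK, ST, %MEM, VSZ, RSS, COMM) with RSS in KiB, so that the totals come out in GiB. The sample rows are made up for illustration and are not taken from the dump.

```python
from collections import defaultdict

KIB_PER_GIB = 1048576


def rss_gib_by_command(ps_lines):
    """Sum RSS (assumed KiB, column 8) per command name (column 9), in GiB."""
    totals = defaultdict(float)
    for line in ps_lines:
        # crash marks active tasks with a leading '>'; drop it like `cut -b2-`.
        fields = line.lstrip("> ").split()
        if len(fields) < 9 or not fields[0].isdigit():
            continue  # header or malformed line
        totals[fields[8]] += int(fields[7]) / KIB_PER_GIB
    return dict(totals)


# Hypothetical rows in `ps -u -G` layout (not real data from this bug).
SAMPLE = [
    "   PID    PPID  CPU       TASK        ST  %MEM     VSZ      RSS  COMM",
    ">  1001      1   0  ffff000000000001  RU   6.5  2097152  1048576  (podman)",
    "   1002      1   1  ffff000000000002  IN   3.2  2097152   524288  (podman)",
    "   1003      1   2  ffff000000000003  IN  12.9  4194304  2097152  conmon",
]

if __name__ == "__main__":
    for comm, gib in sorted(rss_gib_by_command(SAMPLE).items(), key=lambda kv: kv[1]):
        print(f"{comm}\t{gib}")
```

Summed RSS counts pages shared between processes once per process, which is likely why the '(podman)' total can exceed the 15.5 GB of physical memory.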
crash> ps -u -G |wc -l
3775
crash> ps -u -G |grep podman |wc -l
2555
C) There is a truckload of processes called '(podman)', with parentheses, whose parent PID is 1.
crash> ps -u -G |grep "(podman)" |wc -l
2547
D) On a normal, freshly deployed and working undercloud there are basically *no* podman processes, because they are actually called conmon. I took a crash dump of a working undercloud and saw:
crash> ps -u -G |grep -e podman |wc -l
0
crash> ps -u -G |grep -e conmon |wc -l
23
which is a lot more sensible.
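The counts in B), C) and D) come from `grep | wc -l`, where `grep podman` matches both "podman" and "(podman)" (2555 matches in total, 2547 of them parenthesised, leaving 8 plain ones). A minimal sketch of the same tally with exact name matching and a parent-PID filter, assuming the same `ps -u -G` column layout (PID, PPID, ..., COMM in column 9); the sample rows are hypothetical.

```python
from collections import Counter


def parse_ps(ps_lines):
    """Yield (pid, ppid, comm) tuples from crash `ps` style lines."""
    for line in ps_lines:
        fields = line.lstrip("> ").split()
        if len(fields) >= 9 and fields[0].isdigit():
            yield int(fields[0]), int(fields[1]), fields[8]


def count_by_command(ps_lines):
    """Count tasks per exact command name."""
    return Counter(comm for _, _, comm in parse_ps(ps_lines))


def children_of_init(ps_lines, comm):
    """PIDs of tasks named `comm` whose parent PID is 1."""
    return [pid for pid, ppid, c in parse_ps(ps_lines) if c == comm and ppid == 1]


# Hypothetical rows (not real data from this bug).
SAMPLE = [
    "   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM",
    "   2001      1   0  ffff000000000011  IN   0.1   10000   2000  (podman)",
    "   2002      1   1  ffff000000000012  IN   0.1   10000   2000  (podman)",
    "   2003   1500   2  ffff000000000013  IN   0.1   10000   2000  podman",
    "   2004      1   3  ffff000000000014  IN   0.1   10000   2000  conmon",
]

if __name__ == "__main__":
    print(count_by_command(SAMPLE))
    print(children_of_init(SAMPLE, "(podman)"))
```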
description: | updated |
Changed in tripleo: | |
importance: | High → Critical |
Changed in tripleo: | |
milestone: | wallaby-rc1 → xena-1 |
Changed in tripleo: | |
status: | Triaged → In Progress |
tags: | added: alert |
Changed in tripleo: | |
assignee: | nobody → Michele Baldessari (michele) |
Interestingly, if we inspect a normal "podman" process we see:
PID: 846098 TASK: ffff9c20fd8ddc40 CPU: 1 COMMAND: "podman"
ARG: /usr/bin/podman --root /var/lib/containers/storage --runroot /var/run/containers/storage --log-level error --cgroup-manager systemd --tmpdir /var/run/libpod --runtime runc --storage-driver overlay --storage-opt overlay.mountopt=nodev,metacopy=on --events-backend file container cleanup 881e8ef19bb5e57de90c1fb2a784f821934707b1075c44452e2e355b9df3aba7
ENV: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
_OCI_SYNCPIPE=3
_OCI_STARTPIPE=4
XDG_RUNTIME_DIR=
_CONTAINERS_USERNS_CONFIGURED=
_CONTAINERS_ROOTLESS_UID=
But those '(podman)' processes do not show any arguments or env variables:
PID: 846103 TASK: ffff9c20f8aa1ec0 CPU: 1 COMMAND: "(podman)"
ARG: (podman)
ENV: HOME=/
TERM=vt220
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-240.10.1.el8_3.x86_64
crashkernel=auto
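The distinguishing feature in the two records is that the '(podman)' task has an ARG identical to its command name, while a normal podman task has a full command line. A minimal sketch of flagging such tasks, assuming records in the PID/ARG/ENV layout shown here, with environment variables continuing on the lines after `ENV:`. The helper names are hypothetical and the sample is abbreviated from the records in this report.

```python
import re

HEADER = re.compile(
    r'^PID:\s*(\d+)\s+TASK:\s*(\S+)\s+CPU:\s*(\d+)\s+COMMAND:\s*"(.*)"\s*$'
)


def parse_ps_args(text):
    """Parse PID/ARG/ENV records into dicts with pid, command, arg, env."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            current = {"pid": int(match.group(1)), "command": match.group(4),
                       "arg": "", "env": []}
            records.append(current)
        elif current is None:
            continue  # text before the first record
        elif line.startswith("ARG:"):
            current["arg"] = line[4:].strip()
        elif line.startswith("ENV:"):
            if line[4:].strip():
                current["env"].append(line[4:].strip())
        else:
            current["env"].append(line)  # continuation of the ENV block
    return records


def has_no_real_argv(record):
    """True for a parenthesised task whose ARG is just its own name."""
    return record["command"].startswith("(") and record["arg"] == record["command"]


SAMPLE = """\
PID: 846098 TASK: ffff9c20fd8ddc40 CPU: 1 COMMAND: "podman"
ARG: /usr/bin/podman --events-backend file container cleanup 881e8ef19bb5
ENV: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
     _OCI_SYNCPIPE=3

PID: 846103 TASK: ffff9c20f8aa1ec0 CPU: 1 COMMAND: "(podman)"
ARG: (podman)
ENV: HOME=/
     TERM=vt220
"""

if __name__ == "__main__":
    for rec in parse_ps_args(SAMPLE):
        print(rec["pid"], rec["command"], has_no_real_argv(rec))
```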