2021-04-13 13:25:53 |
Michele Baldessari |
description |
I had left a deployment from a few weeks ago untouched for a while; when I came back to it today it was totally unusable. Since I could not run any commands, I took a crash dump via virsh (of the UC only; the OC nodes seemed to be in a similar state, although I did not check them in detail):
virsh dump --memory-only --file /tmp/undercloud-dump.crash --live undercloud-0
I loaded the vmcore into the crash utility [1]:
crash kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux undercloud-dump.crash
From it I could conclude the following (the UC has 16 GB of RAM):
A) Load was sky-high and there was essentially no free memory
KERNEL: kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux
DUMPFILE: undercloud-dump.crash
CPUS: 4
DATE: Tue Apr 13 05:32:44 2021
UPTIME: 41 days, 15:28:18
LOAD AVERAGE: 31.78, 31.95, 32.38
TASKS: 4242
NODENAME: undercloud-0.bgp.ftw
RELEASE: 4.18.0-240.10.1.el8_3.x86_64
VERSION: #1 SMP Mon Jan 18 17:05:51 UTC 2021
MACHINE: x86_64 (2194 Mhz)
MEMORY: 16 GB
PANIC: ""
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  4052899      15.5 GB         ----
         FREE    35350     138.1 MB    0% of TOTAL MEM
         USED  4017549      15.3 GB   99% of TOTAL MEM
       SHARED   203722     795.8 MB    5% of TOTAL MEM
      BUFFERS        0            0    0% of TOTAL MEM
       CACHED   533131         2 GB   13% of TOTAL MEM
         SLAB  1360379       5.2 GB   33% of TOTAL MEM
   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE
   TOTAL SWAP        0            0         ----
    SWAP USED        0            0    0% of TOTAL SWAP
    SWAP FREE        0            0    0% of TOTAL SWAP
 COMMIT LIMIT  2026449       7.7 GB         ----
    COMMITTED 32410872     123.6 GB  1599% of TOTAL LIMIT
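The kmem -i figures can be cross-checked by hand. This is only a sketch, assuming 4 KiB pages (the usual x86_64 page size); the function names pages_to_gib and commit_percent are made up for illustration and are not part of crash:

```shell
# Convert a page count from `kmem -i` to GiB, assuming 4 KiB pages.
pages_to_gib() {
    awk -v pages="$1" 'BEGIN { printf "%.1f\n", pages * 4096 / (1024 * 1024 * 1024) }'
}

# Committed memory as a whole percentage of the commit limit (both in pages).
commit_percent() {
    awk -v committed="$1" -v limit="$2" 'BEGIN { printf "%d\n", committed * 100 / limit }'
}

pages_to_gib 4052899             # TOTAL MEM  -> 15.5
pages_to_gib 32410872            # COMMITTED  -> 123.6
commit_percent 32410872 2026449  # COMMITTED vs COMMIT LIMIT -> 1599
```

The commit limit of 2026449 pages is half of the 4052899 total pages, which is what I would expect with no swap and the default vm.overcommit_ratio of 50, so the box had promised roughly 16 times more memory than its limit.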
B) Most memory was used up by an incredibly large number of podman processes
crash> ps -u -G|tail -n +2|cut -b2- | sort -n -k8 | awk '{print $8/1048576" "$9}' | awk '{ arr[$2]+=$1 } END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' | sort -n -k2|tail -n10
iscsid 0.0118484
bash 0.0243454
sshd 0.063778
httpd 0.0780067
run-parts 0.0805359
logger 0.141033
podman 0.202381
crond 1.26209
(ontainer) 3.1892
(podman) 17.2151
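For readability, the one-liner above can be unrolled into a small function. This is a sketch under my reading of the crash ps output: columns PID PPID CPU TASK ST %MEM VSZ RSS COMM, RSS in KiB, and an optional leading ">" on tasks that are active on a CPU. If that holds, the totals above are in GiB. The function name rss_by_comm and the file name ps-u-G.txt are hypothetical:

```shell
# Sum RSS per command name from saved `ps -u -G` output read on stdin.
# Assumes columns PID PPID CPU TASK ST %MEM VSZ RSS COMM with RSS in KiB.
rss_by_comm() {
    awk '
        NR == 1 { next }                # skip the header line
        { sub(/^>/, " ") }              # drop the active-task marker; $0 is re-split
        { gib[$9] += $8 / 1048576 }     # KiB -> GiB, keyed by command name
        END { for (name in gib) printf "%s\t%.2f\n", name, gib[name] }
    ' | sort -k2,2n                     # smallest total first, largest last
}

# Usage, with ps-u-G.txt being a hypothetical file holding the saved output:
#   rss_by_comm < ps-u-G.txt | tail -n 10
```

Note that summing RSS counts pages shared between processes once per process, which is how the '(podman)' total can come out larger than the 15.5 GB of RAM in the machine. The ranking is still meaningful, the absolute numbers less so.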
crash> ps -u -G |wc -l
3775
crash> ps -u -G |grep podman |wc -l
2555
C) There is a truckload of processes named '(podman)', with the parentheses, whose parent PID is 1.
crash> ps -u -G |grep "(podman)" |wc -l
2547
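The grep above only counts by name. To also check the parent PID, something like the following could be used on saved ps -u -G output. It is a sketch that assumes the columns are PID PPID CPU TASK ST %MEM VSZ RSS COMM, with an optional leading ">" on active tasks; count_init_children and ps-u-G.txt are hypothetical names:

```shell
# Count tasks with the given command name whose parent PID is 1.
# Reads saved `ps -u -G` output on stdin; assumes PPID is column 2
# and COMM is column 9 once the optional ">" marker is removed.
count_init_children() {
    awk -v want="$1" '
        NR == 1 { next }                  # skip the header line
        { sub(/^>/, " ") }                # drop the active-task marker
        $2 == 1 && $9 == want { n++ }     # exact name match, parented by PID 1
        END { print n + 0 }               # print 0 rather than an empty line
    '
}

# Usage, with ps-u-G.txt being a hypothetical file holding the saved output:
#   count_init_children '(podman)' < ps-u-G.txt
```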
D) On a normal, freshly deployed, working undercloud there are basically *no* podman processes, because the container processes are actually called conmon. I took a crash dump of a working undercloud and saw:
crash> ps -u -G |grep -e podman |wc -l
0
crash> ps -u -G |grep -e conmon |wc -l
23
which is a lot more sensible.
[1] https://crash-utility.github.io/ |
|