2023-09-08 21:10:57 |
Gustavo Sanchez |
description |
Upgrading the machine for a kubernetes-worker machine it got stuck on pre-series-upgrade hook
# ============================== Logs
# =============== juju upgrade-machine 11 prepare jammy (stuck)
16:26:17 INFO cmd upgrademachine.go:479 machine-11 validation of upgrade series from "focal" to "jammy"
advanced-policy-routing/15 pre-series-upgrade hook running
advanced-policy-routing/15 pre-series-upgrade completed
ntp/9 pre-series-upgrade hook running
machine-11 started upgrade series from "focal" to "jammy"
16:26:17 INFO cmd upgrademachine.go:479 ntp/9 pre-series-upgrade completed
logrotated/15 pre-series-upgrade hook running
16:26:18 INFO cmd upgrademachine.go:479 logrotated/15 pre-series-upgrade completed
16:26:18 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade hook running
16:26:21 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade completed
16:26:21 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade hook running
16:26:23 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade completed
16:26:23 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade hook running
16:26:24 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade completed
16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook running
16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook not found, skipping
16:26:24 INFO cmd upgrademachine.go:479 kubernetes-worker/5 pre-series-upgrade hook running
# =============== juju status
Every 2.0s: juju status --color | egrep 'executing|maint|block|error' infra1.cloud.rc.uab.edu: Fri Sep 8 16:57:[0/572]
containerd 1.6.8 blocked 9 containerd 1.28/stable 69 no Series upgrade in progress
kubernetes-worker/5 active executing 11 138.26.125.127 80/tcp,443/tcp (pre-series-upgrade) Kubernetes worker running (without gpu support).
containerd/9 blocked idle 138.26.125.127 Series upgrade in progress
# =============== juju ssh 11 cat /var/log/juju/machine-11.log
# ..
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "kubernetes-worker"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "nrpe"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "ntp"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "containerd"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "filebeat"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "logrotated"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "telegraf"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "advanced-policy-routing"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "calico"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:39:19 ERROR juju.worker.metrics.sender sender.go:80 could not remove batch "33d42685-6d34-4aeb-8f9b-8b32b60269ce" from spool: remove /var/lib/juju/metricspool/33d42685-6
d34-4aeb-8f9b-8b32b60269ce: no such file or directory
# ================ juju ssh kubernetes-worker/5 cat /var/log/juju/unit-kubernetes-worker-5.log
https://paste.ubuntu.com/p/Cy4QHfVrqP/
# =============== juju show-status-log kubernetes-worker/5
# ..
08 Sep 2023 16:23:37Z workload active Kubernetes worker running (without gpu support).
08 Sep 2023 16:26:24Z juju-unit executing running pre-series-upgrade hook
$ date
Fri 08 Sep 2023 04:54:13 PM UTC
# ============================== Workaround
The kubernetes-worker unit was stuck running this hook
https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/adf796108dca7e0de23befc6dfe593e5ae26db3f/reactive/kubernetes_worker.py#L195-L210
It was stuck draining pods from the worker
So, I propose we add a message or log before starting the draining
This way from Juju POV (ideally from juju status, but from juju debug-log as well) we know what is happening
Specially for cases where pods take time to evict
# ---
Here how to force them out if it is stuck
kubectl drain <k8s-node-name> --ignore-daemonsets --delete-emptydir-data --force --grace-period 0 |
Upgrading the machine for a kubernetes-worker machine it got stuck on pre-series-upgrade hook
# ============================== Logs
# =============== juju upgrade-machine 11 prepare jammy (stuck)
16:26:17 INFO cmd upgrademachine.go:479 machine-11 validation of upgrade series from "focal" to "jammy"
advanced-policy-routing/15 pre-series-upgrade hook running
advanced-policy-routing/15 pre-series-upgrade completed
ntp/9 pre-series-upgrade hook running
machine-11 started upgrade series from "focal" to "jammy"
16:26:17 INFO cmd upgrademachine.go:479 ntp/9 pre-series-upgrade completed
logrotated/15 pre-series-upgrade hook running
16:26:18 INFO cmd upgrademachine.go:479 logrotated/15 pre-series-upgrade completed
16:26:18 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade hook running
16:26:21 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade completed
16:26:21 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade hook running
16:26:23 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade completed
16:26:23 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade hook running
16:26:24 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade completed
16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook running
16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook not found, skipping
16:26:24 INFO cmd upgrademachine.go:479 kubernetes-worker/5 pre-series-upgrade hook running
# =============== juju status
Every 2.0s: juju status --color | egrep 'executing|maint|block|error' infra1.cloud.rc.uab.edu: Fri Sep 8 16:57:[0/572]
containerd 1.6.8 blocked 9 containerd 1.28/stable 69 no Series upgrade in progress
kubernetes-worker/5 active executing 11 138.26.125.127 80/tcp,443/tcp (pre-series-upgrade) Kubernetes worker running (without gpu support).
containerd/9 blocked idle 138.26.125.127 Series upgrade in progress
# =============== juju ssh 11 cat /var/log/juju/machine-11.log
# ..
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "kubernetes-worker"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "nrpe"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "ntp"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "containerd"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "filebeat"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "logrotated"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "telegraf"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "advanced-policy-routing"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "calico"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started"
2023-09-08 16:39:19 ERROR juju.worker.metrics.sender sender.go:80 could not remove batch "33d42685-6d34-4aeb-8f9b-8b32b60269ce" from spool: remove /var/lib/juju/metricspool/33d42685-6
d34-4aeb-8f9b-8b32b60269ce: no such file or directory
# ================ juju ssh kubernetes-worker/5 cat /var/log/juju/unit-kubernetes-worker-5.log
https://paste.ubuntu.com/p/Cy4QHfVrqP/
# =============== juju show-status-log kubernetes-worker/5
# ..
08 Sep 2023 16:23:37Z workload active Kubernetes worker running (without gpu support).
08 Sep 2023 16:26:24Z juju-unit executing running pre-series-upgrade hook
$ date
Fri 08 Sep 2023 05:44:13 PM UTC # About 1h+ after executing started
# ============================== Workaround
The kubernetes-worker unit was stuck running this hook
https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/adf796108dca7e0de23befc6dfe593e5ae26db3f/reactive/kubernetes_worker.py#L195-L210
It was stuck draining pods from the worker
So, I propose we add a message or log before starting the draining
This way from Juju POV (ideally from juju status, but from juju debug-log as well) we know what is happening
Specially for cases where pods take time to evict
# ---
Here how to force them out if it is stuck
kubectl drain <k8s-node-name> --ignore-daemonsets --delete-emptydir-data --force --grace-period 0 |
|