Activity log for bug #2034973

Date Who What changed Old value New value Message
2023-09-08 21:05:56 Gustavo Sanchez bug added bug
2023-09-08 21:10:57 Gustavo Sanchez description Upgrading the machine for a kubernetes-worker machine it got stuck on pre-series-upgrade hook # ============================== Logs # =============== juju upgrade-machine 11 prepare jammy (stuck) 16:26:17 INFO cmd upgrademachine.go:479 machine-11 validation of upgrade series from "focal" to "jammy" advanced-policy-routing/15 pre-series-upgrade hook running advanced-policy-routing/15 pre-series-upgrade completed ntp/9 pre-series-upgrade hook running machine-11 started upgrade series from "focal" to "jammy" 16:26:17 INFO cmd upgrademachine.go:479 ntp/9 pre-series-upgrade completed logrotated/15 pre-series-upgrade hook running 16:26:18 INFO cmd upgrademachine.go:479 logrotated/15 pre-series-upgrade completed 16:26:18 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade hook running 16:26:21 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade completed 16:26:21 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade hook running 16:26:23 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade completed 16:26:23 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade hook running 16:26:24 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade completed 16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook running 16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook not found, skipping 16:26:24 INFO cmd upgrademachine.go:479 kubernetes-worker/5 pre-series-upgrade hook running # =============== juju status Every 2.0s: juju status --color | egrep 'executing|maint|block|error' infra1.cloud.rc.uab.edu: Fri Sep 8 16:57:[0/572] containerd 1.6.8 blocked 9 containerd 1.28/stable 69 no Series upgrade in progress kubernetes-worker/5 active executing 11 138.26.125.127 80/tcp,443/tcp (pre-series-upgrade) Kubernetes worker running (without gpu support). containerd/9 blocked idle 138.26.125.127 Series upgrade in progress # =============== juju ssh 11 cat /var/log/juju/machine-11.log # .. 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "kubernetes-worker" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "nrpe" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "ntp" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "containerd" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "filebeat" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "logrotated" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "telegraf" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "advanced-policy-routing" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "calico" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:39:19 ERROR juju.worker.metrics.sender sender.go:80 could not remove batch "33d42685-6d34-4aeb-8f9b-8b32b60269ce" from spool: remove /var/lib/juju/metricspool/33d42685-6 d34-4aeb-8f9b-8b32b60269ce: no such file or directory # ================ juju ssh kubernetes-worker/5 cat /var/log/juju/unit-kubernetes-worker-5.log https://paste.ubuntu.com/p/Cy4QHfVrqP/ # =============== juju show-status-log kubernetes-worker/5 # .. 08 Sep 2023 16:23:37Z workload active Kubernetes worker running (without gpu support). 08 Sep 2023 16:26:24Z juju-unit executing running pre-series-upgrade hook $ date Fri 08 Sep 2023 04:54:13 PM UTC # ============================== Workaround The kubernetes-worker unit was stuck running this hook https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/adf796108dca7e0de23befc6dfe593e5ae26db3f/reactive/kubernetes_worker.py#L195-L210 It was stuck draining pods from the worker So, I propose we add a message or log before starting the draining This way from Juju POV (ideally from juju status, but from juju debug-log as well) we know what is happening Specially for cases where pods take time to evict # --- Here how to force them out if it is stuck kubectl drain <k8s-node-name> --ignore-daemonsets --delete-emptydir-data --force --grace-period 0 Upgrading the machine for a kubernetes-worker machine it got stuck on pre-series-upgrade hook # ============================== Logs # =============== juju upgrade-machine 11 prepare jammy (stuck) 16:26:17 INFO cmd upgrademachine.go:479 machine-11 validation of upgrade series from "focal" to "jammy" advanced-policy-routing/15 pre-series-upgrade hook running advanced-policy-routing/15 pre-series-upgrade completed ntp/9 pre-series-upgrade hook running machine-11 started upgrade series from "focal" to "jammy" 16:26:17 INFO cmd upgrademachine.go:479 ntp/9 pre-series-upgrade completed logrotated/15 pre-series-upgrade hook running 16:26:18 INFO cmd upgrademachine.go:479 logrotated/15 pre-series-upgrade completed 16:26:18 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade hook running 16:26:21 INFO cmd upgrademachine.go:479 telegraf/16 pre-series-upgrade completed 16:26:21 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade hook running 16:26:23 INFO cmd upgrademachine.go:479 filebeat/16 pre-series-upgrade completed 16:26:23 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade hook running 16:26:24 INFO cmd upgrademachine.go:479 containerd/9 pre-series-upgrade completed 16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook running 16:26:24 INFO cmd upgrademachine.go:479 nrpe/9 pre-series-upgrade hook not found, skipping 16:26:24 INFO cmd upgrademachine.go:479 kubernetes-worker/5 pre-series-upgrade hook running # =============== juju status Every 2.0s: juju status --color | egrep 'executing|maint|block|error' infra1.cloud.rc.uab.edu: Fri Sep 8 16:57:[0/572] containerd 1.6.8 blocked 9 containerd 1.28/stable 69 no Series upgrade in progress kubernetes-worker/5 active executing 11 138.26.125.127 80/tcp,443/tcp (pre-series-upgrade) Kubernetes worker running (without gpu support).   containerd/9 blocked idle 138.26.125.127 Series upgrade in progress # =============== juju ssh 11 cat /var/log/juju/machine-11.log # .. 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "kubernetes-worker" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "nrpe" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "ntp" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "containerd" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "filebeat" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "logrotated" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "telegraf" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "advanced-policy-routing" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:364 unpin leader for application "calico" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:17 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:18 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:21 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:23 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:26:24 INFO juju.worker.upgradeseries worker.go:170 machine series upgrade status is "prepare started" 2023-09-08 16:39:19 ERROR juju.worker.metrics.sender sender.go:80 could not remove batch "33d42685-6d34-4aeb-8f9b-8b32b60269ce" from spool: remove /var/lib/juju/metricspool/33d42685-6 d34-4aeb-8f9b-8b32b60269ce: no such file or directory # ================ juju ssh kubernetes-worker/5 cat /var/log/juju/unit-kubernetes-worker-5.log https://paste.ubuntu.com/p/Cy4QHfVrqP/ # =============== juju show-status-log kubernetes-worker/5 # .. 08 Sep 2023 16:23:37Z workload active Kubernetes worker running (without gpu support). 08 Sep 2023 16:26:24Z juju-unit executing running pre-series-upgrade hook $ date Fri 08 Sep 2023 05:44:13 PM UTC # About 1h+ after executing started # ============================== Workaround The kubernetes-worker unit was stuck running this hook https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/adf796108dca7e0de23befc6dfe593e5ae26db3f/reactive/kubernetes_worker.py#L195-L210 It was stuck draining pods from the worker So, I propose we add a message or log before starting the draining This way from Juju POV (ideally from juju status, but from juju debug-log as well) we know what is happening Specially for cases where pods take time to evict # --- Here how to force them out if it is stuck kubectl drain <k8s-node-name> --ignore-daemonsets --delete-emptydir-data --force --grace-period 0
2023-09-09 03:15:24 Adam Dyess charm-kubernetes-worker: milestone 1.29
2023-09-09 03:15:52 Adam Dyess charm-kubernetes-worker: status New Triaged
2023-09-09 03:15:54 Adam Dyess charm-kubernetes-worker: importance Undecided Wishlist
2023-09-09 03:16:26 Adam Dyess bug task added charm-kubernetes-master
2023-09-09 03:16:31 Adam Dyess charm-kubernetes-master: status New Triaged
2023-09-09 03:16:33 Adam Dyess charm-kubernetes-master: milestone 1.29
2023-09-09 03:16:35 Adam Dyess charm-kubernetes-master: importance Undecided Wishlist