[1.22] Calico units on k8-master blocked indefinitely with "Waiting to retry disabling VXLAN TX checksumming"
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Calico Charm | Invalid | Undecided | Unassigned | | |
Kubernetes Control Plane Charm | Fix Released | Critical | Kevin W Monroe | | |
Kubernetes Worker Charm | Fix Released | Critical | Kevin W Monroe | | |
Bug Description
During the 1.22 release gate we are seeing the calico units on the k8-master stay blocked indefinitely with the status "Waiting to retry disabling VXLAN TX checksumming":
kubernetes-
calico/6 waiting idle 10.246.64.231 Waiting to retry disabling VXLAN TX checksumming
containerd/6 active idle 10.246.64.231 Container runtime available
hacluster-
kubernetes-master/1 active idle 2/lxd/1 10.246.64.236 6443/tcp Kubernetes master running.
calico/7 waiting idle 10.246.64.236 Waiting to retry disabling VXLAN TX checksumming
containerd/7 active idle 10.246.64.236 Container runtime available
hacluster-
kubernetes-master/2 active idle 4/lxd/1 10.246.64.237 6443/tcp Kubernetes master running.
calico/8 waiting idle 10.246.64.237 Waiting to retry disabling VXLAN TX checksumming
containerd/8 active idle 10.246.64.237 Container runtime available
hacluster-
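To check for this condition without reading the status table by eye, the machine-readable form of the status can be filtered for the blocking message. The following is a minimal sketch, assuming the layout emitted by `juju status --format=json` (applications, then units, then subordinates, each with a `workload-status` entry). The function name `find_stuck_units` and the sample data are illustrative only and are not taken from the failing run.

```python
import json

STUCK_MESSAGE = "Waiting to retry disabling VXLAN TX checksumming"


def find_stuck_units(status, message=STUCK_MESSAGE):
    """Return the sorted names of units whose workload message equals `message`.

    `status` is the dict obtained by parsing `juju status --format=json`
    (assumed layout: applications -> units -> subordinates).
    """
    stuck = []
    for app in status.get("applications", {}).values():
        # Subordinate applications carry no "units" key of their own;
        # their units are listed under the principal unit's "subordinates".
        for unit_name, unit in app.get("units", {}).items():
            candidates = {unit_name: unit}
            candidates.update(unit.get("subordinates", {}))
            for name, info in candidates.items():
                workload = info.get("workload-status", {})
                if workload.get("message") == message:
                    stuck.append(name)
    return sorted(stuck)


# Hypothetical, trimmed-down sample shaped like the status shown above.
SAMPLE_STATUS = json.loads("""
{
  "applications": {
    "kubernetes-master": {
      "units": {
        "kubernetes-master/1": {
          "workload-status": {"current": "active",
                              "message": "Kubernetes master running."},
          "subordinates": {
            "calico/7": {
              "workload-status": {
                "current": "waiting",
                "message": "Waiting to retry disabling VXLAN TX checksumming"}},
            "containerd/7": {
              "workload-status": {"current": "active",
                                  "message": "Container runtime available"}}
          }
        }
      }
    },
    "calico": {"subordinate-to": ["kubernetes-master"]}
  }
}
""")

stuck_units = find_stuck_units(SAMPLE_STATUS)
```

In a real run the input would come from something like `json.loads(subprocess.check_output(["juju", "status", "--format=json"]))` rather than from the embedded sample.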
There is no obvious error in the juju logs for these units:
$ calico_
...
2021-08-30 00:24:25 INFO juju-log Invoking reactive handler: reactive/
2021-08-30 00:24:25 WARNING leader-
...
However, the calico logs show a crash starting around the same time, which repeated indefinitely until the run was torn down:
$ calico_
...
2021-08-30 00:24:26.346 [WARNING][60] int_dataplane.go 723: failed to set XDP failsafe ports, disabling XDP: mkdir /sys/fs/bpf/calico: permission denied
2021-08-30 00:24:26.401 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=0
2021-08-30 00:24:26.473 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=1
2021-08-30 00:24:26.541 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=2
2021-08-30 00:24:26.593 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=3
2021-08-30 00:24:26.919 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=4
2021-08-30 00:24:26.989 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=5
2021-08-30 00:24:27.045 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=6
2021-08-30 00:24:27.105 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=7
2021-08-30 00:24:27.165 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=8
2021-08-30 00:24:27.233 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=9
2021-08-30 00:24:27.233 [PANIC][60] int_dataplane.go 779: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x1a8e900,
goroutine 1 [running]:
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
github.
/<email address hidden>
main.main()
...
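Because the crash loop repeats, it can help to summarize the log rather than read it line by line. The sketch below is one possible way to do that, assuming every log entry follows the `timestamp [LEVEL][pid] file line: message` layout seen in the excerpt above. The names `LINE_RE`, `DENIED_RE` and `summarize` are illustrative, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Matches entries such as:
# 2021-08-30 00:24:27.233 [PANIC][60] int_dataplane.go 779: Failed to wipe ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>\w+)\]\[(?P<pid>\d+)\] "
    r"(?P<file>\S+) (?P<line>\d+): (?P<msg>.*)$"
)
# Extracts the directory that could not be created.
DENIED_RE = re.compile(r"mkdir (?P<path>\S+): permission denied")


def summarize(lines):
    """Count entries per log level and permission-denied errors per path."""
    levels = Counter()
    denied_paths = Counter()
    for raw in lines:
        entry = LINE_RE.match(raw.strip())
        if entry is None:
            continue  # skip stack-trace lines, "..." markers, and so on
        levels[entry["level"]] += 1
        denied = DENIED_RE.search(entry["msg"])
        if denied:
            denied_paths[denied["path"]] += 1
    return levels, denied_paths


SAMPLE_LINES = [
    "2021-08-30 00:24:26.346 [WARNING][60] int_dataplane.go 723: failed to set XDP failsafe ports, disabling XDP: mkdir /sys/fs/bpf/calico: permission denied",
    "2021-08-30 00:24:27.233 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=9",
    "2021-08-30 00:24:27.233 [PANIC][60] int_dataplane.go 779: Failed to wipe the XDP state after 10 tries",
    "panic: (*logrus.Entry) (0x1a8e900,",
]

levels, denied_paths = summarize(SAMPLE_LINES)
```

A non-zero `levels["PANIC"]` together with repeated entries in `denied_paths` points at the same symptom described here: the process cannot create its directory under `/sys/fs/bpf` and panics after exhausting its retries.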
Note we have "disable-
An example of this failure can be found here: https:/
Along with its crashdump: https:/
And bundle: https:/
All occurrences of this bug can be found here: https:/
description: updated
tags: added: cdo-relase-blocker cdoqa foundation-engine
tags: added: cdo-qa cdo-release-blocker foundations-engine; removed: cdo-relase-blocker cdoqa foundation-engine
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released
Kubernetes engineering determined that this was caused by a change to lxd-profile.yaml in both the kubernetes-master and kubernetes-worker charms. That change was originally made for https://bugs.launchpad.net/snapd/+bug/1907153. We decided that a workaround for that issue was the best path forward to ensure calico/vxlan/lxd functionality did not regress.
PRs:
- https://github.com/charmed-kubernetes/charm-kubernetes-master/pull/177
- https://github.com/charmed-kubernetes/charm-kubernetes-worker/pull/97