During the 1.22 release gate we are seeing the calico units on the kubernetes-master units stay in the "waiting" state indefinitely with the status "Waiting to retry disabling VXLAN TX checksumming":
kubernetes-master/0* active idle 0/lxd/1 10.246.64.231 6443/tcp Kubernetes master running.
calico/6 waiting idle 10.246.64.231 Waiting to retry disabling VXLAN TX checksumming
containerd/6 active idle 10.246.64.231 Container runtime available
hacluster-kubernetes-master/0* active idle 10.246.64.231 Unit is ready and clustered
kubernetes-master/1 active idle 2/lxd/1 10.246.64.236 6443/tcp Kubernetes master running.
calico/7 waiting idle 10.246.64.236 Waiting to retry disabling VXLAN TX checksumming
containerd/7 active idle 10.246.64.236 Container runtime available
hacluster-kubernetes-master/1 active idle 10.246.64.236 Unit is ready and clustered
kubernetes-master/2 active idle 4/lxd/1 10.246.64.237 6443/tcp Kubernetes master running.
calico/8 waiting idle 10.246.64.237 Waiting to retry disabling VXLAN TX checksumming
containerd/8 active idle 10.246.64.237 Container runtime available
hacluster-kubernetes-master/2 active idle 10.246.64.237 Unit is ready and clustered
There is no obvious error in the juju logs for these units:
$ calico_8/var/log/juju/unit-calico-8.log
...
2021-08-30 00:24:25 INFO juju-log Invoking reactive handler: reactive/calico.py:739:disable_vxlan_tx_checksumming
2021-08-30 00:24:25 WARNING leader-settings-changed Cannot get device feature names: No such device
...
However, in the calico (felix) logs we see a crash starting around the same time, which repeats indefinitely until the run is torn down:
$ calico_8/var/log/felix/current
...
2021-08-30 00:24:26.346 [WARNING][60] int_dataplane.go 723: failed to set XDP failsafe ports, disabling XDP: mkdir /sys/fs/bpf/calico: permission denied
2021-08-30 00:24:26.401 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=0
2021-08-30 00:24:26.473 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=1
2021-08-30 00:24:26.541 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=2
2021-08-30 00:24:26.593 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=3
2021-08-30 00:24:26.919 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=4
2021-08-30 00:24:26.989 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=5
2021-08-30 00:24:27.045 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=6
2021-08-30 00:24:27.105 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=7
2021-08-30 00:24:27.165 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=8
2021-08-30 00:24:27.233 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=9
2021-08-30 00:24:27.233 [PANIC][60] int_dataplane.go 779: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x1a8e900,0xc000152370)
goroutine 1 [running]:
github.com/sirupsen/logrus.Entry.log(0xc0000b8050, 0xc0006a6540, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f0b00000000, ...)
	/<email address hidden>/entry.go:112 +0x2d2
github.com/sirupsen/logrus.(*Entry).Panic(0xc0001521e0, 0xc00045a250, 0x1, 0x1)
	/<email address hidden>/entry.go:182 +0x103
github.com/sirupsen/logrus.(*Entry).Panicf(0xc0001521e0, 0x1b11d88, 0x2b, 0xc00045a300, 0x1, 0x1)
	/<email address hidden>/entry.go:230 +0xd4
github.com/sirupsen/logrus.(*Logger).Panicf(0xc0000b8050, 0x1b11d88, 0x2b, 0xc00045a300, 0x1, 0x1)
	/<email address hidden>/logger.go:173 +0x86
github.com/sirupsen/logrus.Panicf(...)
	/<email address hidden>/exported.go:145
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).shutdownXDPCompletely(0xc0001f9680)
	/<email address hidden>/dataplane/linux/int_dataplane.go:779 +0x2cd
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).doStaticDataplaneConfig(0xc0001f9680)
	/<email address hidden>/dataplane/linux/int_dataplane.go:724 +0xc22
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).Start(0xc0001f9680)
	/<email address hidden>/dataplane/linux/int_dataplane.go:584 +0x2f
github.com/projectcalico/felix/dataplane.StartDataplaneDriver(0xc000530000, 0xc0004fb2c0, 0xc0003c9ba0, 0x1, 0xc0003d17c0, 0x0)
	/<email address hidden>/dataplane/driver.go:186 +0xf09
github.com/projectcalico/felix/daemon.Run(0x1ae3b30, 0x15, 0x1db1ff0, 0x7, 0x1e083e0, 0x28, 0x1ddf000, 0x18)
	/<email address hidden>/daemon/daemon.go:304 +0x18d7
main.main()
	/go/src/github.com/projectcalico/node/cmd/calico-node/main.go:100 +0x405
...
An example of this failure can be found here: https://solutions.qa.canonical.com/testruns/testRun/6d914153-cae3-4445-a444-668b7c3c9650
Along with its crashdump: https://oil-jenkins.canonical.com/artifacts/6d914153-cae3-4445-a444-668b7c3c9650/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-08-30-04.17.50.tar.gz
And bundle: https://oil-jenkins.canonical.com/artifacts/6d914153-cae3-4445-a444-668b7c3c9650/generated/generated/kubernetes/bundle.yaml
Note we have "disable-vxlan-tx-checksumming" set to "true" per the default in the charm.
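
To check other crashdumps for the same signature, here is a minimal Python sketch that scans felix log lines, counts the "failed to wipe the XDP state" retries, and records the timestamp of each matching PANIC. The function name `summarize_xdp_failures` and the regular expression are assumptions made for illustration. They are based only on the line layout in the excerpt above, not on any felix or juju tooling.

```python
import re

# Assumed layout, taken from the felix excerpt in this report:
#   <date> <time> [LEVEL][pid] <source file> <line>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>[A-Z]+)\]\[(?P<pid>\d+)\] "
    r"(?P<src>\S+) (?P<lineno>\d+): (?P<msg>.*)$"
)


def summarize_xdp_failures(lines):
    """Count XDP wipe retries and collect timestamps of the matching panics."""
    retries = 0
    panics = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            # Stack trace lines and "..." elisions do not match; skip them.
            continue
        level, msg = match["level"], match["msg"]
        if level == "WARNING" and "failed to wipe the XDP state" in msg:
            retries += 1
        elif level == "PANIC" and "Failed to wipe the XDP state" in msg:
            panics.append(match["ts"])
    return {"wipe_retries": retries, "panics": panics}


# Three lines copied from the excerpt above, used as a small sample.
sample = [
    "2021-08-30 00:24:26.346 [WARNING][60] int_dataplane.go 723: failed to set XDP failsafe ports, disabling XDP: mkdir /sys/fs/bpf/calico: permission denied",
    "2021-08-30 00:24:27.233 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=9",
    "2021-08-30 00:24:27.233 [PANIC][60] int_dataplane.go 779: Failed to wipe the XDP state after 10 tries",
]
summary = summarize_xdp_failures(sample)
print(summary)
```

To run it against a real crashdump, pass the open file object for calico_8/var/log/felix/current instead of `sample`.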