Comment 0 for bug 1942099

Michael Skalka (mskalka) wrote :

During the 1.22 release gate we are seeing the calico units on the kubernetes-master units stay stuck indefinitely in the "waiting" state with the status message "Waiting to retry disabling VXLAN TX checksumming":

kubernetes-master/0* active idle 0/lxd/1 10.246.64.231 6443/tcp Kubernetes master running.
  calico/6 waiting idle 10.246.64.231 Waiting to retry disabling VXLAN TX checksumming
  containerd/6 active idle 10.246.64.231 Container runtime available
  hacluster-kubernetes-master/0* active idle 10.246.64.231 Unit is ready and clustered
kubernetes-master/1 active idle 2/lxd/1 10.246.64.236 6443/tcp Kubernetes master running.
  calico/7 waiting idle 10.246.64.236 Waiting to retry disabling VXLAN TX checksumming
  containerd/7 active idle 10.246.64.236 Container runtime available
  hacluster-kubernetes-master/1 active idle 10.246.64.236 Unit is ready and clustered
kubernetes-master/2 active idle 4/lxd/1 10.246.64.237 6443/tcp Kubernetes master running.
  calico/8 waiting idle 10.246.64.237 Waiting to retry disabling VXLAN TX checksumming
  containerd/8 active idle 10.246.64.237 Container runtime available
  hacluster-kubernetes-master/2 active idle 10.246.64.237 Unit is ready and clustered

There is no obvious error in the juju logs for these units:

$ calico_8/var/log/juju/unit-calico-8.log
...
2021-08-30 00:24:25 INFO juju-log Invoking reactive handler: reactive/calico.py:739:disable_vxlan_tx_checksumming
2021-08-30 00:24:25 WARNING leader-settings-changed Cannot get device feature names: No such device
...
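
For context, "Cannot get device feature names: No such device" is the message ethtool prints when the named interface does not exist. The sketch below shows one way such a handler could check for the device before invoking ethtool. It is not the charm's code: the interface name "vxlan.calico" (Calico's usual VXLAN device name), the helper names, and the exact ethtool arguments are all assumptions made for illustration.

```python
import os
import subprocess

# Assumption: Calico's usual VXLAN interface name; the report does not name the device.
VXLAN_DEVICE = "vxlan.calico"


def device_exists(name, sysfs_root="/sys/class/net"):
    """Return True if the kernel exposes a network device with this name."""
    return os.path.exists(os.path.join(sysfs_root, name))


def disable_tx_checksumming(name=VXLAN_DEVICE):
    """Turn off TX checksum offload; return False if the device is absent."""
    if not device_exists(name):
        # ethtool would fail here with "No such device", as in the log above.
        return False
    # Assumption: illustrative ethtool invocation, not copied from the charm.
    subprocess.check_call(
        ["ethtool", "-K", name, "tx-checksum-ip-generic", "off"]
    )
    return True
```

If the device never appears (for example because felix keeps crashing before it creates it), a check like this would keep returning False, which would be consistent with the unit staying in "waiting".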

However, in the calico logs we see a crash starting around the same time, which repeated until the run was torn down:

$ calico_8/var/log/felix/current
...
2021-08-30 00:24:26.346 [WARNING][60] int_dataplane.go 723: failed to set XDP failsafe ports, disabling XDP: mkdir /sys/fs/bpf/calico: permission denied
2021-08-30 00:24:26.401 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=0
2021-08-30 00:24:26.473 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=1
2021-08-30 00:24:26.541 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=2
2021-08-30 00:24:26.593 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=3
2021-08-30 00:24:26.919 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=4
2021-08-30 00:24:26.989 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=5
2021-08-30 00:24:27.045 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=6
2021-08-30 00:24:27.105 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=7
2021-08-30 00:24:27.165 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=8
2021-08-30 00:24:27.233 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=9
2021-08-30 00:24:27.233 [PANIC][60] int_dataplane.go 779: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x1a8e900,0xc000152370)

goroutine 1 [running]:
github.com/sirupsen/logrus.Entry.log(0xc0000b8050, 0xc0006a6540, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f0b00000000, ...)
        /<email address hidden>/entry.go:112 +0x2d2
github.com/sirupsen/logrus.(*Entry).Panic(0xc0001521e0, 0xc00045a250, 0x1, 0x1)
        /<email address hidden>/entry.go:182 +0x103
github.com/sirupsen/logrus.(*Entry).Panicf(0xc0001521e0, 0x1b11d88, 0x2b, 0xc00045a300, 0x1, 0x1)
        /<email address hidden>/entry.go:230 +0xd4
github.com/sirupsen/logrus.(*Logger).Panicf(0xc0000b8050, 0x1b11d88, 0x2b, 0xc00045a300, 0x1, 0x1)
        /<email address hidden>/logger.go:173 +0x86
github.com/sirupsen/logrus.Panicf(...)
        /<email address hidden>/exported.go:145
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).shutdownXDPCompletely(0xc0001f9680)
        /<email address hidden>/dataplane/linux/int_dataplane.go:779 +0x2cd
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).doStaticDataplaneConfig(0xc0001f9680)
        /<email address hidden>/dataplane/linux/int_dataplane.go:724 +0xc22
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).Start(0xc0001f9680)
        /<email address hidden>/dataplane/linux/int_dataplane.go:584 +0x2f
github.com/projectcalico/felix/dataplane.StartDataplaneDriver(0xc000530000, 0xc0004fb2c0, 0xc0003c9ba0, 0x1, 0xc0003d17c0, 0x0)
        /<email address hidden>/dataplane/driver.go:186 +0xf09
github.com/projectcalico/felix/daemon.Run(0x1ae3b30, 0x15, 0x1db1ff0, 0x7, 0x1e083e0, 0x28, 0x1ddf000, 0x18)
        /<email address hidden>/daemon/daemon.go:304 +0x18d7
main.main()
        /go/src/github.com/projectcalico/node/cmd/calico-node/main.go:100 +0x405
...
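
To check other units or crashdumps for the same pattern, a small script along these lines could summarize a felix log. This is a rough sketch that assumes the log lines look exactly like the excerpt above; the function name and the returned dictionary layout are made up for illustration.

```python
import re

# Matches the retry warnings, e.g.
# "... failed to wipe the XDP state error=<reason> try=3"
RETRY_RE = re.compile(
    r"failed to wipe the XDP state error=(?P<error>.+) try=(?P<try>\d+)$"
)
# Matches any line logged at PANIC level.
PANIC_RE = re.compile(r"\[PANIC\]")


def summarize_felix_log(lines):
    """Count XDP wipe retries and PANIC lines, and collect distinct errors."""
    retries = 0
    panics = 0
    errors = set()
    for line in lines:
        line = line.rstrip("\n")
        match = RETRY_RE.search(line)
        if match:
            retries += 1
            errors.add(match.group("error"))
        elif PANIC_RE.search(line):
            panics += 1
    return {"retries": retries, "panics": panics, "errors": sorted(errors)}
```

It can be run against an extracted crashdump with something like `summarize_felix_log(open("calico_8/var/log/felix/current"))`.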

An example of this failure can be found here: https://solutions.qa.canonical.com/testruns/testRun/6d914153-cae3-4445-a444-668b7c3c9650
Along with its crashdump: https://oil-jenkins.canonical.com/artifacts/6d914153-cae3-4445-a444-668b7c3c9650/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-08-30-04.17.50.tar.gz
And bundle: https://oil-jenkins.canonical.com/artifacts/6d914153-cae3-4445-a444-668b7c3c9650/generated/generated/kubernetes/bundle.yaml

Note that we have "disable-vxlan-tx-checksumming" set to "true", which is the default in the charm.