[1.22] Calico units on k8s-master blocked indefinitely with "Waiting to retry disabling VXLAN TX checksumming"

Bug #1942099 reported by Michael Skalka
Affects                         Status        Importance  Assigned to     Milestone
Calico Charm                    Invalid       Undecided   Unassigned
Kubernetes Control Plane Charm  Fix Released  Critical    Kevin W Monroe  1.22
Kubernetes Worker Charm         Fix Released  Critical    Kevin W Monroe  1.22

Bug Description

During the 1.22 release gate we are seeing the calico subordinate units on the kubernetes-master machines stay blocked indefinitely with the status "Waiting to retry disabling VXLAN TX checksumming":

kubernetes-master/0* active idle 0/lxd/1 10.246.64.231 6443/tcp Kubernetes master running.
  calico/6 waiting idle 10.246.64.231 Waiting to retry disabling VXLAN TX checksumming
  containerd/6 active idle 10.246.64.231 Container runtime available
  hacluster-kubernetes-master/0* active idle 10.246.64.231 Unit is ready and clustered
kubernetes-master/1 active idle 2/lxd/1 10.246.64.236 6443/tcp Kubernetes master running.
  calico/7 waiting idle 10.246.64.236 Waiting to retry disabling VXLAN TX checksumming
  containerd/7 active idle 10.246.64.236 Container runtime available
  hacluster-kubernetes-master/1 active idle 10.246.64.236 Unit is ready and clustered
kubernetes-master/2 active idle 4/lxd/1 10.246.64.237 6443/tcp Kubernetes master running.
  calico/8 waiting idle 10.246.64.237 Waiting to retry disabling VXLAN TX checksumming
  containerd/8 active idle 10.246.64.237 Container runtime available
  hacluster-kubernetes-master/2 active idle 10.246.64.237 Unit is ready and clustered

There is no obvious error in the juju logs for these units:

$ calico_8/var/log/juju/unit-calico-8.log
...
2021-08-30 00:24:25 INFO juju-log Invoking reactive handler: reactive/calico.py:739:disable_vxlan_tx_checksumming
2021-08-30 00:24:25 WARNING leader-settings-changed Cannot get device feature names: No such device
...
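
For context, the charm disables TX checksum offload on Calico's VXLAN device with ethtool, and "Cannot get device feature names: No such device" is the error ethtool prints when the target interface does not exist yet, so the charm stays in its retry loop. A minimal reproduction, assuming the interface name vxlan.calico (Calico's default VXLAN interface; the exact command the charm runs is an assumption):

$ ethtool -K vxlan.calico tx-checksum-ip-generic off
Cannot get device feature names: No such device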

However, in the calico felix logs we see a crash starting around the same time, which repeated indefinitely until the run was torn down:

$ calico_8/var/log/felix/current
...
2021-08-30 00:24:26.346 [WARNING][60] int_dataplane.go 723: failed to set XDP failsafe ports, disabling XDP: mkdir /sys/fs/bpf/calico: permission denied
2021-08-30 00:24:26.401 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=0
2021-08-30 00:24:26.473 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=1
2021-08-30 00:24:26.541 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=2
2021-08-30 00:24:26.593 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=3
2021-08-30 00:24:26.919 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=4
2021-08-30 00:24:26.989 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=5
2021-08-30 00:24:27.045 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=6
2021-08-30 00:24:27.105 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=7
2021-08-30 00:24:27.165 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=8
2021-08-30 00:24:27.233 [WARNING][60] int_dataplane.go 776: failed to wipe the XDP state error=mkdir /sys/fs/bpf/calico: permission denied try=9
2021-08-30 00:24:27.233 [PANIC][60] int_dataplane.go 779: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x1a8e900,0xc000152370)

goroutine 1 [running]:
github.com/sirupsen/logrus.Entry.log(0xc0000b8050, 0xc0006a6540, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f0b00000000, ...)
        /<email address hidden>/entry.go:112 +0x2d2
github.com/sirupsen/logrus.(*Entry).Panic(0xc0001521e0, 0xc00045a250, 0x1, 0x1)
        /<email address hidden>/entry.go:182 +0x103
github.com/sirupsen/logrus.(*Entry).Panicf(0xc0001521e0, 0x1b11d88, 0x2b, 0xc00045a300, 0x1, 0x1)
        /<email address hidden>/entry.go:230 +0xd4
github.com/sirupsen/logrus.(*Logger).Panicf(0xc0000b8050, 0x1b11d88, 0x2b, 0xc00045a300, 0x1, 0x1)
        /<email address hidden>/logger.go:173 +0x86
github.com/sirupsen/logrus.Panicf(...)
        /<email address hidden>/exported.go:145
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).shutdownXDPCompletely(0xc0001f9680)
        /<email address hidden>/dataplane/linux/int_dataplane.go:779 +0x2cd
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).doStaticDataplaneConfig(0xc0001f9680)
        /<email address hidden>/dataplane/linux/int_dataplane.go:724 +0xc22
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).Start(0xc0001f9680)
        /<email address hidden>/dataplane/linux/int_dataplane.go:584 +0x2f
github.com/projectcalico/felix/dataplane.StartDataplaneDriver(0xc000530000, 0xc0004fb2c0, 0xc0003c9ba0, 0x1, 0xc0003d17c0, 0x0)
        /<email address hidden>/dataplane/driver.go:186 +0xf09
github.com/projectcalico/felix/daemon.Run(0x1ae3b30, 0x15, 0x1db1ff0, 0x7, 0x1e083e0, 0x28, 0x1ddf000, 0x18)
        /<email address hidden>/daemon/daemon.go:304 +0x18d7
main.main()
        /go/src/github.com/projectcalico/node/cmd/calico-node/main.go:100 +0x405
...
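
The panic comes from felix's XDP shutdown path, which needs to create a pinned-BPF-map directory under /sys/fs/bpf. The underlying failure is easy to reproduce by hand from inside one of the affected containers (the path is taken from the log above):

$ mkdir /sys/fs/bpf/calico
mkdir: cannot create directory '/sys/fs/bpf/calico': Permission denied

After 10 failed retries felix panics and restarts, which produces the crash loop above.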

Note that we have "disable-vxlan-tx-checksumming" set to "true", which is the charm's default.
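
The effective value can be confirmed against the live model (assuming the application is deployed under the name "calico"):

$ juju config calico disable-vxlan-tx-checksumming
true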

An example of this failure can be found here: https://solutions.qa.canonical.com/testruns/testRun/6d914153-cae3-4445-a444-668b7c3c9650
Along with its crashdump: https://oil-jenkins.canonical.com/artifacts/6d914153-cae3-4445-a444-668b7c3c9650/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-08-30-04.17.50.tar.gz
And bundle: https://oil-jenkins.canonical.com/artifacts/6d914153-cae3-4445-a444-668b7c3c9650/generated/generated/kubernetes/bundle.yaml

All occurrences of this bug can be found here: https://solutions.qa.canonical.com/bugs/bugs/bug/1942099

Michael Skalka (mskalka)
description: updated
tags: added: cdo-qa cdo-release-blocker foundations-engine
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

The K8s engineering team determined this to be caused by a change to the lxd-profile.yaml in both the k8s-master and -worker charms. That profile was originally changed for https://bugs.launchpad.net/snapd/+bug/1907153; we decided that a workaround for that issue was the best path forward to ensure calico/vxlan/lxd functionality did not regress.

PRs:
- https://github.com/charmed-kubernetes/charm-kubernetes-master/pull/177
- https://github.com/charmed-kubernetes/charm-kubernetes-worker/pull/97
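
For illustration, a charm's lxd-profile.yaml is what controls how /sys is mounted inside the LXD container, and a read-write /sys is what lets felix mkdir under /sys/fs/bpf. The stanza below is a sketch of the relevant knob only, not the exact diff from the PRs above:

config:
  raw.lxc: |
    lxc.mount.auto=proc:rw sys:rw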

Changed in charm-calico:
status: New → Invalid
Changed in charm-kubernetes-master:
status: New → Fix Committed
Changed in charm-kubernetes-worker:
status: New → Fix Committed
Changed in charm-kubernetes-master:
importance: Undecided → Critical
Changed in charm-kubernetes-worker:
importance: Undecided → Critical
Changed in charm-kubernetes-master:
assignee: nobody → Kevin W Monroe (kwmonroe)
Changed in charm-kubernetes-worker:
assignee: nobody → Kevin W Monroe (kwmonroe)
Changed in charm-kubernetes-master:
milestone: none → 1.22
Changed in charm-kubernetes-worker:
milestone: none → 1.22
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released