test_network_policies fails on "Reaching out to nginx.netpolicy" with ignore-loose-rpf set to True

Bug #1916927 reported by Joshua Genet
Affects: Charmed Kubernetes Testing
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

As seen here: https://solutions.qa.canonical.com/testruns/testRun/f2001fbd-d680-4b28-a26d-b9e066059278

Crashdump here: https://oil-jenkins.canonical.com/artifacts/f2001fbd-d680-4b28-a26d-b9e066059278/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-02-25-07.40.03.tar.gz

---

In the past this issue was caused by not setting Calico's ignore-loose-rpf config to True. This time we do set it in our bundle here:
https://oil-jenkins.canonical.com/artifacts/f2001fbd-d680-4b28-a26d-b9e066059278/generated/generated/kubernetes/bundle.yaml

And you can see it taking effect in several places in the logs, with a message like this:
0/baremetal/var/log/calico/felix/current:
2021-02-25 06:20:10.169 [WARNING][67] int_dataplane.go 1026: Kernel's RPF check is set to 'loose' and IgnoreLooseRPF set to true. Calico will not be able to prevent workloads from spoofing their source IP. Please ensure that some other anti-spoofing mechanism is in place (such as running only non-privileged containers).

Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the report. Slightly different symptom this time, as the "Reaching out to nginx.netpolicy with restrictions" message is only logged once:

2021-02-25-07:06:50 root DEBUG Reaching out to nginx.netpolicy with no restrictions
2021-02-25-07:07:28 root DEBUG Reaching out to nginx.netpolicy with no restrictions
2021-02-25-07:07:41 root DEBUG Reaching out to nginx.netpolicy with restrictions
2021-02-25-07:36:00 root ERROR [localhost] Command failed: ...

Looks like the test is hanging here: https://github.com/charmed-kubernetes/jenkins/blob/9f180e2be0d209a6b82be93bda8f9623cd133bf8/jobs/integration/validation.py#L543-L552

At 07:09:03, kubernetes-master/2 acquires a machine lock for action 108 and never releases it:

2021-02-25 07:09:03 DEBUG juju.machinelock machinelock.go:172 machine lock acquired for kubernetes-master/2 uniter (run action 108)
2021-02-25 07:09:03 DEBUG juju.worker.uniter.operation executor.go:132 preparing operation "run action 108" for kubernetes-master/2
2021-02-25 07:09:03 DEBUG juju.worker.uniter.operation executor.go:132 executing operation "run action 108" for kubernetes-master/2
2021-02-25 07:09:03 DEBUG juju.worker.uniter.runner runner.go:288 juju-run action is running

That action is definitely holding things up. Unfortunately, I'm not able to find which action that is or what command it ran. That info doesn't appear to be collected in the crashdump.

Revision history for this message
George Kraft (cynerva) wrote :

For test_network_policy I recommend the following changes (sketched below):
1) Add --request-timeout=1m to the `kubectl exec` calls.
2) Use ssh instead of unit.run to prevent it from acquiring the machine lock (which can prevent the debug action from running).
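
For illustration, the two changes could look roughly like this (a sketch only; the helper names and structure are mine, not the actual validation.py code):

# Sketch of recommendations 1 and 2 above; helper names are placeholders.
import shlex
import subprocess

def kubectl_exec(pod, namespace, command):
    # Recommendation 1: bound the API request so a hung apiserver
    # cannot stall the test indefinitely.
    return subprocess.run(
        ["kubectl", "exec", pod, "-n", namespace,
         "--request-timeout=1m",
         "--", *shlex.split(command)],
        capture_output=True, text=True, timeout=120,
    )

def run_on_unit(unit_name, command):
    # Recommendation 2: use `juju ssh` instead of unit.run so the command
    # never takes the Juju machine lock.
    return subprocess.run(
        ["juju", "ssh", unit_name, command],
        capture_output=True, text=True, timeout=120,
    )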

I have also requested changes to juju-crashdump that would help us diagnose future failures in more detail:
1) https://github.com/juju/juju-crashdump/issues/81
2) https://github.com/juju/juju-crashdump/issues/82

Changed in charmed-kubernetes-testing:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Joshua Genet (genet022) wrote :

I spaced on responding here right away, but thanks for looking into this.
We'll get those juju-crashdump changes added and appreciate the issues you filed!

Revision history for this message
Alexander Balderson (asbalderson) wrote :

George Kraft (cynerva)
Changed in charmed-kubernetes-testing:
status: Triaged → New
Revision history for this message
George Kraft (cynerva) wrote :

Thank you. The new crashdump helps. The hung process is:

root 239311 0.0 0.0 8696 3424 ? S 19:37 0:00 /bin/bash /tmp/juju-exec025225977/script.sh
root 239312 0.0 0.1 746368 39576 ? Sl 19:37 0:00 /snap/kubectl/1927/kubectl --kubeconfig /root/.kube/config delete ns netpolicy

kube-controller-manager logs the failure to delete the namespace:

E0421 19:37:07.995259 148767 namespace_controller.go:162] deletion of namespace netpolicy failed: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

The metrics-server pod is in CrashLoopBackOff. Its logs were empty at the time this crashdump was captured, but I was able to find the error from metrics-server in the originally reported crashdump:

Error: Get https://10.152.183.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: net/http: TLS handshake timeout

I believe this is caused by a conflict between the Calico CIDR (192.168.0.0/16) and the internal network space (192.168.33.135/??). I checked the most recent crashdump, and it does appear that the conflict is still occurring.

I recommend setting the Calico charm's cidr config to something that does not collide with the host networking.
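
For reference, the overlap is easy to confirm from the two values quoted above (a quick sanity check, not part of the test suite):

# Overlap check using the addresses from this bug.
import ipaddress

calico_cidr = ipaddress.ip_network("192.168.0.0/16")    # Calico charm default pod CIDR
host_address = ipaddress.ip_address("192.168.33.135")   # address on the internal network space

# Prints True: pod and host traffic share address space, which is what can
# break in-cluster calls like the metrics-server TLS handshake above.
print(host_address in calico_cidr)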

Revision history for this message
George Kraft (cynerva) wrote :

I'll leave this open against charmed-kubernetes-testing to track some improvements to the test. In addition to my recommendations in comment #2, I also recommend that the test log a message prior to deleting the namespace so that, if it gets stuck again, we know where it's stuck.
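
Something along these lines would do (a sketch only; the logger and command invocation are stand-ins for whatever the test actually uses):

# Log before and after the namespace delete so a hang here is visible in the test log.
import logging
import subprocess

log = logging.getLogger(__name__)

def delete_netpolicy_namespace():
    log.debug("Deleting namespace netpolicy")
    subprocess.run(
        ["kubectl", "delete", "ns", "netpolicy", "--request-timeout=1m"],
        check=True, timeout=120,
    )
    log.debug("Namespace netpolicy deleted")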

Changed in charmed-kubernetes-testing:
importance: High → Medium
status: New → Triaged
Revision history for this message
George Kraft (cynerva) wrote :

But to be clear, I believe this ultimately needs to be fixed in your Calico CIDR configuration.

Revision history for this message
Joshua Genet (genet022) wrote :

We ran into this again today (it's been a while).

Here's the run:
https://solutions.qa.canonical.com/testruns/testRun/8e8c272a-d3f9-4945-8db3-ab265fc1633e

In the run comments you can see our Calico config, including the CIDR.

We also have this run from yesterday that succeeded using the exact same hardware and config (including our Calico CIDR):
https://solutions.qa.canonical.com/testruns/testRun/37f1d6c7-9c13-4cc5-84a9-553f32e6af1e

---

This suggests that either our CIDR config is fine, or that it is incorrect and causing *intermittent* errors rather than consistent ones. Any insight?
