Charmed Kubernetes Testing

kube-ovn controller pod is in crashloop backoff after cidr-expansion

Bug #1995139 reported by Adam Dyess on 2022-10-28

This bug affects 1 person

Affects                         Status        Importance  Assigned to   Milestone
Charmed Kubernetes Testing      Fix Released  High        George Kraft  Charmed Kubernetes Testing 1.26
Kubernetes Control Plane Charm  Fix Released  High        George Kraft  Kubernetes Control Plane Charm 1.26

Bug Description

While running the end-to-end validation tests from the jenkins repository [0], a test failure usually shows up in the wrong spot: after the CIDR expansion tests rather than during them. The underlying problem was revealed by a pod in the kube-ovn-controller deployment stuck in CrashLoopBackOff [1]. I checked the deployment config [2], and it had the correct expanded CIDR.

Steps I believe are necessary to reproduce:
```
juju add-model kubernetes-ovn
juju deploy charmed-kubernetes --overlay overlays/kube-ovn.yaml
juju-wait
tox -e py -- .tox/py/bin/pytest jobs/integration/validation.py --cloud $CLOUD --controller $CONTROLLER --model kubernetes-ovn -k "cidr_expansion and toggle_metrics"
```

[0]: https://github.com/charmed-kubernetes/jenkins/blob/main/jobs/integration/validation.py
[1]: https://paste.ubuntu.com/p/ytMZMfRGd8/
[2]: https://paste.ubuntu.com/p/tt9vYcPTrm/

Adam Dyess (addyess) wrote on 2022-10-28:

While I was investigating the crash and writing this bug report, the deployment eventually stabilized and kube-ovn-controller came up without crashing. This may be why the rest of the tests in the suite continue normally. Perhaps recovery takes longer than expected once the CIDR expansion has begun?

Adam Dyess (addyess) wrote on 2022-10-28:

After it had been stable for 15 minutes, I ran only the `toggle_metrics` test again, and kube-ovn-controller began crashing again. The crash may not be entirely caused by this test, but something about what changing the metrics-server config does within the control-plane charm exacerbates it.

Adam Dyess (addyess) wrote on 2022-10-28 (last edit on 2022-10-28):

App                       Version  Status  Scale  Charm                     Channel  Rev  Exposed  Message
containerd                go1.18   active      5  containerd                stable    41  no       Container runtime available
easyrsa                   3.0.1    active      1  easyrsa                   stable    26  no       Certificate Authority connected.
etcd                      3.4.5    active      3  etcd                      stable   718  no       Healthy with 3 known peers
kube-ovn                           active      5  kube-ovn                  edge      34  no
kubeapi-load-balancer     1.18.0   active      1  kubeapi-load-balancer     stable    42  yes      Loadbalancer ready.
kubernetes-control-plane  1.25.3   active      2  kubernetes-control-plane  edge     208  no       Kubernetes control-plane running.
kubernetes-worker         1.25.3   active      3  kubernetes-worker         edge      72  yes      Kubernetes worker running.

Adam Dyess (addyess) wrote on 2022-10-28:

Things look good in the kube-controller logs while it adds the metrics-server pod [0]

But then the api-server is restarted by kubernetes-control-plane and I lose the logs. When the api-server is back up, I can grab some of the crash logs [1].

[0]: https://paste.ubuntu.com/p/XJRXDngbR3/
[1]: https://paste.ubuntu.com/p/KfmwBfqYxK/

George Kraft (cynerva) wrote on 2022-10-28:

I can repro this. The kube-ovn-controller pod enters CrashLoopBackOff because of disrupted access to the Kubernetes API.

When test_service_cidr_expansion runs, there's a brief period where kube-apiserver serves with an old certificate that doesn't have the new 10.152.182.1 address in its SANs. This causes x509 errors that cause kube-ovn-controller to crash.
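To illustrate why the old certificate fails, here is a minimal Python sketch. It relies on the fact that the in-cluster `kubernetes` service takes the first usable address of the service CIDR. The two CIDR values and the SAN set below are assumptions chosen to match the 10.152.182.1 address mentioned above; they are not taken from the test itself, and the helper names are hypothetical.

```python
import ipaddress


def first_service_ip(service_cidr):
    """Return the first usable address of a service CIDR as a string."""
    return str(ipaddress.ip_network(service_cidr)[1])


def cert_covers(service_cidr, cert_sans):
    """True if the certificate SANs include the API service IP for this CIDR."""
    return first_service_ip(service_cidr) in cert_sans


# Assumed values for illustration only.
old_cidr = "10.152.183.0/24"
new_cidr = "10.152.182.0/23"

# SANs of a certificate issued before the expansion (hypothetical).
old_sans = {first_service_ip(old_cidr), "127.0.0.1"}

print(first_service_ip(new_cidr))         # 10.152.182.1
print(cert_covers(new_cidr, old_sans))    # False: clients would see x509 errors
```

Until kube-apiserver is serving a certificate reissued with the new address, any client that connects through the new service IP fails verification.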

When test_toggle_metrics runs, it toggles the enable-metrics config, which causes kube-apiserver to get reconfigured and restarted. This causes "connection refused" errors that cause kube-ovn-controller to crash.

Each time it crashes, the backoff gets exponentially worse, up to a cap of 5 minutes between each restart. That's longer than the timeout of test_toggle_metrics.
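A rough Python sketch of how the restart delays add up. The 10-second starting delay and the doubling factor are assumptions based on the kubelet's documented defaults; only the exponential growth and the 5-minute cap come from the comment above. The function names are hypothetical.

```python
from itertools import islice


def crashloop_delays(base=10, factor=2, cap=300):
    """Yield successive restart delays in seconds, growing until the cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def total_wait(restarts, **kwargs):
    """Sum of the back-off delays preceding the given number of restarts."""
    return sum(islice(crashloop_delays(**kwargs), restarts))


delays = list(islice(crashloop_delays(), 7))
print(delays)         # [10, 20, 40, 80, 160, 300, 300]
print(total_wait(6))  # 610
```

Under these assumptions, a pod that has already crashed five times waits a full 300 seconds before its next restart, so a test that triggers one more crash can outlast any timeout shorter than that.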

I recommend fixing this in two ways:
1. Raise the test_toggle_metrics timeout to 10 minutes
2. Update kubernetes-control-plane so it doesn't reconfigure kube-apiserver every time enable-metrics changes
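The second recommendation amounts to restarting kube-apiserver only when its own configuration actually changes. The following Python sketch shows one way that idea could look; it is not the charm's code, and `ApiServerConfigurator`, `config_digest`, and the example arguments are all hypothetical.

```python
import hashlib
import json


def config_digest(args):
    """Stable digest of a dict of kube-apiserver arguments."""
    payload = json.dumps(args, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


class ApiServerConfigurator:
    """Tracks the last applied arguments and restarts only on a real change."""

    def __init__(self):
        self._last_digest = None
        self.restarts = 0

    def apply(self, args):
        digest = config_digest(args)
        if digest == self._last_digest:
            return False  # nothing changed, leave the service alone
        self._last_digest = digest
        self.restarts += 1  # placeholder for the real restart action
        return True


configurator = ApiServerConfigurator()
args = {"service-cluster-ip-range": "10.152.182.0/23"}  # assumed example value
configurator.apply(args)        # first apply restarts
configurator.apply(dict(args))  # identical arguments: no restart
print(configurator.restarts)    # 1
```

With a guard like this, toggling an unrelated option such as enable-metrics would not interrupt API access, provided that option does not feed into the apiserver arguments.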

no longer affects:	charm-kube-ovn
Changed in charmed-kubernetes-testing:
milestone:	none → 1.25+ck3
Changed in charm-kubernetes-master:
milestone:	none → 1.25+ck3
Changed in charmed-kubernetes-testing:
status:	New → Triaged
Changed in charm-kubernetes-master:
status:	New → Triaged
Changed in charmed-kubernetes-testing:
assignee:	nobody → George Kraft (cynerva)
Changed in charm-kubernetes-master:
assignee:	nobody → George Kraft (cynerva)
Changed in charmed-kubernetes-testing:
status:	Triaged → In Progress
Changed in charm-kubernetes-master:
status:	Triaged → In Progress
Changed in charmed-kubernetes-testing:
importance:	Undecided → High
Changed in charm-kubernetes-master:
importance:	Undecided → High

George Kraft (cynerva) wrote on 2022-10-28:

PRs:
https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/pull/254
https://github.com/charmed-kubernetes/jenkins/pull/1088

Kevin W Monroe (kwmonroe) on 2022-11-09

Changed in charmed-kubernetes-testing:
status:	In Progress → Fix Committed
Changed in charm-kubernetes-master:
status:	In Progress → Fix Committed

Kevin W Monroe (kwmonroe) on 2022-11-30

Changed in charmed-kubernetes-testing:
milestone:	1.25+ck3 → 1.26
Changed in charm-kubernetes-master:
milestone:	1.25+ck3 → 1.26

Adam Dyess (addyess) on 2022-12-15

Changed in charmed-kubernetes-testing:
status:	Fix Committed → Fix Released
Changed in charm-kubernetes-master:
status:	Fix Committed → Fix Released
