CK model deployments sometimes get stuck with etcd and calico co-located on the same machine
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Calico Charm | Fix Released | High | Adam Dyess | 1.26+ck3 |
| Etcd Charm | Fix Released | High | George Kraft | 1.26+ck3 |
Bug Description
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case, the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico alongside them);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https:/
- The calico charm may take the machine-wide Juju lock while calling calicoctl.
We suppose that when all of these factors come together, the deployment can become stuck as in the aforementioned sample: the calico charm calls calicoctl to save calico data, such as pool configuration, into the etcd cluster before the cluster has been initialized, which causes calicoctl to hang. Because the charm takes the Juju machine lock while calling calicoctl, the whole machine freezes waiting for calicoctl to terminate, which never happens due to the calicoctl issue. And because calico runs on all the Kubernetes nodes, the whole model appears to get stuck.
I'm not sure whether this is a calico or an etcd charm bug; I'm filing it on the calico-charm project initially. Please feel free to reassign it to the proper project.
description: updated

Changed in charm-calico:
status: Triaged → Fix Committed

Changed in charm-etcd:
status: Triaged → Fix Committed

Changed in charm-calico:
assignee: nobody → Adam Dyess (addyess)
milestone: none → 1.26+ck3

Changed in charm-etcd:
assignee: nobody → George Kraft (cynerva)
milestone: none → 1.26+ck3

tags: added: backport-needed

Changed in charm-calico:
status: Fix Committed → Fix Released

Changed in charm-etcd:
status: Fix Committed → Fix Released
Thanks for the report. I would say this affects both etcd and calico.
For etcd: Units are sending cluster connection details before etcd is ready[1]. It should delay sending cluster connection details until after etcd has successfully registered (i.e. wait for the "etcd.registered" flag).
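The etcd-side fix can be sketched as flag-gated sending. The snippet below is a self-contained simulation, not the actual layer-etcd code; apart from "etcd.registered", all names and values are illustrative.

```python
# Simulate charms.reactive-style flag gating: connection details are
# only sent once BOTH flags are present, so a unit that has merely
# joined the cluster (but not yet registered) stays quiet.
flags = set()
sent_details = []

def maybe_send_cluster_details():
    if {'cluster.joined', 'etcd.registered'} <= flags:
        sent_details.append('https://10.0.0.1:2379')

flags.add('cluster.joined')
maybe_send_cluster_details()   # too early: nothing is sent
flags.add('etcd.registered')
maybe_send_cluster_details()   # registered: details go out
```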
For calico: Units are letting a hung calicoctl process block the machine lock indefinitely. It should wrap calicoctl calls[2] with a timeout so that the cluster can eventually unstick itself in case of similar issues.
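A minimal sketch of such a timeout wrapper, assuming calicoctl is invoked via subprocess (the binary path and the 60-second default here are illustrative, not what the charm actually uses):

```python
import subprocess

def calicoctl(*args, timeout=60, binary='/opt/calicoctl/calicoctl'):
    """Run calicoctl, raising TimeoutExpired instead of hanging forever.

    When the timeout fires the child process is killed and the hook
    fails, which releases the Juju machine lock so Juju can retry the
    hook later rather than deadlocking the whole machine.
    """
    return subprocess.check_output([binary, *args], timeout=timeout)
```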
As a workaround, I suspect if you kill hung calicoctl processes repeatedly, Juju will eventually get through its backlog of hooks and allow the etcd units to progress.
[1]: https://github.com/charmed-kubernetes/layer-etcd/blob/ae98be0046953ced628f682eee266d0e875a62b0/reactive/etcd.py#L283-L287
[2]: https://github.com/charmed-kubernetes/layer-calico/blob/2287a08ea5c7940bbe9b07be179e1da15b51cba1/reactive/calico.py#L615-L624
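The workaround above can be sketched as a small loop to run on each affected machine. This is purely illustrative: the round count, interval, and use of `pkill -f` are assumptions, not a tested procedure.

```python
import subprocess
import time

def kill_hung_calicoctl(rounds=20, interval=30):
    """Repeatedly kill hung calicoctl processes so Juju can drain its
    hook backlog and eventually let the etcd units progress."""
    for _ in range(rounds):
        # pkill exits non-zero when nothing matched; that's fine here.
        subprocess.run(['pkill', '-f', 'calicoctl'], check=False)
        time.sleep(interval)
```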