CK model deployments sometimes get stuck with etcd and calico co-located on the same machine
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Calico Charm | Fix Released | High | Adam Dyess | 1.26+ck3 |
| Etcd Charm | Fix Released | High | George Kraft | 1.26+ck3 |
Bug Description
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case, the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico alongside them);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https:/
- The calico charm may take the machine-wide Juju lock while calling calicoctl.
We suppose that when all of these factors come together, the deployment can become stuck as in the aforementioned sample: the calico charm calls calicoctl to save calico data, such as pool configuration, into the etcd cluster before the cluster has been initialized, which causes calicoctl to hang. Because the charm takes the Juju machine lock while calling calicoctl, the whole machine freezes waiting for calicoctl to terminate, which never happens due to the calicoctl issue. And because calico runs on all the Kubernetes nodes, the whole model appears to get stuck.
I'm not sure whether this is a calico or an etcd charm bug; I'm filing it on the calico-charm project initially. Please feel free to reassign it to the proper project.
description: updated

Changed in charm-calico:
status: Triaged → Fix Committed

Changed in charm-etcd:
status: Triaged → Fix Committed

Changed in charm-calico:
assignee: nobody → Adam Dyess (addyess)
milestone: none → 1.26+ck3

Changed in charm-etcd:
assignee: nobody → George Kraft (cynerva)
milestone: none → 1.26+ck3

tags: added: backport-needed

Changed in charm-calico:
status: Fix Committed → Fix Released

Changed in charm-etcd:
status: Fix Committed → Fix Released
Thanks for the report. I would say this affects both etcd and calico.
For etcd: Units are sending cluster connection details before etcd is ready[1]. It should delay sending cluster connection details until after etcd has successfully registered (i.e. wait for the "etcd.registered" flag).
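The etcd-side fix can be sketched as flag-gated sending. The snippet below is a self-contained simulation, not the actual layer-etcd code; apart from "etcd.registered", all names and values are illustrative.

```python
# Simulate charms.reactive-style flag gating: connection details are
# only sent once BOTH flags are present, so a unit that has merely
# joined the cluster (but not yet registered) stays quiet.
flags = set()
sent_details = []

def maybe_send_cluster_details():
    if {'cluster.joined', 'etcd.registered'} <= flags:
        sent_details.append('https://10.0.0.1:2379')

flags.add('cluster.joined')
maybe_send_cluster_details()   # too early: nothing is sent
flags.add('etcd.registered')
maybe_send_cluster_details()   # registered: details go out
```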
For calico: Units are letting a hung calicoctl process block the machine lock indefinitely. It should wrap calicoctl calls[2] with a timeout so that the cluster can eventually unstick itself in case of similar issues.
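A minimal sketch of such a timeout wrapper, assuming calicoctl is invoked via subprocess (the binary path and the 60-second default here are illustrative, not what the charm actually uses):

```python
import subprocess

def calicoctl(*args, timeout=60, binary='/opt/calicoctl/calicoctl'):
    """Run calicoctl, raising TimeoutExpired instead of hanging forever.

    When the timeout fires the child process is killed and the hook
    fails, which releases the Juju machine lock so Juju can retry the
    hook later rather than deadlocking the whole machine.
    """
    return subprocess.check_output([binary, *args], timeout=timeout)
```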
As a workaround, I suspect if you kill hung calicoctl processes repeatedly, Juju will eventually get through its backlog of hooks and allow the etcd units to progress.
[1]: https://github.com/charmed-kubernetes/layer-etcd/blob/ae98be0046953ced628f682eee266d0e875a62b0/reactive/etcd.py#L283-L287
[2]: https://github.com/charmed-kubernetes/layer-calico/blob/2287a08ea5c7940bbe9b07be179e1da15b51cba1/reactive/calico.py#L615-L624
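The workaround above can be sketched as a small loop to run on each affected machine. This is purely illustrative: the round count, interval, and use of `pkill -f` are assumptions, not a tested procedure.

```python
import subprocess
import time

def kill_hung_calicoctl(rounds=20, interval=30):
    """Repeatedly kill hung calicoctl processes so Juju can drain its
    hook backlog and eventually let the etcd units progress."""
    for _ in range(rounds):
        # pkill exits non-zero when nothing matched; that's fine here.
        subprocess.run(['pkill', '-f', 'calicoctl'], check=False)
        time.sleep(interval)
```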