error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)

Bug #1987331 reported by Chris Johnston
This bug affects 2 people
Affects                         Status    Importance  Assigned to     Milestone
Kubernetes Control Plane Charm  Triaged   High        George Kraft
Kubernetes Worker Charm         Triaged   High        George Kraft

Bug Description

When new revisions of the kubectl snap are released, we see this error in the kubernetes-control-plane (k-c-p) logs, and the unit goes into an error state:

2022-06-19 23:57:18 INFO unit.kubernetes-master/5.juju-log server.go:327 coordinator:11: status-set: maintenance: Joining snap cohort.
2022-06-19 23:57:18 WARNING unit.kubernetes-master/5.coordinator-relation-changed logger.go:60 error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)
2022-06-19 23:57:18 ERROR unit.kubernetes-master/5.juju-log server.go:327 coordinator:11: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/charm/reactive/kubernetes_master.py", line 469, in join_or_update_cohorts
    snap.join_cohort_snapshot(snapname, cohort_key)
  File "lib/charms/layer/snap.py", line 455, in join_cohort_snapshot
    subprocess.check_output(["snap", "refresh", snapname, "--cohort", cohort_key])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['snap', 'refresh', 'kubectl', '--cohort', 'REDACTED']' returned non-zero exit status 1.

I believe this is similar to LP#1978005 [1] and LP#1975714 [2]. The fix is currently in snapd 2.57, which is in the candidate channel. Once that is released, as I understand it, the kubectl snap would need to be changed to set `refresh-mode: ignore-running` in order for refreshes to be processed [3][4].

[1] https://bugs.launchpad.net/snapd/+bug/1978005
[2] https://bugs.launchpad.net/snapd/+bug/1975714
[3] https://github.com/snapcore/snapd/pull/11855/files
[4] https://chat.canonical.com/canonical/pl/47m58wxjt3bzfehpedogg7dwio

George Kraft (cynerva) wrote :

I'm able to repro by running a snap refresh to a different revision while running kubectl in the background.

The documentation around `refresh-mode: ignore-running` suggests that it's for daemons[1], but I'll give it a try and see if it works for non-daemon apps too.

[1]: https://snapcraft.io/docs/services-and-daemons
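
For anyone else reproducing this, here is a rough sketch of that repro in Python (the background kubectl invocation and the target channel are placeholders; any long-lived kubectl process plus a refresh to any other revision or channel should trigger the same error):

import subprocess
import time

# Hold a kubectl process open in the background (any long-running
# invocation works; `--watch` assumes a reachable cluster).
watcher = subprocess.Popen(
    ["kubectl", "get", "pods", "--watch"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
time.sleep(2)

# While that kubectl is still running, ask snapd to refresh the snap.
# Expected: exit status 1 and
#   error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)
result = subprocess.run(
    ["snap", "refresh", "kubectl", "--channel", "1.24/stable"],  # placeholder channel
    capture_output=True,
    text=True,
)
print(result.returncode, result.stderr, end="")

watcher.terminate()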

Changed in charm-kubernetes-master:
status: New → Confirmed
George Kraft (cynerva) wrote :

Hmm, nope, can't apply the refresh-mode option to a non-daemon command:

$ snapcraft
Issues while validating snapcraft.yaml: The 'apps/kubectl' property does not match the required schema: 'daemon' is a dependency of 'refresh-mode'

I'll see if I can find another way.

George Kraft (cynerva) wrote :

I haven't had any luck finding a way to fix this in snapcraft.yaml and I haven't received a response from the snap team.

The snap refresh command has an undocumented --ignore-running option that seems to do the trick, so we could at least fix this in the charms by utilizing that.
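
For illustration, a minimal sketch (not the actual charm patch) of how the refresh call in lib/charms/layer/snap.py's join_cohort_snapshot could pass that option; the function name and arguments follow the traceback above:

import subprocess


def join_cohort_snapshot(snapname, cohort_key, ignore_running=True):
    """Refresh a snap into a cohort, optionally ignoring running apps."""
    cmd = ["snap", "refresh", snapname, "--cohort", cohort_key]
    if ignore_running:
        # Ask snapd to proceed even if the snap has running apps,
        # e.g. a long-lived kubectl process.
        cmd.append("--ignore-running")
    return subprocess.check_output(cmd)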

Changed in charm-kubernetes-master:
status: Confirmed → Triaged
Changed in charm-kubernetes-worker:
status: New → Triaged
Changed in charm-kubernetes-master:
importance: Undecided → High
Changed in charm-kubernetes-worker:
importance: Undecided → High
George Kraft (cynerva) wrote :

I'm targeting this for 1.29 initially. We are in the middle of a complete rewrite of the kubernetes-control-plane and kubernetes-worker charms, so a backport to 1.28 will be nontrivial. If you do need it prior to 1.29, let me know.

George Kraft (cynerva)
Changed in charm-kubernetes-master:
milestone: none → 1.29
Changed in charm-kubernetes-worker:
milestone: none → 1.29
Changed in charm-kubernetes-master:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-worker:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: Triaged → In Progress
Changed in charm-kubernetes-worker:
status: Triaged → In Progress
George Kraft (cynerva) wrote :
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-kubernetes-worker:
status: In Progress → Fix Committed
tags: added: backport-needed
George Kraft (cynerva) wrote :

I've reopened this. While the existing PR should help, recent evidence suggests that kubectl pids are being leaked by the auth-webhook when it fetches token secrets[1]. I strongly suspect that the run function's timeout mechanism[2] is leaving behind stale processes.

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/9ca52889800937509bd0065d285c6646e04cb745/templates/cdk.master.auth-webhook.py#L347-L349
[2]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/9ca52889800937509bd0065d285c6646e04cb745/templates/cdk.master.auth-webhook.py#L52-L61
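
To illustrate the suspicion (this is not the webhook's actual run helper), a timeout wrapper around a kubectl call will leak the child process if the timeout path never kills it:

import subprocess


def run_leaky(cmd, timeout=5):
    """Timeout wrapper that leaks the child: on timeout the kubectl
    process keeps running and later blocks `snap refresh kubectl`."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        return None  # BUG: no proc.kill(), so the pid is left behind


def run_fixed(cmd, timeout=5):
    """Same wrapper, but the child is killed and reaped on timeout."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()         # stop the stale kubectl process
        proc.communicate()  # drain pipes and reap the exit status
        return None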

Changed in charm-kubernetes-master:
status: Fix Committed → Triaged
Changed in charm-kubernetes-worker:
status: Fix Committed → Triaged