error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)

Bug #1987331 reported by Chris Johnston
This bug affects 2 people
Affects                         Status    Importance  Assigned to     Milestone
Kubernetes Control Plane Charm  Triaged   High        George Kraft
Kubernetes Worker Charm         Triaged   High        George Kraft

Bug Description

When new revisions of the kubectl snap are released, we see this error in the kubernetes-control-plane (k-c-p) logs, and the unit goes into an error state:

2022-06-19 23:57:18 INFO unit.kubernetes-master/5.juju-log server.go:327 coordinator:11: status-set: maintenance: Joining snap cohort.
2022-06-19 23:57:18 WARNING unit.kubernetes-master/5.coordinator-relation-changed logger.go:60 error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)
2022-06-19 23:57:18 ERROR unit.kubernetes-master/5.juju-log server.go:327 coordinator:11: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/charm/reactive/kubernetes_master.py", line 469, in join_or_update_cohorts
    snap.join_cohort_snapshot(snapname, cohort_key)
  File "lib/charms/layer/snap.py", line 455, in join_cohort_snapshot
    subprocess.check_output(["snap", "refresh", snapname, "--cohort", cohort_key])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['snap', 'refresh', 'kubectl', '--cohort', 'REDACTED']' returned non-zero exit status 1.

I believe this is similar to LP#1978005 [1] and LP#1975714 [2]. The fix is currently in snapd 2.57, which is in the candidate channel. Once that is released, as I understand it, the kubectl snap would need to be changed to set `refresh-mode: ignore-running` in order for refreshes to be processed [3][4].

[1] https://bugs.launchpad.net/snapd/+bug/1978005
[2] https://bugs.launchpad.net/snapd/+bug/1975714
[3] https://github.com/snapcore/snapd/pull/11855/files
[4] https://chat.canonical.com/canonical/pl/47m58wxjt3bzfehpedogg7dwio

George Kraft (cynerva) wrote :

I'm able to repro by running a snap refresh to a different revision while running kubectl in the background.

The documentation around `refresh-mode: ignore-running` suggests that it's for daemons[1], but I'll give it a try and see if it works for non-daemon apps too.

[1]: https://snapcraft.io/docs/services-and-daemons
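
For anyone else reproducing this, here is a rough sketch of that repro in Python (the background kubectl invocation and the target channel are placeholders; any long-lived kubectl process plus a refresh to any other revision or channel should trigger the same error):

import subprocess
import time

# Hold a kubectl process open in the background (any long-running
# invocation works; `--watch` assumes a reachable cluster).
watcher = subprocess.Popen(
    ["kubectl", "get", "pods", "--watch"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
time.sleep(2)

# While that kubectl is still running, ask snapd to refresh the snap.
# Expected: exit status 1 and
#   error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)
result = subprocess.run(
    ["snap", "refresh", "kubectl", "--channel", "1.24/stable"],  # placeholder channel
    capture_output=True,
    text=True,
)
print(result.returncode, result.stderr, end="")

watcher.terminate()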

Changed in charm-kubernetes-master:
status: New → Confirmed
George Kraft (cynerva) wrote :

Hmm, nope, can't apply the refresh-mode option to a non-daemon command:

$ snapcraft
Issues while validating snapcraft.yaml: The 'apps/kubectl' property does not match the required schema: 'daemon' is a dependency of 'refresh-mode'

I'll see if I can find another way.

George Kraft (cynerva) wrote :

I haven't had any luck finding a way to fix this in snapcraft.yaml and I haven't received a response from the snap team.

The snap refresh command has an undocumented --ignore-running option that seems to do the trick, so we could at least fix this in the charms by utilizing that.
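
For illustration, a minimal sketch (not the actual charm patch) of how the refresh call in lib/charms/layer/snap.py's join_cohort_snapshot could pass that option; the function name and arguments follow the traceback above:

import subprocess


def join_cohort_snapshot(snapname, cohort_key, ignore_running=True):
    """Refresh a snap into a cohort, optionally ignoring running apps."""
    cmd = ["snap", "refresh", snapname, "--cohort", cohort_key]
    if ignore_running:
        # Ask snapd to proceed even if the snap has running apps,
        # e.g. a long-lived kubectl process.
        cmd.append("--ignore-running")
    return subprocess.check_output(cmd)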

Changed in charm-kubernetes-master:
status: Confirmed → Triaged
Changed in charm-kubernetes-worker:
status: New → Triaged
Changed in charm-kubernetes-master:
importance: Undecided → High
Changed in charm-kubernetes-worker:
importance: Undecided → High
George Kraft (cynerva) wrote :

I'm targeting this for 1.29 initially. We are in the middle of a complete rewrite of the kubernetes-control-plane and kubernetes-worker charms, so a backport to 1.28 will be nontrivial. If you do need it prior to 1.29, let me know.

George Kraft (cynerva)
Changed in charm-kubernetes-master:
milestone: none → 1.29
Changed in charm-kubernetes-worker:
milestone: none → 1.29
Changed in charm-kubernetes-master:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-worker:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: Triaged → In Progress
Changed in charm-kubernetes-worker:
status: Triaged → In Progress
George Kraft (cynerva) wrote :
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-kubernetes-worker:
status: In Progress → Fix Committed
tags: added: backport-needed
George Kraft (cynerva) wrote :

I've reopened this. While the existing PR should help, recent evidence suggests that kubectl pids are being leaked by the auth-webhook when it fetches token secrets[1]. I strongly suspect that the run function's timeout mechanism[2] is leaving behind stale processes.

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/9ca52889800937509bd0065d285c6646e04cb745/templates/cdk.master.auth-webhook.py#L347-L349
[2]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/9ca52889800937509bd0065d285c6646e04cb745/templates/cdk.master.auth-webhook.py#L52-L61
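
To illustrate the suspicion (this is not the webhook's actual run helper), a timeout wrapper around a kubectl call will leak the child process if the timeout path never kills it:

import subprocess


def run_leaky(cmd, timeout=5):
    """Timeout wrapper that leaks the child: on timeout the kubectl
    process keeps running and later blocks `snap refresh kubectl`."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        return None  # BUG: no proc.kill(), so the pid is left behind


def run_fixed(cmd, timeout=5):
    """Same wrapper, but the child is killed and reaped on timeout."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()         # stop the stale kubectl process
        proc.communicate()  # drain pipes and reap the exit status
        return None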

Changed in charm-kubernetes-master:
status: Fix Committed → Triaged
Changed in charm-kubernetes-worker:
status: Fix Committed → Triaged