Bug #2036594 “controller restart meant sidecar charm k8s workloa...” : Series 3.1 : Bugs : Canonical Juju

Tom Haddon (mthaddon) on 2023-09-19

tags:	added: canonical-is
description:	updated

Tom Haddon (mthaddon) wrote on 2023-09-19:

#1

I've been able to reproduce this locally. If I deploy juju 3.1.5 on microk8s and then deploy a sidecar charm into a model (in my case I've been testing with discourse-k8s) I'm able to go from the application working fine to the charm container being restarted by running `/opt/pebble stop jujud` in the api-server container of the controller-0 pod.

Here are the logs from the charm container before it's killed from the point I run `/opt/pebble stop jujud`:

2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.logger logger.go:136 logger worker stopped
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.uniter uniter.go:338 unit "discourse-k8s/0" shutting down: catacomb 0xc00054e000 is dying
2023-09-19T15:25:51.971Z [pebble] Check "liveness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:25:51.972Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "liveness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:04.589Z [container-agent] 2023-09-19 15:26:04 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure threshold 3 hit, triggering action
2023-09-19T15:26:21.970Z [pebble] Check "readiness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:21.970Z [pebble] Check "liveness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:25.552Z [container-agent] 2023-09-19 15:26:25 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:31.970Z [pebble] Check "liveness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:31.970Z [pebble] Check "readiness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "liveness" failure 6 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "readiness" failure 6 (threshold 3): received non-20x status code 404
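
To make the timing in the log above easier to see, here is a minimal Python sketch that measures how long after the API connection broke a given Pebble check reached its failure threshold. The function names (`parse_ts`, `seconds_until_threshold`) are made up for illustration, and the regular expression assumes only the line format visible in the excerpt, not any documented Pebble log format.

```python
import re
from datetime import datetime

# Excerpt of the log lines quoted above (liveness check only).
LOG = """\
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2023-09-19T15:25:51.971Z [pebble] Check "liveness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "liveness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure threshold 3 hit, triggering action
"""

# Matches lines such as: <ts> [pebble] Check "liveness" failure 2 (threshold 3): ...
FAILURE = re.compile(r'^(\S+) \[pebble\] Check "(\w+)" failure (\d+) \(threshold (\d+)\)')


def parse_ts(ts):
    """Parse the leading UTC timestamp of a log line."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def seconds_until_threshold(log, check):
    """Seconds between the broken API connection and the first failure
    of `check` whose count reaches its threshold (None if never reached)."""
    start = None
    for line in log.splitlines():
        if start is None and "api connection broken" in line:
            start = parse_ts(line.split()[0])
            continue
        m = FAILURE.match(line)
        if start and m and m.group(2) == check and int(m.group(3)) >= int(m.group(4)):
            return (parse_ts(m.group(1)) - start).total_seconds()
    return None


print(seconds_until_threshold(LOG, "liveness"))
```

With the excerpt above, the liveness check reaches its threshold roughly 26.5 seconds after the connection is lost, which is consistent with three failures about 10 seconds apart.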

Haw Loeung (hloeung) on 2023-09-19

Changed in juju:
status:	New → Confirmed

Harry Pidcock (hpidcock) wrote on 2023-09-19 (last edit on 2023-09-19):

#2

I think the correct course of action here is to remove the uniter's influence on the readiness/liveness probes and have it influence only the startup probe.
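
The proposal can be sketched roughly as follows. This is an illustrative Python model only, not Juju's actual code: the function names and the single `uniter_running` flag are hypothetical, and the "observed" behaviour is inferred from the failing liveness/readiness checks in the logs in comment #1.

```python
def probe_ok_observed(kind: str, uniter_running: bool) -> bool:
    """Behaviour suggested by the logs: every probe fails once the uniter stops."""
    return uniter_running


def probe_ok_proposed(kind: str, uniter_running: bool) -> bool:
    """Proposed behaviour: only the startup probe depends on the uniter."""
    if kind == "startup":
        return uniter_running
    # Liveness and readiness no longer depend on the uniter, so losing the
    # controller connection would not by itself fail these probes.
    return True


# Simulate a controller outage: the uniter has shut down.
for kind in ("startup", "liveness", "readiness"):
    print(kind, probe_ok_observed(kind, False), probe_ok_proposed(kind, False))
```

Under this model, an outage after startup leaves liveness and readiness passing, so the failure threshold seen in the logs would not be reached just because the controller restarted.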

Changed in juju:
importance:	Undecided → High
milestone:	none → 2.9.46
status:	Confirmed → Triaged
assignee:	nobody → Harry Pidcock (hpidcock)

Harry Pidcock (hpidcock) wrote on 2023-09-27:

#3

Fix for https://bugs.launchpad.net/juju/+bug/2037478 mitigates this somewhat, reducing the importance of this one.

Changed in juju:
importance:	High → Medium

Haw Loeung (hloeung) wrote on 2023-09-27:

#4

What's changed in LP:2037478? I see it's linked to https://github.com/juju/juju/pull/16325/files which doesn't have much?

Harry Pidcock (hpidcock) wrote on 2023-09-27:

#5

LP:2037478 deals specifically with the case where the controller addresses have changed (e.g. model migration, API addresses changing, new HA controller machines) or something else in agent.conf has changed: if this error (LP:2036594) is then triggered, it causes a failure that requires manual intervention (i.e. deleting the pods or manually updating template-agent.conf).

If we just fix LP:2037478, the worst case is that the charm containers bounce and the pod becomes unhealthy. We are still going to fix this bug; it just might happen in a few weeks.

John A Meinel (jameinel) on 2023-10-05

Changed in juju:
importance:	Medium → High

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Triaged	High	Harry Pidcock	Canonical Juju 2.9.46
3.1	Triaged	High	Unassigned	Canonical Juju 3.1.7
3.2	Triaged	High	Unassigned	Canonical Juju 3.2.4

controller restart meant sidecar charm k8s workloads restarts
