Controller restart caused sidecar charm k8s workloads to restart

Bug #2036594 reported by Tom Haddon
This bug affects 10 people
Affects           Status    Importance   Assigned to      Milestone
Canonical Juju    Triaged   High         Harry Pidcock
3.1               Triaged   High         Unassigned
3.2               Triaged   High         Unassigned

Bug Description

We recently restarted the controllers to run mgopurge, to try to address some performance issues with them (juju status taking more than 2 minutes on particular models, for instance). Here's what was done (sorry, Canonical internal only): https://pastebin.canonical.com/p/rkH6RNJXgJ/

In doing so, we saw pods in k8s models attached to this cluster get rescheduled. We assume this is because pebble was having problems contacting the controller during the restarts. Here's a charm log from the time of the incident: https://pastebin.canonical.com/p/8JWNMkB8y3/

The controller and model version is juju 2.9.44.

Tags: canonical-is
Tom Haddon (mthaddon)
tags: added: canonical-is
description: updated
Tom Haddon (mthaddon) wrote:

I've been able to reproduce this locally. If I deploy juju 3.1.5 on microk8s and then deploy a sidecar charm into a model (in my case I've been testing with discourse-k8s), I can go from the application working fine to the charm container being restarted by running `/opt/pebble stop jujud` in the api-server container of the controller-0 pod.
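
For anyone else trying to reproduce this, a rough sketch of the steps above as shell commands follows. The controller name ("micro"), model name ("test"), and the namespaces derived from them are placeholders rather than values from the report, so adjust them to match your environment.

# Reproduction sketch: bootstrap a microk8s controller and deploy a sidecar
# charm (discourse-k8s needs its postgresql/redis relations to go active,
# but any sidecar charm should show the same restart behaviour).
juju bootstrap microk8s micro
juju add-model test
juju deploy discourse-k8s

# Stop the controller's jujud service via pebble in the api-server container
# of the controller-0 pod, then watch the discourse-k8s/0 pod get its charm
# container restarted once the pebble liveness check fails 3 times.
microk8s kubectl exec -n controller-micro controller-0 -c api-server -- /opt/pebble stop jujud
microk8s kubectl get pods -n test -w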

Here are the logs from the charm container, from the point I run `/opt/pebble stop jujud` until it's killed:

2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.logger logger.go:136 logger worker stopped
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.uniter uniter.go:338 unit "discourse-k8s/0" shutting down: catacomb 0xc00054e000 is dying
2023-09-19T15:25:51.971Z [pebble] Check "liveness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:25:51.972Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "liveness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:04.589Z [container-agent] 2023-09-19 15:26:04 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure threshold 3 hit, triggering action
2023-09-19T15:26:21.970Z [pebble] Check "readiness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:21.970Z [pebble] Check "liveness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:25.552Z [container-agent] 2023-09-19 15:26:25 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:31.970Z [pebble] Check "liveness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:31.970Z [pebble] Check "readiness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "liveness" failure 6 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "readiness" failure 6 (threshold 3): received non-20x status code 404
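
A hedged way to confirm the same restart from the Kubernetes side is to look at the probe events on the workload pod; the model name "test" and the pod name below are assumptions carried over from the sketch above.

# The check failures logged by pebble above surface as "Unhealthy" probe
# events on the pod, followed by the kubelet restarting the container.
microk8s kubectl -n test describe pod discourse-k8s-0 | grep -A10 Events
microk8s kubectl -n test get events --field-selector involvedObject.name=discourse-k8s-0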

Haw Loeung (hloeung)
Changed in juju:
status: New → Confirmed
Harry Pidcock (hpidcock) wrote (last edit):

I think the correct course of action here is to remove the uniter's influence on the readiness/liveness probes entirely and have it only influence the startup probe.
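
For context on what such a change would touch, the snippet below is one way to see which probes are currently configured on the workload pod's containers; the model name "test" and the pod name are assumptions carried over from the reproduction above.

# Show whatever liveness/readiness/startup probes are currently set on the
# pod's containers.
microk8s kubectl -n test get pod discourse-k8s-0 -o yaml | grep -A6 -E '(liveness|readiness|startup)Probe'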

Changed in juju:
importance: Undecided → High
milestone: none → 2.9.46
status: Confirmed → Triaged
assignee: nobody → Harry Pidcock (hpidcock)
Harry Pidcock (hpidcock) wrote:

The fix for https://bugs.launchpad.net/juju/+bug/2037478 mitigates this somewhat, reducing the importance of this one.

Changed in juju:
importance: High → Medium
Haw Loeung (hloeung) wrote:

What changed in LP:2037478? I see it's linked to https://github.com/juju/juju/pull/16325/files, which doesn't have much in it?

Harry Pidcock (hpidcock) wrote:

LP:2037478 deals specifically with the case where the controller addresses have changed (i.e. model migration, API addresses changing, new HA controller machines, etc.) or something else in agent.conf has changed: if this error (LP:2036594) is triggered in that situation, it causes a failure that requires manual intervention (i.e. deleting the pods or manually updating template-agent.conf).

If we just fix LP:2037478, the worst case is that the charm containers bounce and the pod becomes unhealthy. We're still fixing this bug; it just might happen in a few weeks.
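
For operators hitting the failure before either fix lands, the "delete the pods" intervention mentioned above looks roughly like the following; the model name "test" and the discourse-k8s/0 unit are assumptions, and the StatefulSet will recreate the deleted pod (the other option mentioned is manually updating template-agent.conf).

# Sketch of the manual workaround: delete the stuck pod and let the
# StatefulSet recreate it, then watch it come back.
microk8s kubectl -n test delete pod discourse-k8s-0
microk8s kubectl -n test get pods -w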

John A Meinel (jameinel)
Changed in juju:
importance: Medium → High