[Juju 3.2] stop hook occasionally fails for no apparent reason

Bug #2025411 reported by Leon

This bug report was marked for expiration 304 days ago.

This bug affects 1 person
Affects: Canonical Juju
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

In grafana, we have this[1] on stop:

self.unit.status = MaintenanceStatus("Application is terminating.")

In alertmanager, stop is not even observed.

However, occasionally, integration tests fail[2] with:

INFO juju.model:model.py:2690 Waiting for model:
  alertmanager/0 [idle] error: hook failed: "stop"
  grafana/0 [idle] error: hook failed: "stop"

[1] https://github.com/canonical/grafana-k8s-operator/blob/1c6c586d4ed8bf9b0b95fde24beba2c20da80ce8/src/charm.py#L376

[2] https://github.com/canonical/cos-lite-bundle/actions/runs/5406674418/jobs/9823831379

Harry Pidcock (hpidcock) wrote :

This is probably due to patching the statefulset, which is causing the pods to be recreated.

Joseph Phillips (manadart) wrote :

What should the status be, Harry?

John A Meinel (jameinel) wrote :

So you say:
"In alertmanager, stop is not even observed."
But then you say:
"INFO juju.model:model.py:2690 Waiting for model:
  alertmanager/0 [idle] error: hook failed: "stop""

However, that clearly indicates that stop was observed, as you got an error in the hook event.

Plausibly (given the logs that you linked to) you meant to say:
"In traefik, stop is not even observed."

I can't tell, because I do see the statuses getting to:
"
INFO juju.model:model.py:2690 Waiting for model:
  grafana/0 [idle] error: hook failed: "stop"
  loki/0 [idle] error: hook failed: "stop"
  traefik/0 [idle] active:
INFO juju.model:model.py:2690 Waiting for model:
  grafana/0 [idle] error: hook failed: "stop"
  loki/0 [idle] error: hook failed: "stop"
INFO juju.model:model.py:2690 Waiting for model:
  grafana/0 [idle] error: hook failed: "stop"
  loki/0 [idle] error: hook failed: "stop"
"

That seems to say that traefik did its thing, was then happy, and the problem is only that both grafana and loki are in an error state.

Now, I don't know why those are in an error state. From what you linked in the charm, there doesn't seem to be much to go wrong (all it's doing is setting the unit status).

However, there is a *lot* of other code that gets executed while running that stop hook, which could be failing. For example:

class GrafanaCharm(CharmBase):
...
    def __init__(self, *args):
...
        self.containers = {
            "workload": self.unit.get_container(self.name),
            "replication": self.unit.get_container("litestream"),
        }
^- is there something problematic while trying to grab containers while tearing down?
...
        self.metrics_endpoint = MetricsEndpointProvider(
            charm=self,
            jobs=self._scrape_jobs,
            refresh_event=[
                self.on.grafana_pebble_ready, # pyright: ignore
                self.on.update_status,
            ],
        )
^- is MetricsEndpointProvider running into anything? (I'm guessing you're passing in the events that you want it to refresh on, but since you're passing in 'self' here, it could be doing lots of things in '__init__', possibly even registering an on.stop handler.)

What we'd really need is to see more of why the hooks themselves failed, which doesn't seem to be exposed by the CI suite.
All it says is "I didn't become idle", with no recursion into "here's the thing that didn't become idle, and what rationale it has for not being happy".

That might be something that should be addressed in python-libjuju's `wait_for_idle`, though that is really a very heavy lift for such a simple function.
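
As a stopgap on the CI side, the "Waiting for model" lines quoted above can at least be reduced to a list of which units failed on which hook. The following is a minimal sketch, assuming the log format is exactly as it appears in this report; `failed_hooks` and `SAMPLE` are hypothetical names, not part of python-libjuju. It does not explain why a hook failed, it only narrows down where to look.

```python
import re
from typing import Dict

# Matches lines such as:   grafana/0 [idle] error: hook failed: "stop"
# (format assumed from the log excerpts quoted in this report)
_HOOK_FAILED = re.compile(
    r'^\s*(?P<unit>[a-z0-9-]+/\d+) \[(?P<agent>\w+)\] '
    r'error: hook failed: "(?P<hook>[^"]+)"'
)


def failed_hooks(log_text: str) -> Dict[str, str]:
    """Return {unit name: failed hook}, keeping the last report per unit."""
    failures: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = _HOOK_FAILED.match(line)
        if match:
            failures[match.group("unit")] = match.group("hook")
    return failures


SAMPLE = '''INFO juju.model:model.py:2690 Waiting for model:
  grafana/0 [idle] error: hook failed: "stop"
  loki/0 [idle] error: hook failed: "stop"
  traefik/0 [idle] active:
'''

if __name__ == "__main__":
    # Units that are merely "active" (traefik/0 here) are not reported.
    for unit, hook in failed_hooks(SAMPLE).items():
        print(f"{unit}: hook {hook!r} failed")
```

A CI job could run something like this over the captured test output and then collect the debug log for just the listed units.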

Changed in juju:
status: New → Incomplete
Leon (sed-i) wrote :

Thanks for looking into this John!

You're right that even if there isn't an explicit .observe(), the charm is still re-init'ed on stop.
But iirc, there wasn't anything in the debug log at all.
When we run into relation permission errors etc., we get a traceback. Not here.

This may be a variation of the issue we had in the past, where random hooks would fail on slow GitHub runners. That issue seems to be resurfacing now: we encountered it recently with 3.1.
https://github.com/canonical/cos-lite-bundle/actions/runs/6384705155/job/17332093093
But again, I'm not at all sure those failures are related to the topic of this issue.
