I'm running a load test against our observability stack, and every two hours the system comes under high load while Prometheus flushes data to disk.
As a result, Juju has the following log entries:
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "alertmanager": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/alertmanager-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/prometheus-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "grafana": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/grafana-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "loki": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/loki-operator": net/http: TLS handshake timeout
Followed by:
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "alertmanager": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/alertmanager-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/prometheus-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "loki": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/loki-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "grafana": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/grafana-operator": dial tcp 10.152.183.1:443: connect: connection refused
This happens exactly every two hours and seems to be the result of temporary high system load.
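The two-hour period likely lines up with Prometheus's TSDB head compaction, which by default writes the in-memory head block to disk roughly every two hours. As a rough sanity check, the sketch below polls a Prometheus instance's /metrics endpoint and prints a timestamp whenever its compaction counter increases, so the compaction times can be compared against the load spikes and the Juju errors. The pod address is an assumption; point it at wherever the prometheus unit is reachable in your model.

```go
// Minimal sketch: watch prometheus_tsdb_compactions_total and report when a
// compaction has run, to correlate it with the 2-hourly load spikes.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strconv"
	"time"
)

var compactions = regexp.MustCompile(`(?m)^prometheus_tsdb_compactions_total (\S+)`)

// scrape fetches the metrics page and extracts the compaction counter value.
func scrape(url string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	m := compactions.FindSubmatch(body)
	if m == nil {
		return 0, fmt.Errorf("metric not found")
	}
	return strconv.ParseFloat(string(m[1]), 64)
}

func main() {
	const url = "http://10.1.0.10:9090/metrics" // hypothetical pod address, adjust for your deployment
	last := -1.0
	for {
		if v, err := scrape(url); err == nil {
			if last >= 0 && v > last {
				fmt.Printf("%s compaction ran (total=%v)\n", time.Now().Format(time.RFC3339), v)
			}
			last = v
		}
		time.Sleep(time.Minute)
	}
}
```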
It would be handy if Juju included a note in these error messages, e.g.: "Note: this could be because system load is such-and-such".
As the controller charm gains the capability to integrate with our observability stack, this sort of info is probably best surfaced as part of that work.