I'm running a load test against our observability stack, and every two hours the system comes under high load while Prometheus flushes data to disk.
As a result, Juju has the following log entries:
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "alertmanager": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/alertmanager-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/prometheus-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "grafana": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/grafana-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "loki": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/loki-operator": net/http: TLS handshake timeout
Followed by:
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "alertmanager": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/alertmanager-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/prometheus-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "loki": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/loki-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "grafana": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/grafana-operator": dial tcp 10.152.183.1:443: connect: connection refused
This happens exactly every two hours and seems to be the result of temporary high system load.
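The two-hour period likely lines up with Prometheus's TSDB head compaction, which by default writes the in-memory head block to disk roughly every two hours. As a rough sanity check, the sketch below polls a Prometheus instance's /metrics endpoint and prints a timestamp whenever its compaction counter increases, so the compaction times can be compared against the load spikes and the Juju errors. The pod address is an assumption; point it at wherever the prometheus unit is reachable in your model.

```go
// Minimal sketch: watch prometheus_tsdb_compactions_total and report when a
// compaction has run, to correlate it with the 2-hourly load spikes.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strconv"
	"time"
)

var compactions = regexp.MustCompile(`(?m)^prometheus_tsdb_compactions_total (\S+)`)

// scrape fetches the metrics page and extracts the compaction counter value.
func scrape(url string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	m := compactions.FindSubmatch(body)
	if m == nil {
		return 0, fmt.Errorf("metric not found")
	}
	return strconv.ParseFloat(string(m[1]), 64)
}

func main() {
	const url = "http://10.1.0.10:9090/metrics" // hypothetical pod address, adjust for your deployment
	last := -1.0
	for {
		if v, err := scrape(url); err == nil {
			if last >= 0 && v > last {
				fmt.Printf("%s compaction ran (total=%v)\n", time.Now().Format(time.RFC3339), v)
			}
			last = v
		}
		time.Sleep(time.Minute)
	}
}
```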
It would be handy if Juju included a note in these error messages, e.g.: "Note: this could be because system load is such-and-such".
As the controller charm gains the capability to integrate with our observability stack, this sort of info is probably best surfaced as part of that work.