Unit in unknown status - Too little info about what went wrong

Bug #1993201 reported by Jose C. Massón
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

If you deploy two prometheus[1] units plus one extra unit of another charm, one of those units ends up in an unknown state, with too little information about what went wrong.

To Reproduce:

- Deploy one prometheus unit named "external-prometheus":

juju deploy ./*.charm external-prometheus --resource prometheus-image=ubuntu/prometheus:2.33-22.04_beta --trust

- Deploy two prometheus units:

juju deploy ./*.charm prometheus -n 2 --resource prometheus-image=ubuntu/prometheus:2.33-22.04_beta --trust

And I get the following status:

$ juju status --color --relations
Model Controller Cloud/Region Version SLA Timestamp
cos-lite charm-dev microk8s/localhost 2.9.35 unsupported 17:19:56-03:00

App Version Status Scale Charm Channel Rev Address Exposed Message
external-prometheus 2.33.5 active 1 prometheus-k8s 7 10.152.183.143 no
prometheus 2.33.5 waiting 1/2 prometheus-k8s 8 10.152.183.35 no installing agent

Unit Workload Agent Address Ports Message
external-prometheus/0* active idle 10.1.207.140
prometheus/0* unknown lost agent lost, see 'juju show-status-log prometheus/0'
prometheus/1 active idle 10.1.207.141

Relation provider Requirer Interface Type Message
external-prometheus:prometheus-peers external-prometheus:prometheus-peers prometheus_peers peer
prometheus:prometheus-peers prometheus:prometheus-peers prometheus_peers peer

juju debug-log:

unit-prometheus-0: 17:19:04.625 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-prometheus-0: 17:19:04.647 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-prometheus-0
unit-prometheus-0: 17:19:04.663 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-prometheus-0: 17:19:04.663 INFO juju.worker.caasupgrader unblocking abort check
unit-prometheus-0: 17:19:04.749 INFO juju.worker.uniter unit "prometheus/0" started
unit-prometheus-0: 17:19:04.784 INFO juju.worker.uniter hooks are retried true
unit-prometheus-0: 17:19:04.963 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-0: 17:19:05.537 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:05.569 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook
unit-prometheus-0: 17:19:06.551 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:07.472 INFO unit.prometheus/1.juju-log Kubernetes resources for app 'prometheus', container 'prometheus' patched successfully: ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'})
unit-prometheus-1: 17:19:07.538 INFO unit.prometheus/1.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'})
unit-prometheus-1: 17:19:07.785 INFO unit.prometheus/1.juju-log Pushed new configuration
unit-prometheus-1: 17:19:09.183 INFO unit.prometheus/1.juju-log Prometheus (re)started
unit-prometheus-1: 17:19:09.687 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:09.752 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:09.775 INFO juju.worker.uniter found queued "start" hook
unit-prometheus-0: 17:19:10.006 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:10.903 INFO unit.prometheus/1.juju-log Running legacy hooks/start.
unit-prometheus-0: 17:19:11.308 INFO unit.prometheus/0.juju-log Running legacy hooks/start.
unit-prometheus-1: 17:19:12.415 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:14.023 INFO juju.worker.uniter.operation ran "leader-settings-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:14.767 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:15.343 INFO unit.prometheus/1.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'})
unit-prometheus-1: 17:19:16.238 INFO juju.worker.uniter.operation ran "prometheus-pebble-ready" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:18.068 INFO juju.worker.uniter.operation ran "prometheus-peers-relation-joined" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:18.517 INFO unit.prometheus/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-0: 17:19:20.051 INFO juju.worker.uniter.operation ran "prometheus-pebble-ready" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:20.398 INFO juju.worker.uniter.operation ran "prometheus-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:22.630 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-prometheus-0: 17:19:22.840 ERROR juju.worker.uniter.operation hook "prometheus-peers-relation-joined" (via hook dispatching script: dispatch) failed: signal: terminated
unit-prometheus-0: 17:19:22.851 INFO juju.worker.uniter awaiting error resolution for "relation-joined" hook
unit-prometheus-0: 17:19:23.190 INFO juju.worker.uniter awaiting error resolution for "relation-joined" hook
unit-prometheus-1: 17:20:05.429 INFO juju.worker.leadership prometheus/1 promoted to leadership of prometheus
unit-prometheus-1: 17:20:05.464 INFO juju.worker.uniter found queued "leader-elected" hook
unit-prometheus-1: 17:20:07.214 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)

And the status log:

$ juju show-status-log prometheus/0
Time Type Status Message
14 Oct 2022 17:18:05-03:00 juju-unit executing running prometheus-peers-relation-created hook
14 Oct 2022 17:18:17-03:00 juju-unit error hook failed: "prometheus-peers-relation-created"
14 Oct 2022 17:18:22-03:00 juju-unit executing running prometheus-peers-relation-created hook
14 Oct 2022 17:18:24-03:00 juju-unit executing running leader-elected hook
14 Oct 2022 17:18:29-03:00 juju-unit executing running prometheus-pebble-ready hook
14 Oct 2022 17:18:34-03:00 juju-unit executing running database-storage-attached hook
14 Oct 2022 17:18:37-03:00 juju-unit executing running config-changed hook
14 Oct 2022 17:18:43-03:00 juju-unit executing running start hook
14 Oct 2022 17:18:46-03:00 juju-unit error hook failed: "start"
14 Oct 2022 17:18:55-03:00 juju-unit error crash loop backoff: back-off 10s restarting failed container=charm pod=prometheus-0_cos-lite(5e1c5f13-d4f4-47d1-ada1-84bba1b25e79)
14 Oct 2022 17:18:55-03:00 workload maintenance installing charm software
14 Oct 2022 17:19:04-03:00 juju-unit error hook failed: "start"
14 Oct 2022 17:19:10-03:00 juju-unit executing running start hook
14 Oct 2022 17:19:14-03:00 workload unknown
14 Oct 2022 17:19:15-03:00 juju-unit executing running prometheus-pebble-ready hook
14 Oct 2022 17:19:18-03:00 workload waiting Waiting for resource limit patch to apply
14 Oct 2022 17:19:20-03:00 juju-unit executing running prometheus-peers-relation-joined hook for prometheus/1
14 Oct 2022 17:19:22-03:00 juju-unit error hook failed: "prometheus-peers-relation-joined"
14 Oct 2022 17:19:36-03:00 juju-unit idle
14 Oct 2022 17:19:36-03:00 workload blocked 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

And K8s describe pod:

$ microk8s.kubectl -n cos-lite describe pods/prometheus-0
Name: prometheus-0
Namespace: cos-lite
Priority: 0
Service Account: prometheus
Node: <none>
Labels: app.kubernetes.io/name=prometheus
                  controller-revision-hash=prometheus-545b757b9c
                  statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: controller.juju.is/id: cb85ccce-0d1e-4572-830b-39679a25ed79
                  juju.is/version: 2.9.35
                  model.juju.is/id: 6db7bf20-240f-4733-8ed1-896b62f463c2
                  unit.juju.is/id: prometheus/0
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/prometheus
Init Containers:
  charm-init:
    Image: jujusolutions/jujud-operator:2.9.35
    Port: <none>
    Host Port: <none>
    Command:
      /opt/containeragent
    Args:
      init
      --containeragent-pebble-dir
      /containeragent/pebble
      --charm-modified-version
      0
      --data-dir
      /var/lib/juju
      --bin-dir
      /charm/bin
    Environment Variables from:
      prometheus-application-config Secret Optional: false
    Environment:
      JUJU_CONTAINER_NAMES: prometheus
      JUJU_K8S_POD_NAME: prometheus-0 (v1:metadata.name)
      JUJU_K8S_POD_UUID: (v1:metadata.uid)
    Mounts:
      /charm/bin from charm-data (rw,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /containeragent/pebble from charm-data (rw,path="containeragent/pebble")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w8j4p (ro)
Containers:
  charm:
    Image: jujusolutions/charm-base:ubuntu-20.04
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --http
      :38812
      --verbose
    Liveness: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness: http-get http://:38812/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAMES: prometheus
      HTTP_PROBE_PORT: 3856
    Mounts:
      /charm/bin from charm-data (ro,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/lib/juju/storage/database/0 from prometheus-database-ff2c93ce (rw)
      /var/lib/pebble/default from charm-data (rw,path="containeragent/pebble")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w8j4p (ro)
  prometheus:
    Image: ubuntu/prometheus:2.33-22.04_beta
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    Requests:
      cpu: 250m
      memory: 200Mi
    Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAME: prometheus
      PEBBLE_SOCKET: /charm/container/pebble.socket
    Mounts:
      /charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
      /charm/container from charm-data (rw,path="charm/containers/prometheus")
      /var/lib/prometheus from prometheus-database-ff2c93ce (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w8j4p (ro)
Conditions:
  Type Status
  PodScheduled False
Volumes:
  prometheus-database-ff2c93ce:
    Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName: prometheus-database-ff2c93ce-prometheus-0
    ReadOnly: false
  charm-data:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit: <unset>
  kube-api-access-w8j4p:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 5s default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning FailedScheduling 4s default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
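The FailedScheduling events show the real cause: the node's remaining allocatable memory is smaller than the pod's 200Mi request. The fit check the scheduler performs can be sketched as follows (an illustrative Python sketch, not the canonical Kubernetes quantity parser; suffix handling covers only the common cases seen in this report):

```python
def parse_quantity(q: str) -> float:
    """Parse a Kubernetes resource quantity (e.g. '200Mi', '250m', '2')
    into base units. Illustrative only -- not the canonical k8s parser."""
    suffixes = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
                "m": 1e-3, "k": 1e3, "M": 1e6, "G": 1e9}
    for suffix, mult in suffixes.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * mult
    return float(q)


def fits(allocatable: str, already_requested: str, incoming: str) -> bool:
    """The scheduler's basic fit check for one resource: does the incoming
    pod's request still fit into the node's allocatable capacity?"""
    return (parse_quantity(already_requested) + parse_quantity(incoming)
            <= parse_quantity(allocatable))
```

For example, with a 2Gi node that already has 1900Mi requested, an incoming 200Mi request does not fit, and the pod stays Pending with exactly the message seen above.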

More info about this issue can be found here: https://github.com/canonical/prometheus-k8s-operator/issues/389

[1] https://github.com/canonical/prometheus-k8s-operator

Revision history for this message
Juan M. Tirado (tiradojm) wrote :

I will set this bug to Invalid because I see that the parallel discussion on GitHub [1] has been closed.

[1] https://github.com/canonical/prometheus-k8s-operator/issues/389

Changed in juju:
status: New → Invalid
Jose C. Massón (jose-masson) wrote :

Hi Juan,

I closed the bug in GitHub since it is not a Prometheus bug but a Juju one:

"There is no issue with the prometheus charm itself; it is Juju providing too little information about what went wrong. Issue here."

Changed in juju:
status: Invalid → New
Ian Booth (wallyworld) wrote :

We should reasonably be able to surface any underlying pod scheduling error in juju status for the unit.
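Surfacing the scheduling error could look roughly like this (a hypothetical Python sketch, not Juju's actual implementation, which is written in Go; the event shape assumed here follows the items of `kubectl get events -o json`): pick the newest FailedScheduling warning on the pending pod and use its message as the unit's status message.

```python
from typing import Optional


def scheduling_error(events: list) -> Optional[str]:
    """Return the message of the newest FailedScheduling warning, if any.

    `events` is assumed to be a list of dicts with the keys 'type',
    'reason', 'message', and 'lastTimestamp', as produced by
    `kubectl get events -o json`. Sketch only, not Juju's actual code.
    """
    warnings = [e for e in events
                if e.get("type") == "Warning"
                and e.get("reason") == "FailedScheduling"]
    if not warnings:
        return None
    # ISO-8601 timestamps sort lexicographically, so max() picks the newest.
    return max(warnings, key=lambda e: e["lastTimestamp"])["message"]
```

With the events from this report, the unit's status message would become "0/1 nodes are available: 1 Insufficient memory. ..." instead of a bare "unknown".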

Changed in juju:
milestone: none → 2.9.36
importance: Undecided → High
status: New → Triaged
milestone: 2.9.36 → 2.9.37
Changed in juju:
milestone: 2.9.37 → 2.9.38
Changed in juju:
milestone: 2.9.38 → 2.9.39
Changed in juju:
milestone: 2.9.39 → 2.9.40
Changed in juju:
milestone: 2.9.40 → 2.9.41
Changed in juju:
milestone: 2.9.41 → 2.9.42
Changed in juju:
milestone: 2.9.42 → 2.9.43
Changed in juju:
milestone: 2.9.43 → 2.9.44
Changed in juju:
milestone: 2.9.44 → 2.9.45
Changed in juju:
milestone: 2.9.45 → 2.9.46