Unit in unknown status - Too little info about what went wrong

Bug #1993201 reported by Jose C. Massón
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

If you deploy two prometheus[1] units plus one extra unit of another charm, one of those units ends up in an unknown state, with too little information about what went wrong.

To Reproduce:

- Deploy one prometheus unit named "external-prometheus":

juju deploy ./*.charm external-prometheus --resource prometheus-image=ubuntu/prometheus:2.33-22.04_beta --trust

- Deploy two prometheus units:

juju deploy ./*.charm prometheus -n 2 --resource prometheus-image=ubuntu/prometheus:2.33-22.04_beta --trust

And I get the following status:

$ juju status --color --relations
Model Controller Cloud/Region Version SLA Timestamp
cos-lite charm-dev microk8s/localhost 2.9.35 unsupported 17:19:56-03:00

App Version Status Scale Charm Channel Rev Address Exposed Message
external-prometheus 2.33.5 active 1 prometheus-k8s 7 10.152.183.143 no
prometheus 2.33.5 waiting 1/2 prometheus-k8s 8 10.152.183.35 no installing agent

Unit Workload Agent Address Ports Message
external-prometheus/0* active idle 10.1.207.140
prometheus/0* unknown lost agent lost, see 'juju show-status-log prometheus/0'
prometheus/1 active idle 10.1.207.141

Relation provider Requirer Interface Type Message
external-prometheus:prometheus-peers external-prometheus:prometheus-peers prometheus_peers peer
prometheus:prometheus-peers prometheus:prometheus-peers prometheus_peers peer

juju debug-log:

unit-prometheus-0: 17:19:04.625 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-prometheus-0: 17:19:04.647 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-prometheus-0
unit-prometheus-0: 17:19:04.663 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-prometheus-0: 17:19:04.663 INFO juju.worker.caasupgrader unblocking abort check
unit-prometheus-0: 17:19:04.749 INFO juju.worker.uniter unit "prometheus/0" started
unit-prometheus-0: 17:19:04.784 INFO juju.worker.uniter hooks are retried true
unit-prometheus-0: 17:19:04.963 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-0: 17:19:05.537 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:05.569 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook
unit-prometheus-0: 17:19:06.551 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:07.472 INFO unit.prometheus/1.juju-log Kubernetes resources for app 'prometheus', container 'prometheus' patched successfully: ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'})
unit-prometheus-1: 17:19:07.538 INFO unit.prometheus/1.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'})
unit-prometheus-1: 17:19:07.785 INFO unit.prometheus/1.juju-log Pushed new configuration
unit-prometheus-1: 17:19:09.183 INFO unit.prometheus/1.juju-log Prometheus (re)started
unit-prometheus-1: 17:19:09.687 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:09.752 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:09.775 INFO juju.worker.uniter found queued "start" hook
unit-prometheus-0: 17:19:10.006 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-prometheus-1: 17:19:10.903 INFO unit.prometheus/1.juju-log Running legacy hooks/start.
unit-prometheus-0: 17:19:11.308 INFO unit.prometheus/0.juju-log Running legacy hooks/start.
unit-prometheus-1: 17:19:12.415 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:14.023 INFO juju.worker.uniter.operation ran "leader-settings-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:14.767 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:15.343 INFO unit.prometheus/1.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'})
unit-prometheus-1: 17:19:16.238 INFO juju.worker.uniter.operation ran "prometheus-pebble-ready" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:18.068 INFO juju.worker.uniter.operation ran "prometheus-peers-relation-joined" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:18.517 INFO unit.prometheus/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-0: 17:19:20.051 INFO juju.worker.uniter.operation ran "prometheus-pebble-ready" hook (via hook dispatching script: dispatch)
unit-prometheus-1: 17:19:20.398 INFO juju.worker.uniter.operation ran "prometheus-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-0: 17:19:22.630 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-prometheus-0: 17:19:22.840 ERROR juju.worker.uniter.operation hook "prometheus-peers-relation-joined" (via hook dispatching script: dispatch) failed: signal: terminated
unit-prometheus-0: 17:19:22.851 INFO juju.worker.uniter awaiting error resolution for "relation-joined" hook
unit-prometheus-0: 17:19:23.190 INFO juju.worker.uniter awaiting error resolution for "relation-joined" hook
unit-prometheus-1: 17:20:05.429 INFO juju.worker.leadership prometheus/1 promoted to leadership of prometheus
unit-prometheus-1: 17:20:05.464 INFO juju.worker.uniter found queued "leader-elected" hook
unit-prometheus-1: 17:20:07.214 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)

And the status log:

$ juju show-status-log prometheus/0
Time Type Status Message
14 Oct 2022 17:18:05-03:00 juju-unit executing running prometheus-peers-relation-created hook
14 Oct 2022 17:18:17-03:00 juju-unit error hook failed: "prometheus-peers-relation-created"
14 Oct 2022 17:18:22-03:00 juju-unit executing running prometheus-peers-relation-created hook
14 Oct 2022 17:18:24-03:00 juju-unit executing running leader-elected hook
14 Oct 2022 17:18:29-03:00 juju-unit executing running prometheus-pebble-ready hook
14 Oct 2022 17:18:34-03:00 juju-unit executing running database-storage-attached hook
14 Oct 2022 17:18:37-03:00 juju-unit executing running config-changed hook
14 Oct 2022 17:18:43-03:00 juju-unit executing running start hook
14 Oct 2022 17:18:46-03:00 juju-unit error hook failed: "start"
14 Oct 2022 17:18:55-03:00 juju-unit error crash loop backoff: back-off 10s restarting failed container=charm pod=prometheus-0_cos-lite(5e1c5f13-d4f4-47d1-ada1-84bba1b25e79)
14 Oct 2022 17:18:55-03:00 workload maintenance installing charm software
14 Oct 2022 17:19:04-03:00 juju-unit error hook failed: "start"
14 Oct 2022 17:19:10-03:00 juju-unit executing running start hook
14 Oct 2022 17:19:14-03:00 workload unknown
14 Oct 2022 17:19:15-03:00 juju-unit executing running prometheus-pebble-ready hook
14 Oct 2022 17:19:18-03:00 workload waiting Waiting for resource limit patch to apply
14 Oct 2022 17:19:20-03:00 juju-unit executing running prometheus-peers-relation-joined hook for prometheus/1
14 Oct 2022 17:19:22-03:00 juju-unit error hook failed: "prometheus-peers-relation-joined"
14 Oct 2022 17:19:36-03:00 juju-unit idle
14 Oct 2022 17:19:36-03:00 workload blocked 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

And K8s describe pod:

$ microk8s.kubectl -n cos-lite describe pods/prometheus-0
Name: prometheus-0
Namespace: cos-lite
Priority: 0
Service Account: prometheus
Node: <none>
Labels: app.kubernetes.io/name=prometheus
                  controller-revision-hash=prometheus-545b757b9c
                  statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: controller.juju.is/id: cb85ccce-0d1e-4572-830b-39679a25ed79
                  juju.is/version: 2.9.35
                  model.juju.is/id: 6db7bf20-240f-4733-8ed1-896b62f463c2
                  unit.juju.is/id: prometheus/0
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/prometheus
Init Containers:
  charm-init:
    Image: jujusolutions/jujud-operator:2.9.35
    Port: <none>
    Host Port: <none>
    Command:
      /opt/containeragent
    Args:
      init
      --containeragent-pebble-dir
      /containeragent/pebble
      --charm-modified-version
      0
      --data-dir
      /var/lib/juju
      --bin-dir
      /charm/bin
    Environment Variables from:
      prometheus-application-config Secret Optional: false
    Environment:
      JUJU_CONTAINER_NAMES: prometheus
      JUJU_K8S_POD_NAME: prometheus-0 (v1:metadata.name)
      JUJU_K8S_POD_UUID: (v1:metadata.uid)
    Mounts:
      /charm/bin from charm-data (rw,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /containeragent/pebble from charm-data (rw,path="containeragent/pebble")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w8j4p (ro)
Containers:
  charm:
    Image: jujusolutions/charm-base:ubuntu-20.04
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --http
      :38812
      --verbose
    Liveness: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness: http-get http://:38812/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAMES: prometheus
      HTTP_PROBE_PORT: 3856
    Mounts:
      /charm/bin from charm-data (ro,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/lib/juju/storage/database/0 from prometheus-database-ff2c93ce (rw)
      /var/lib/pebble/default from charm-data (rw,path="containeragent/pebble")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w8j4p (ro)
  prometheus:
    Image: ubuntu/prometheus:2.33-22.04_beta
    Port: <none>
    Host Port: <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    Requests:
      cpu: 250m
      memory: 200Mi
    Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAME: prometheus
      PEBBLE_SOCKET: /charm/container/pebble.socket
    Mounts:
      /charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
      /charm/container from charm-data (rw,path="charm/containers/prometheus")
      /var/lib/prometheus from prometheus-database-ff2c93ce (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w8j4p (ro)
Conditions:
  Type Status
  PodScheduled False
Volumes:
  prometheus-database-ff2c93ce:
    Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName: prometheus-database-ff2c93ce-prometheus-0
    ReadOnly: false
  charm-data:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit: <unset>
  kube-api-access-w8j4p:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 5s default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning FailedScheduling 4s default-scheduler 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
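The FailedScheduling events show the real cause: the node's remaining allocatable memory is smaller than the pod's 200Mi request. The fit check the scheduler performs can be sketched as follows (an illustrative Python sketch, not the canonical Kubernetes quantity parser; suffix handling covers only the common cases seen in this report):

```python
def parse_quantity(q: str) -> float:
    """Parse a Kubernetes resource quantity (e.g. '200Mi', '250m', '2')
    into base units. Illustrative only -- not the canonical k8s parser."""
    suffixes = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
                "m": 1e-3, "k": 1e3, "M": 1e6, "G": 1e9}
    for suffix, mult in suffixes.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * mult
    return float(q)


def fits(allocatable: str, already_requested: str, incoming: str) -> bool:
    """The scheduler's basic fit check for one resource: does the incoming
    pod's request still fit into the node's allocatable capacity?"""
    return (parse_quantity(already_requested) + parse_quantity(incoming)
            <= parse_quantity(allocatable))
```

For example, with a 2Gi node that already has 1900Mi requested, an incoming 200Mi request does not fit, and the pod stays Pending with exactly the message seen above.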

More info about this issue can be found here: https://github.com/canonical/prometheus-k8s-operator/issues/389

[1] https://github.com/canonical/prometheus-k8s-operator

Revision history for this message
Juan M. Tirado (tiradojm) wrote :

I will set this bug to Invalid because I see that the parallel discussion on GitHub [1] has been closed.

[1] https://github.com/canonical/prometheus-k8s-operator/issues/389

Changed in juju:
status: New → Invalid
Jose C. Massón (jose-masson) wrote :

Hi Juan,

I closed the bug in GitHub since it is not a Prometheus bug but a Juju one:

"There is no issue with the prometheus charm itself; it is Juju providing too little information about what went wrong. Issue here."

Changed in juju:
status: Invalid → New
Ian Booth (wallyworld) wrote :

We should reasonably be able to surface any underlying pod scheduling error in juju status for the unit.
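Surfacing the scheduling error could look roughly like this (a hypothetical Python sketch, not Juju's actual implementation, which is written in Go; the event shape assumed here follows the items of `kubectl get events -o json`): pick the newest FailedScheduling warning on the pending pod and use its message as the unit's status message.

```python
from typing import Optional


def scheduling_error(events: list) -> Optional[str]:
    """Return the message of the newest FailedScheduling warning, if any.

    `events` is assumed to be a list of dicts with the keys 'type',
    'reason', 'message', and 'lastTimestamp', as produced by
    `kubectl get events -o json`. Sketch only, not Juju's actual code.
    """
    warnings = [e for e in events
                if e.get("type") == "Warning"
                and e.get("reason") == "FailedScheduling"]
    if not warnings:
        return None
    # ISO-8601 timestamps sort lexicographically, so max() picks the newest.
    return max(warnings, key=lambda e: e["lastTimestamp"])["message"]
```

With the events from this report, the unit's status message would become "0/1 nodes are available: 1 Insufficient memory. ..." instead of a bare "unknown".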

Changed in juju:
milestone: none → 2.9.36
importance: Undecided → High
status: New → Triaged
milestone: 2.9.36 → 2.9.37
Changed in juju:
milestone: 2.9.37 → 2.9.38
Changed in juju:
milestone: 2.9.38 → 2.9.39
Changed in juju:
milestone: 2.9.39 → 2.9.40
Changed in juju:
milestone: 2.9.40 → 2.9.41
Changed in juju:
milestone: 2.9.41 → 2.9.42
Changed in juju:
milestone: 2.9.42 → 2.9.43
Changed in juju:
milestone: 2.9.43 → 2.9.44
Changed in juju:
milestone: 2.9.44 → 2.9.45
Changed in juju:
milestone: 2.9.45 → 2.9.46