kubeflow-lite: Failed to deploy kfp-viz

Bug #1981335 reported by Nikos Sklikas
This bug affects 1 person
Affects             Status    Importance   Assigned to   Milestone
Canonical Juju      Invalid   Undecided    Unassigned
Snap Store Server   New       Undecided    Unassigned

Bug Description

I am trying to install kubeflow-lite on microk8s on my local machine following the instructions at https://charmed-kubeflow.io/docs/quickstart.

I am getting ImagePullBackOff for kfp-viz after running:
$ juju deploy kubeflow-lite --trust

I have tried it many times and always get the same error for the same charm. Sometimes I also got ImagePullBackOff for kfp-profile-controller, but that seems to be random.

The pod events show that the image pull fails layer size validation:

$ microk8s.kubectl describe pod kfp-viz-7cf5b4d6fd-bctsh -n kubeflow

Name: kfp-viz-7cf5b4d6fd-bctsh
Namespace: kubeflow
Priority: 0
Node: nikopc/192.168.2.4
Start Time: Mon, 11 Jul 2022 17:01:45 +0300
Labels: app.kubernetes.io/name=kfp-viz
              pod-template-hash=7cf5b4d6fd
Annotations: apparmor.security.beta.kubernetes.io/pod: runtime/default
              charm.juju.is/modified-version: 0
              cni.projectcalico.org/podIP: 10.1.184.63/32
              cni.projectcalico.org/podIPs: 10.1.184.63/32
              controller.juju.is/id: 4bc95001-cb3b-4227-8474-97d964b7f7a4
              model.juju.is/id: 5ddd8c92-3b2f-4b96-8b6a-a922c260fdca
              seccomp.security.beta.kubernetes.io/pod: docker/default
              unit.juju.is/id: kfp-viz/0
Status: Pending
IP: 10.1.184.63
IPs:
  IP: 10.1.184.63
Controlled By: ReplicaSet/kfp-viz-7cf5b4d6fd
Init Containers:
  juju-pod-init:
    Container ID: containerd://367c75591a6afcc6ca5e2934c575def9bf26beb6bcb123cd067d526b6f484774
    Image: jujusolutions/jujud-operator:2.9.32
    Image ID: docker.io/jujusolutions/jujud-operator@sha256:7eff2c7dcd6e826217330aa25c24eb45a0882893689c46d92e31afe309ebb08d
    Port: <none>
    Host Port: <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud

      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi

    State: Terminated
      Reason: Completed
      Exit Code: 0
      Started: Mon, 11 Jul 2022 17:01:47 +0300
      Finished: Mon, 11 Jul 2022 17:01:52 +0300
    Ready: True
    Restart Count: 0
    Environment: <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gtn97 (ro)
Containers:
  ml-pipeline-visualizationserver:
    Container ID:
    Image: registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698
    Image ID:
    Port: 8888/TCP
    Host Port: 0/TCP
    State: Waiting
      Reason: ImagePullBackOff
    Ready: False
    Restart Count: 0
    Liveness: exec [wget -q -S -O - http://localhost:8888/] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness: exec [wget -q -S -O - http://localhost:8888/] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment: <none>
    Mounts:
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gtn97 (ro)
Conditions:
  Type Status
  Initialized True
  Ready False
  ContainersReady False
  PodScheduled True
Volumes:
  juju-data-dir:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit: <unset>
  kube-api-access-gtn97:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 18m default-scheduler Successfully assigned kubeflow/kfp-viz-7cf5b4d6fd-bctsh to nikopc
  Normal Pulled 18m kubelet Container image "jujusolutions/jujud-operator:2.9.32" already present on machine
  Normal Created 18m kubelet Created container juju-pod-init
  Normal Started 18m kubelet Started container juju-pod-init
  Warning Failed 14m kubelet Failed to pull image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698": failed commit on ref "layer-sha256:a6259a64132411c7b9f9778a71c440f6a50772569b9a4318bea4cdab80a0dd1b": "layer-sha256:a6259a64132411c7b9f9778a71c440f6a50772569b9a4318bea4cdab80a0dd1b" failed size validation: 21375856 != 151098959: failed precondition
  Warning Failed 11m kubelet Failed to pull image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698": failed commit on ref "layer-sha256:a6259a64132411c7b9f9778a71c440f6a50772569b9a4318bea4cdab80a0dd1b": "layer-sha256:a6259a64132411c7b9f9778a71c440f6a50772569b9a4318bea4cdab80a0dd1b" failed size validation: 21265360 != 151098959: failed precondition
  Normal BackOff 10m (x2 over 14m) kubelet Back-off pulling image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698"
  Warning Failed 10m (x2 over 14m) kubelet Error: ImagePullBackOff
  Normal Pulling 10m (x3 over 18m) kubelet Pulling image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698"
  Warning Failed 8m18s kubelet Failed to pull image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698": failed commit on ref "layer-sha256:3db2b246fb429f27589629656b5dba9f9be6eae74a468fb210f1515df7fb59d6": "layer-sha256:3db2b246fb429f27589629656b5dba9f9be6eae74a468fb210f1515df7fb59d6" failed size validation: 20068856 != 32364158: failed precondition
  Warning Failed 8m18s (x3 over 14m) kubelet Error: ErrImagePull
  Warning DNSConfigForming 3m28s (x20 over 18m) kubelet Search Line limits were exceeded, some search paths have been omitted, the applied search line is: kubeflow.svc.cluster.local svc.cluster.local cluster.local enablement external internal
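
The size mismatches in the events above can be pulled out and expressed as a percentage with a small helper. This is only a sketch: `size_mismatches` is a hypothetical name, and it reads the first number as the bytes actually received and the second as the size the layer is expected to have (an assumption, based on the larger value staying the same across attempts for the same layer).

```shell
#!/bin/sh
# size_mismatches (hypothetical helper): reads event text on stdin and, for each
# "failed size validation: GOT != WANT" occurrence, prints the two byte counts
# followed by GOT as an integer percentage of WANT.
size_mismatches() {
    grep -oE 'failed size validation: [0-9]+ != [0-9]+' |
    while read -r w1 w2 w3 got ne want; do
        # w1..w3 and ne hold the fixed words of the message and are ignored.
        echo "$got $want $(( $got * 100 / $want ))%"
    done
}

# Usage (pod name taken from the report above):
#   microk8s.kubectl describe pod kfp-viz-7cf5b4d6fd-bctsh -n kubeflow | size_mismatches
```

For the three failures listed above this gives 14%, 14% and 62%, which suggests the layer downloads were cut short rather than the layers being a different size in the registry.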

I am running juju v2.9.32
Please let me know if I can provide more info to help you reproduce this.

Tags: charmhub
Andrew Scribner (ca-scribner) wrote :

To add to this, the problem is not simply that the image the charm needs never existed: others can deploy this charm successfully. Something intermittent or user-specific is causing the ImagePullBackOff here, although I have no idea what.

Juan M. Tirado (tiradojm) wrote :

This doesn't seem like a pure Juju error. I will set this bug as invalid. Please feel free to reopen if you find specific Juju-related problems.

Changed in juju:
status: New → Invalid
John A Meinel (jameinel) wrote :

I'm adding snapstore-server because I don't have a better project for registry.jujucharms.com.

However, if we are making a request for an image and then getting a result that doesn't match the request, that seems like a serious issue in the registry.

tags: added: charmhub
Nikos Sklikas (nsklikas) wrote :

I tried to look into it a little and I'm not sure if it's a problem with the jujucharms registry or with some kubelet config.

I think that it has to do with the size of the image.
I thought it might be related to my slow internet connection, so I tried setting `--runtime-request-timeout` to a longer value (the default is 2 minutes) in the MicroK8s kubelet config (/var/snap/microk8s/3202/args/kubelet) to see whether a timeout was the cause, but it led nowhere.

I worked around the issue by downloading the image manually:
  - I went to the charm’s resource page (https://charmhub.io/kfp-viz/resources/oci-image)
  - I downloaded the resource file (https://api.charmhub.io/api/v1/resources/download/charm_Z2g5QvSCaQsxR3kKn0AwvNytkMgIhFrN.oci-image_26)
  - The downloaded file is a JSON document containing an ImageName, a Username and a Password (why is a public username/password needed to download an image?)
  - I downloaded the image by running:
        $ microk8s ctr i pull {ImageName} -u {Username}:{Password}

The download took some time to finish, but raised no errors.
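
The steps above could be scripted roughly as follows. This is a sketch, not a tested procedure: it assumes curl and jq are installed, that the JSON keys are spelled exactly ImageName, Username and Password, and the function names `pull_with_credentials` and `pull_charm_image` are made up for illustration.

```shell
#!/bin/sh
# Sketch of the manual workaround; needs curl, jq and microk8s on the PATH.
# Resource URL as given in the steps above.
RESOURCE_URL="https://api.charmhub.io/api/v1/resources/download/charm_Z2g5QvSCaQsxR3kKn0AwvNytkMgIhFrN.oci-image_26"

# pull_with_credentials IMAGE USER PASSWORD (hypothetical helper)
# Pulls IMAGE into MicroK8s' containerd using the given registry credentials.
pull_with_credentials() {
    microk8s ctr images pull -u "$2:$3" "$1"
}

# pull_charm_image (hypothetical helper)
# Fetches the resource JSON, extracts the three fields, then pulls the image.
# Assumes the keys are named ImageName, Username and Password.
pull_charm_image() {
    resource=$(curl -fsSL "$RESOURCE_URL") || return 1
    image=$(printf '%s' "$resource" | jq -r '.ImageName') || return 1
    user=$(printf '%s' "$resource" | jq -r '.Username') || return 1
    pass=$(printf '%s' "$resource" | jq -r '.Password') || return 1
    pull_with_credentials "$image" "$user" "$pass"
}

# Usage:
#   pull_charm_image
```

Once the image is present locally, the kubelet should no longer need to pull it.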

Daniel Manrique (roadmr) wrote :

To test this I got the image info and credentials from the resource (basically what Nikolaos did, but with the docker client):

curl -sL https://api.charmhub.io/api/v1/resources/download/charm_Z2g5QvSCaQsxR3kKn0AwvNytkMgIhFrN.oci-image_26 | jq .

then I did:

docker logout
docker login registry.jujucharms.com

and I input the credentials (username, password) from the resource. I then did:

docker pull registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698

(the image identifier comes right from the resource as well).

This downloaded about 4GB worth of layers on a 30 Mbps residential connection. Took its sweet long time but it did work:

$ docker pull registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698
registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698: Pulling from charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image
a16b311a57cb: Pull complete
d8f1984ce468: Pull complete
a6259a641324: Pull complete
08f1612b3894: Pull complete
6efd3ea758ec: Pull complete
d43a70c57613: Pull complete
fe09379c9a6e: Pull complete
92a1f997eb57: Pull complete
e2d89532c3ba: Pull complete
3db2b246fb42: Pull complete
562ba219afe4: Pull complete
6f369261a09f: Pull complete
977cd55548b1: Pull complete
2563eaea9368: Pull complete
35fdeb2cc87a: Pull complete
b28b2e05a521: Pull complete
c40f63038123: Pull complete
Digest: sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698
Status: Downloaded newer image for registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698
registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698

$ docker image list registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image
REPOSITORY TAG IMAGE ID CREATED SIZE
registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image <none> 913fc89e48ad 10 months ago 4.04GB

I did get the impression it was kind of slow. What could be happening here is that registry.jujucharms.com is slow and/or craps out mid-download. If this happens with a desktop client, I can retry; I don't know what the behavior is with the Kubernetes docker client. The data stored in the registry doesn't seem to be corrupted, so a download hiccup causing unrecoverable corruption sounds like a docker shortcoming rather than a registry problem.
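
If interrupted downloads are indeed the cause, a client-side mitigation is simply to retry the pull until it completes. A minimal sketch of such a wrapper follows; `retry` is a hypothetical helper, and whether a retried pull reuses layers that were already downloaded depends on the client.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...] (hypothetical helper)
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between attempts. Returns 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Usage (image reference taken from the pull above):
#   retry 5 30 docker pull registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:13c46cf878062fd6ad672cbec4854eba7e869cd0123a8975bea49b9d75d4e698
```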

My team doesn't directly operate registry.jujucharms.com, but I can check internally to see who is considered responsible for its operation and see if there are any metrics they could check or share to understand if the service is overloaded or having any kind of network trouble, and then scale or mitigate accordingly.
