Kubeflow charms fail with unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated

Bug #1991326 reported by Bas de Bruijne
Affects: Kubernetes Control Plane Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Testrun https://solutions.qa.canonical.com/testruns/testRun/71ccb76a-4c34-4293-9c2d-cc139a9898c6 fails with the following status:

```
Model Controller Cloud/Region Version SLA Timestamp
kubeflow foundations-k8s kubernetes_cloud/us-east-1 2.9.34 unsupported 14:45:38Z

App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook res:oci-image@84a4d7d active 1 admission-webhook 1.6/stable 50 10.152.183.66 no
argo-controller res:oci-image@669ebd5 active 1 argo-controller 3.3/stable 99 no
dex-auth active 1 dex-auth 2.31/stable 129 10.152.183.99 no
istio-ingressgateway active 1 istio-gateway 1.11/stable 114 10.152.183.126 no
istio-pilot waiting 1 istio-pilot 1.11/stable 131 10.152.183.76 no installing agent
jupyter-controller res:oci-image@8f4ec33 active 1 jupyter-controller 1.6/stable 138 no
jupyter-ui res:oci-image@cde6632 active 1 jupyter-ui 1.6/stable 99 10.152.183.46 no
kfp-api res:oci-image@1b44753 active 1 kfp-api 2.0/stable 81 10.152.183.82 no
kfp-db mariadb/server:10.3 active 1 charmed-osm-mariadb-k8s latest/stable 35 10.152.183.2 no ready
kfp-persistence res:oci-image@31f08ad waiting 2/1 kfp-persistence 2.0/stable 76 no
kfp-profile-controller res:oci-image@d86ecff active 1 kfp-profile-controller 2.0/stable 61 10.152.183.58 no
kfp-schedwf res:oci-image@51ffc60 active 1 kfp-schedwf 2.0/stable 80 no
kfp-ui res:oci-image@55148fd active 1 kfp-ui 2.0/stable 80 10.152.183.176 no
kfp-viewer res:oci-image@7190aa3 active 1 kfp-viewer 2.0/stable 79 no
kfp-viz res:oci-image@67e8b09 waiting 2/1 kfp-viz 2.0/stable 74 10.152.183.29 no
kubeflow-dashboard res:oci-image@6fe6eec active 1 kubeflow-dashboard 1.6/stable 154 10.152.183.108 no
kubeflow-profiles res:profile-image@0a46ffc active 1 kubeflow-profiles 1.6/stable 82 10.152.183.186 no
kubeflow-roles active 1 kubeflow-roles 1.6/stable 31 10.152.183.187 no
kubeflow-volumes res:oci-image@cc5177a active 1 kubeflow-volumes 1.6/stable 64 10.152.183.183 no
metacontroller-operator active 1 metacontroller-operator 2.0/stable 48 10.152.183.253 no
minio res:oci-image@1755999 active 1 minio ckf-1.6/stable 99 10.152.183.33 no
oidc-gatekeeper res:oci-image@32de216 active 1 oidc-gatekeeper ckf-1.6/stable 76 10.152.183.248 no
seldon-controller-manager res:oci-image@eb811b6 active 1 seldon-core 1.14/stable 92 10.152.183.188 no
training-operator active 1 training-operator 1.5/stable 65 10.152.183.39 no

Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 192.168.94.139 4443/TCP
argo-controller/0* active idle 192.168.192.207
dex-auth/0* active idle 192.168.188.199
istio-ingressgateway/0* active idle 192.168.94.137
istio-pilot/0* waiting idle 192.168.188.200 Waiting for gateway address
jupyter-controller/0* active idle 192.168.94.140
jupyter-ui/0* active idle 192.168.188.207 5000/TCP
kfp-api/0* active executing 192.168.188.218 8888/TCP,8887/TCP
kfp-db/0* active idle 192.168.188.213 3306/TCP ready
kfp-persistence/0* error idle 192.168.94.150 unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated
kfp-persistence/1 waiting executing 192.168.192.210 Waiting for leadership
kfp-profile-controller/0* active idle 192.168.188.217 80/TCP
kfp-schedwf/0* active idle 192.168.94.143
kfp-ui/0* active idle 192.168.192.208 3000/TCP
kfp-viewer/0* active idle 192.168.188.210
kfp-viz/0* error idle 192.168.94.146 8888/TCP unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated
kfp-viz/1 waiting executing 192.168.188.219 8888/TCP Waiting for leadership
kubeflow-dashboard/0* active idle 192.168.94.145 8082/TCP
kubeflow-profiles/0* active idle 192.168.188.214 8080/TCP,8081/TCP
kubeflow-roles/0* active idle 192.168.188.201
kubeflow-volumes/0* active idle 192.168.192.204 5000/TCP
metacontroller-operator/0* active idle 192.168.188.203
minio/0* active idle 192.168.192.206 9000/TCP,9001/TCP
oidc-gatekeeper/0* active idle 192.168.94.149 8080/TCP
seldon-controller-manager/0* active idle 192.168.188.216 8080/TCP,4443/TCP
training-operator/0* active idle 192.168.188.205
```

So kfp-viz and kfp-persistence are in an error state. It looks like Juju tried to spin up a new instance of each of them, but they failed with the message: unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated.

I can't work out why this is happening. I opened this bug against Kubeflow first, but they think it's an issue in k8s. This is ck8s 1.22, for compatibility with Kubeflow.
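
For reference, the termination reason that Juju is surfacing should also be visible directly from Kubernetes. A minimal sketch, assuming the charms run in the kubeflow namespace and that kubectl access to the cluster is available (the pod name is a placeholder):

```
# List pods that ended up in the Failed phase in the kubeflow namespace
kubectl get pods -n kubeflow --field-selector=status.phase=Failed

# Show the Status, Reason and Message recorded for one of them
# (replace the placeholder with a pod name from the listing above)
kubectl describe pod <failed-pod-name> -n kubeflow
```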

Logs and configs can be found here:
https://oil-jenkins.canonical.com/artifacts/71ccb76a-4c34-4293-9c2d-cc139a9898c6/index.html

George Kraft (cynerva) wrote:

It looks like those pods were evicted because of low disk space. From the crashdump's kubernetes-control-plane_0/debug-20220923144655.tar.gz/kubectl/describe-pods:

```
Name:     kfp-persistence-956cf567c-v76rm
...
Status:   Failed
Reason:   Evicted
Message:  The node was low on resource: ephemeral-storage.

Name:     kfp-viz-77f7559897-qtj8l
...
Status:   Failed
Reason:   Evicted
Message:  The node was low on resource: ephemeral-storage.
```

I'm not quite sure why that left the containers in the ContainerStatusUnknown state - that looks like a kubelet or containerd bug. Regardless, you will probably need more disk space on your kubernetes-worker units.
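
If it helps to confirm this on the live cluster: the kubelet records evictions as events and sets the node's DiskPressure condition while it is under ephemeral-storage pressure, so something along these lines (a sketch; the node name is a placeholder) should show whether a worker is running out of disk:

```
# Recent pod evictions across all namespaces
kubectl get events --all-namespaces --field-selector reason=Evicted

# Check the DiskPressure condition on the suspect worker node
kubectl describe node <worker-node> | grep -i -A2 diskpressure
```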

Consider raising the root-disk constraint of your kubernetes-worker units to 50G to match the documented storage requirement for charmed kubeflow[1]. I suspect that doing so will prevent you from encountering this bug.

[1]: https://charmed-kubeflow.io/docs/install
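
A minimal sketch of how that constraint change could be made with Juju (values are illustrative; set-constraints only affects units added after the change, so existing workers would need to be replaced or have their disks grown out of band):

```
# Raise the root-disk constraint for any future kubernetes-worker units
juju set-constraints kubernetes-worker root-disk=50G

# Roll a worker onto the new constraint
# (the unit number is a placeholder for one of the existing units)
juju add-unit kubernetes-worker
juju remove-unit kubernetes-worker/<unit-number>
```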

Canonical Solutions QA Bot (oil-ci-bot) wrote:

This bug is fixed with commit bb6df750 to cpe-foundation on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cpe-foundation/commit/?id=bb6df750
