Kubeflow charms fail with unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Kubernetes Control Plane Charm | New | Undecided | Unassigned |
Bug Description
Testrun https:/
```
Model Controller Cloud/Region Version SLA Timestamp
kubeflow foundations-k8s kubernetes_
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook res:oci-
argo-controller res:oci-
dex-auth active 1 dex-auth 2.31/stable 129 10.152.183.99 no
istio-ingressga
istio-pilot waiting 1 istio-pilot 1.11/stable 131 10.152.183.76 no installing agent
jupyter-controller res:oci-
jupyter-ui res:oci-
kfp-api res:oci-
kfp-db mariadb/server:10.3 active 1 charmed-
kfp-persistence res:oci-
kfp-profile-
kfp-schedwf res:oci-
kfp-ui res:oci-
kfp-viewer res:oci-
kfp-viz res:oci-
kubeflow-dashboard res:oci-
kubeflow-profiles res:profile-
kubeflow-roles active 1 kubeflow-roles 1.6/stable 31 10.152.183.187 no
kubeflow-volumes res:oci-
metacontroller-
minio res:oci-
oidc-gatekeeper res:oci-
seldon-
training-operator active 1 training-operator 1.5/stable 65 10.152.183.39 no
Unit Workload Agent Address Ports Message
admission-
argo-controller/0* active idle 192.168.192.207
dex-auth/0* active idle 192.168.188.199
istio-ingressga
istio-pilot/0* waiting idle 192.168.188.200 Waiting for gateway address
jupyter-
jupyter-ui/0* active idle 192.168.188.207 5000/TCP
kfp-api/0* active executing 192.168.188.218 8888/TCP,8887/TCP
kfp-db/0* active idle 192.168.188.213 3306/TCP ready
kfp-persistence/0* error idle 192.168.94.150 unknown container reason "ContainerStatusUnknown"
kfp-persistence/1 waiting executing 192.168.192.210 Waiting for leadership
kfp-profile-
kfp-schedwf/0* active idle 192.168.94.143
kfp-ui/0* active idle 192.168.192.208 3000/TCP
kfp-viewer/0* active idle 192.168.188.210
kfp-viz/0* error idle 192.168.94.146 8888/TCP unknown container reason "ContainerStatusUnknown"
kfp-viz/1 waiting executing 192.168.188.219 8888/TCP Waiting for leadership
kubeflow-
kubeflow-
kubeflow-roles/0* active idle 192.168.188.201
kubeflow-volumes/0* active idle 192.168.192.204 5000/TCP
metacontroller-
minio/0* active idle 192.168.192.206 9000/TCP,9001/TCP
oidc-gatekeeper/0* active idle 192.168.94.149 8080/TCP
seldon-
training-
```
So kfp-viz and kfp-persistence are in an error state. It seems like Juju tried to spin up a new instance of each of them, but they failed with the message: unknown container reason "ContainerStatusUnknown".
I can't work out why this is happening. I opened this bug against Kubeflow first, but they think it's an issue in k8s. This is ck8s 1.22, for compatibility with Kubeflow.
Logs and configs can be found here:
https:/
It looks like those pods were evicted because of low disk space. From the crashdump's kubernetes-control-plane_0/debug-20220923144655.tar.gz/kubectl/describe-pods:
Name: kfp-persistence-956cf567c-v76rm
...
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage.
Name: kfp-viz-77f7559897-qtj8l
...
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage.
I'm not quite sure why that led the containers to the ContainerStatusUnknown state - that looks like a kubelet or containerd bug. Regardless, you will probably need more disk space on kubernetes-worker units.
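For reference, here is a rough way to confirm the same eviction and disk-pressure situation on a live cluster rather than from a crashdump. The pod name is taken from the describe-pods output above; the kubeflow namespace and the mount point checked on the worker are assumptions about this particular deployment:

```
# List failed pods in every namespace; evicted pods show up here
kubectl get pods --all-namespaces --field-selector=status.phase=Failed

# Inspect one of them; look for "Reason: Evicted" and the
# "The node was low on resource: ephemeral-storage" message
kubectl describe pod -n kubeflow kfp-persistence-956cf567c-v76rm

# Check whether the node reported DiskPressure in its conditions
kubectl describe nodes | grep -A 7 "Conditions:"

# On the kubernetes-worker itself, check how much space is actually left
# (which filesystem matters depends on how the worker's storage is laid out)
df -h /
```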
Consider raising the root-disk constraint of your kubernetes-worker units to 50G to match the documented storage requirement for Charmed Kubeflow [1]. I suspect that doing so will prevent you from encountering this bug; a sketch of the relevant Juju commands is below the reference.
[1]: https://charmed-kubeflow.io/docs/install
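For completeness, this is roughly what that would look like with Juju. Constraints only apply to units added after they are set, so the existing workers would need to be replaced; the unit number below is a placeholder:

```
# Set a 50G root-disk constraint for future kubernetes-worker units
juju set-constraints kubernetes-worker root-disk=50G

# Bring up a replacement worker with the new constraint, then retire an
# old one once its workloads have drained
juju add-unit kubernetes-worker
juju remove-unit kubernetes-worker/0   # placeholder unit number

# Verify the application-level constraint
juju constraints kubernetes-worker
```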