Kubeflow charms fail with unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated

Bug #1991326 reported by Bas de Bruijne
Affects: Kubernetes Control Plane Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Testrun https://solutions.qa.canonical.com/testruns/testRun/71ccb76a-4c34-4293-9c2d-cc139a9898c6 fails with the following status:

```
Model Controller Cloud/Region Version SLA Timestamp
kubeflow foundations-k8s kubernetes_cloud/us-east-1 2.9.34 unsupported 14:45:38Z

App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook res:oci-image@84a4d7d active 1 admission-webhook 1.6/stable 50 10.152.183.66 no
argo-controller res:oci-image@669ebd5 active 1 argo-controller 3.3/stable 99 no
dex-auth active 1 dex-auth 2.31/stable 129 10.152.183.99 no
istio-ingressgateway active 1 istio-gateway 1.11/stable 114 10.152.183.126 no
istio-pilot waiting 1 istio-pilot 1.11/stable 131 10.152.183.76 no installing agent
jupyter-controller res:oci-image@8f4ec33 active 1 jupyter-controller 1.6/stable 138 no
jupyter-ui res:oci-image@cde6632 active 1 jupyter-ui 1.6/stable 99 10.152.183.46 no
kfp-api res:oci-image@1b44753 active 1 kfp-api 2.0/stable 81 10.152.183.82 no
kfp-db mariadb/server:10.3 active 1 charmed-osm-mariadb-k8s latest/stable 35 10.152.183.2 no ready
kfp-persistence res:oci-image@31f08ad waiting 2/1 kfp-persistence 2.0/stable 76 no
kfp-profile-controller res:oci-image@d86ecff active 1 kfp-profile-controller 2.0/stable 61 10.152.183.58 no
kfp-schedwf res:oci-image@51ffc60 active 1 kfp-schedwf 2.0/stable 80 no
kfp-ui res:oci-image@55148fd active 1 kfp-ui 2.0/stable 80 10.152.183.176 no
kfp-viewer res:oci-image@7190aa3 active 1 kfp-viewer 2.0/stable 79 no
kfp-viz res:oci-image@67e8b09 waiting 2/1 kfp-viz 2.0/stable 74 10.152.183.29 no
kubeflow-dashboard res:oci-image@6fe6eec active 1 kubeflow-dashboard 1.6/stable 154 10.152.183.108 no
kubeflow-profiles res:profile-image@0a46ffc active 1 kubeflow-profiles 1.6/stable 82 10.152.183.186 no
kubeflow-roles active 1 kubeflow-roles 1.6/stable 31 10.152.183.187 no
kubeflow-volumes res:oci-image@cc5177a active 1 kubeflow-volumes 1.6/stable 64 10.152.183.183 no
metacontroller-operator active 1 metacontroller-operator 2.0/stable 48 10.152.183.253 no
minio res:oci-image@1755999 active 1 minio ckf-1.6/stable 99 10.152.183.33 no
oidc-gatekeeper res:oci-image@32de216 active 1 oidc-gatekeeper ckf-1.6/stable 76 10.152.183.248 no
seldon-controller-manager res:oci-image@eb811b6 active 1 seldon-core 1.14/stable 92 10.152.183.188 no
training-operator active 1 training-operator 1.5/stable 65 10.152.183.39 no

Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 192.168.94.139 4443/TCP
argo-controller/0* active idle 192.168.192.207
dex-auth/0* active idle 192.168.188.199
istio-ingressgateway/0* active idle 192.168.94.137
istio-pilot/0* waiting idle 192.168.188.200 Waiting for gateway address
jupyter-controller/0* active idle 192.168.94.140
jupyter-ui/0* active idle 192.168.188.207 5000/TCP
kfp-api/0* active executing 192.168.188.218 8888/TCP,8887/TCP
kfp-db/0* active idle 192.168.188.213 3306/TCP ready
kfp-persistence/0* error idle 192.168.94.150 unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated
kfp-persistence/1 waiting executing 192.168.192.210 Waiting for leadership
kfp-profile-controller/0* active idle 192.168.188.217 80/TCP
kfp-schedwf/0* active idle 192.168.94.143
kfp-ui/0* active idle 192.168.192.208 3000/TCP
kfp-viewer/0* active idle 192.168.188.210
kfp-viz/0* error idle 192.168.94.146 8888/TCP unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated
kfp-viz/1 waiting executing 192.168.188.219 8888/TCP Waiting for leadership
kubeflow-dashboard/0* active idle 192.168.94.145 8082/TCP
kubeflow-profiles/0* active idle 192.168.188.214 8080/TCP,8081/TCP
kubeflow-roles/0* active idle 192.168.188.201
kubeflow-volumes/0* active idle 192.168.192.204 5000/TCP
metacontroller-operator/0* active idle 192.168.188.203
minio/0* active idle 192.168.192.206 9000/TCP,9001/TCP
oidc-gatekeeper/0* active idle 192.168.94.149 8080/TCP
seldon-controller-manager/0* active idle 192.168.188.216 8080/TCP,4443/TCP
training-operator/0* active idle 192.168.188.205
```

So kfp-viz and kfp-persistence are in an error state. It looks like Juju tried to spin up a new instance of each of them, but they failed with the message: unknown container reason "ContainerStatusUnknown": The container could not be located when the pod was terminated.

I can't work out why this is happening. I opened this bug against Kubeflow first, but they think it's an issue in k8s. This is ck8s 1.22, for compatibility with Kubeflow.
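
For reference, the termination reason that Juju is surfacing should also be visible directly from Kubernetes. A minimal sketch, assuming the charms run in the kubeflow namespace and that kubectl access to the cluster is available (the pod name is a placeholder):

```
# List pods that ended up in the Failed phase in the kubeflow namespace
kubectl get pods -n kubeflow --field-selector=status.phase=Failed

# Show the Status, Reason and Message recorded for one of them
# (replace the placeholder with a pod name from the listing above)
kubectl describe pod <failed-pod-name> -n kubeflow
```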

Logs and configs can be found here:
https://oil-jenkins.canonical.com/artifacts/71ccb76a-4c34-4293-9c2d-cc139a9898c6/index.html

George Kraft (cynerva) wrote:

It looks like those pods were evicted because of low disk space. From the crashdump's kubernetes-control-plane_0/debug-20220923144655.tar.gz/kubectl/describe-pods:

```
Name:     kfp-persistence-956cf567c-v76rm
...
Status:   Failed
Reason:   Evicted
Message:  The node was low on resource: ephemeral-storage.

Name:     kfp-viz-77f7559897-qtj8l
...
Status:   Failed
Reason:   Evicted
Message:  The node was low on resource: ephemeral-storage.
```

I'm not quite sure why that left the containers in the ContainerStatusUnknown state - that looks like a kubelet or containerd bug. Regardless, you will probably need more disk space on your kubernetes-worker units.
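
If it helps to confirm this on the live cluster: the kubelet records evictions as events and sets the node's DiskPressure condition while it is under ephemeral-storage pressure, so something along these lines (a sketch; the node name is a placeholder) should show whether a worker is running out of disk:

```
# Recent pod evictions across all namespaces
kubectl get events --all-namespaces --field-selector reason=Evicted

# Check the DiskPressure condition on the suspect worker node
kubectl describe node <worker-node> | grep -i -A2 diskpressure
```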

Consider raising the root-disk constraint of your kubernetes-worker units to 50G to match the documented storage requirement for charmed kubeflow[1]. I suspect that doing so will prevent you from encountering this bug.

[1]: https://charmed-kubeflow.io/docs/install
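
A minimal sketch of how that constraint change could be made with Juju (values are illustrative; set-constraints only affects units added after the change, so existing workers would need to be replaced or have their disks grown out of band):

```
# Raise the root-disk constraint for any future kubernetes-worker units
juju set-constraints kubernetes-worker root-disk=50G

# Roll a worker onto the new constraint
# (the unit number is a placeholder for one of the existing units)
juju add-unit kubernetes-worker
juju remove-unit kubernetes-worker/<unit-number>
```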

Canonical Solutions QA Bot (oil-ci-bot) wrote:

This bug is fixed with commit bb6df750 to cpe-foundation on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cpe-foundation/commit/?id=bb6df750
