test_clusterip_service_endpoint fails on connection error in k8s 1.24 on ARM

Bug #1974207 reported by Bas de Bruijne
Affects: Charmed Kubernetes Testing
Status: Fix Released
Importance: Medium
Assigned to: Mateo Florido
Milestone: 1.24+ck1

Bug Description

In testrun https://solutions.qa.canonical.com/testruns/testRun/5a82c72a-b16d-449a-82ac-665c660aa4c2, with FCE output at https://oil-jenkins.canonical.com/job/fce_build/2153//console, we see test_clusterip_service_endpoint fail:

```
=================================== FAILURES ===================================
_______________________ test_clusterip_service_endpoint ________________________
Traceback (most recent call last):
  File "/home/ubuntu/k8s-validation/jobs/integration/test_service_endpoints.py", line 124, in test_clusterip_service_endpoint
    raise e
  File "/home/ubuntu/k8s-validation/jobs/integration/test_service_endpoints.py", line 121, in test_clusterip_service_endpoint
    assert "Hello Kubernetes!" in action.results.get("Stdout", "")
AssertionError: assert 'Hello Kubernetes!' in ''
 + where '' = <built-in method get of dict object at 0x7fd2bfb0f440>('Stdout', '')
 + where <built-in method get of dict object at 0x7fd2bfb0f440> = {'Code': '7', 'Stderr': '* Trying 10.152.183.132:80...\n* TCP_NODELAY set\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* connect to 10.152.183.132 port 80 failed: Connection refused\n* Failed to connect to 10.152.183.132 port 80: Connection refused\n\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\n* Closing connection 0\ncurl: (7) Failed to connect to 10.152.183.132 port 80: Connection refused\n'}.get
 + where {'Code': '7', 'Stderr': '* Trying 10.152.183.132:80...\n* TCP_NODELAY set\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* connect to 10.152.183.132 port 80 failed: Connection refused\n* Failed to connect to 10.152.183.132 port 80: Connection refused\n\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\n* Closing connection 0\ncurl: (7) Failed to connect to 10.152.183.132 port 80: Connection refused\n'} = <Action entity_id="26">.results
------------------------------ Captured log call -------------------------------

```

The bundle we are using is https://oil-jenkins.canonical.com/artifacts/5a82c72a-b16d-449a-82ac-665c660aa4c2/generated/generated/kubernetes-maas/bundle.yaml and these are ARM64 machines. We do not see this problem on AMD64 machines, nor did it happen on ARM64 with k8s 1.23.
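
The node architecture can be confirmed directly from the cluster; a minimal sketch, assuming a working kubeconfig for the affected deployment:

```
# Print each node's reported architecture (status.nodeInfo.architecture);
# on this deployment the workers should report "arm64".
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
```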

In the logs we see that this IP belongs to the service `default/hello-world`:

```
3/lxd/1/var/log/syslog-May 18 21:48:19 juju-615a59-3-lxd-1 kube-apiserver.daemon[371087]: I0518 21:48:19.883484 371087 httplog.go:131] "HTTP" verb="LIST" URI="/api/v1/namespaces/default/resourcequotas" latency="2.927089ms" userAgent="kube-apiserver/v1.24.0 (linux/arm64) kubernetes/4ce5a89" audit-ID="cc8735d9-3e98-4635-9802-79c8355b75b6" srcIP="[::1]:35452" apf_pl="exempt" apf_fs="exempt" apf_execution_time="2.691968ms" resp=200
3/lxd/1/var/log/syslog:May 18 21:48:19 juju-615a59-3-lxd-1 kube-apiserver.daemon[371087]: I0518 21:48:19.887041 371087 alloc.go:327] "allocated clusterIPs" service="default/hello-world" clusterIPs=map[IPv4:10.152.183.132]
3/lxd/1/var/log/syslog-May 18 21:48:19 juju-615a59-3-lxd-1 kube-apiserver.daemon[371087]: I0518 21:48:19.887521 371087 httplog.go:131] "HTTP" verb="POST" URI="/api/v1/namespaces/default/services?fieldManager=kubectl-expose" latency="16.499693ms" userAgent="kubectl/v1.24.0 (linux/amd64) kubernetes/4ce5a89" audit-ID="2914d3d0-d685-421b-8d8b-58010b6b1d25" srcIP="10.246.200.115:41676" apf_pl="exempt" apf_fs="exempt" apf_execution_time="15.977931ms" resp=201
```

But there is no further indication of why the connection is refused.
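
A connection refused on a ClusterIP usually means the service has no ready endpoints behind it, so a useful check (not captured in these logs) would be to look at the endpoints directly. A minimal sketch, assuming kubectl access to the cluster; the `app=hello-world` selector is the label `kubectl create deployment` applies by default:

```
# An empty ENDPOINTS column means kube-proxy has no backend to forward
# to, and connections to the ClusterIP are refused.
kubectl get endpoints hello-world -n default

# Cross-check the backing pods; a pod that never becomes Ready is never
# added to the endpoints list.
kubectl get pods -n default -l app=hello-world
```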

Logs can be found here:
https://oil-jenkins.canonical.com/artifacts/5a82c72a-b16d-449a-82ac-665c660aa4c2/index.html

Marian Gasparovic (marosg) wrote:

Update:

The rest of the tests work. There is something wrong with hello-world:

```
$ KUBECONFIG=~/project/generated/kubernetes-maas/kube.conf kubectl create deployment hello-world --image=rocks.canonical.com/cdk/google-samples/node-hello:1.0
deployment.apps/hello-world created
```

get-pods

```
NAMESPACE   NAME                           READY   STATUS             RESTARTS      AGE
default     hello-world-64cb8546c5-29sj4   0/1     CrashLoopBackOff   5 (22s ago)   3m11s
```

describe pod

```
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  62s                default-scheduler  Successfully assigned default/hello-world-64cb8546c5-29sj4 to sqa-lab2-node-3-arm
  Normal   Pulled     17s (x4 over 61s)  kubelet            Container image "rocks.canonical.com/cdk/google-samples/node-hello:1.0" already present on machine
  Normal   Created    17s (x4 over 61s)  kubelet            Created container node-hello
  Normal   Started    17s (x4 over 60s)  kubelet            Started container node-hello
  Warning  BackOff    1s (x6 over 59s)   kubelet            Back-off restarting failed container
```

```
$ KUBECONFIG=~/project/generated/kubernetes-maas/kube.conf kubectl logs hello-world-64cb8546c5-29sj4
exec /bin/sh: exec format error
```
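
That `exec format error` is the kernel refusing to run a binary built for a different architecture, which points at an amd64-only image landing on an arm64 node. A minimal way to confirm the image architecture from the registry, assuming skopeo and jq are available on the host (docker works too once the image is pulled):

```
# Query the registry-side image config; an amd64-only image reports
# "amd64" regardless of the node it gets scheduled onto.
skopeo inspect docker://rocks.canonical.com/cdk/google-samples/node-hello:1.0 | jq '.Architecture'

# Equivalent check against a locally pulled copy:
docker image inspect rocks.canonical.com/cdk/google-samples/node-hello:1.0 --format '{{.Architecture}}'
```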

Marian Gasparovic (marosg) wrote:

The information about this working on 1.23 with arm64 may be incorrect (the wrong source is me), because I am not 100% sure now whether I ran the suite or stopped after k8s was deployed. I think(TM) I ran k8s-suite, but...

Adam Dyess (addyess) wrote:

It seems the image used here, "node-hello" (https://console.cloud.google.com/gcr/images/google-samples/global/node-hello), was last published in 2016.

I believe it has since been *improved* into gcr.io/google-samples/hello-app:1.0, which is still built only for amd64. There is an issue on its repo confirming that there are as yet no plans to cross-build these images for anything other than amd64:

https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/issues/179#issuecomment-971753436

The long-term fix is to change the test to use a multiarch image and update it to check the HTTP responses from that service.

Another solution would be to rebuild the image with multiarch support and push it to rocks for our testing.
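
For that second route, a minimal sketch of a multiarch rebuild with docker buildx, assuming the image sources are checked out locally; the tag mirrors the existing rocks path and is only illustrative:

```
# One-time setup: a builder instance that can emit multi-platform images.
docker buildx create --use --name multiarch

# Build for both architectures and push a single multi-arch manifest.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t rocks.canonical.com/cdk/google-samples/node-hello:1.0 \
  --push .
```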

Adam Dyess (addyess)
Changed in charmed-kubernetes-testing:
status: New → In Progress
assignee: nobody → Mateo Florido (mateoflorido)
importance: Undecided → Medium

Changed in charmed-kubernetes-testing:
status: In Progress → Fix Committed
milestone: none → 1.24+ck1

Adam Dyess (addyess)
Changed in charmed-kubernetes-testing:
status: Fix Committed → Fix Released