test_clusterip_service_endpoint fails on connection error in k8s 1.24 on ARM

Bug #1974207 reported by Bas de Bruijne
Affects: Charmed Kubernetes Testing
Status: Fix Released
Importance: Medium
Assigned to: Mateo Florido
Milestone: 1.24+ck1

Bug Description

In testrun https://solutions.qa.canonical.com/testruns/testRun/5a82c72a-b16d-449a-82ac-665c660aa4c2, with FCE output at https://oil-jenkins.canonical.com/job/fce_build/2153//console, we see test_clusterip_service_endpoint fail:

```
=================================== FAILURES ===================================
_______________________ test_clusterip_service_endpoint ________________________
Traceback (most recent call last):
  File "/home/ubuntu/k8s-validation/jobs/integration/test_service_endpoints.py", line 124, in test_clusterip_service_endpoint
    raise e
  File "/home/ubuntu/k8s-validation/jobs/integration/test_service_endpoints.py", line 121, in test_clusterip_service_endpoint
    assert "Hello Kubernetes!" in action.results.get("Stdout", "")
AssertionError: assert 'Hello Kubernetes!' in ''
 + where '' = <built-in method get of dict object at 0x7fd2bfb0f440>('Stdout', '')
 + where <built-in method get of dict object at 0x7fd2bfb0f440> = {'Code': '7', 'Stderr': '* Trying 10.152.183.132:80...\n* TCP_NODELAY set\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* connect to 10.152.183.132 port 80 failed: Connection refused\n* Failed to connect to 10.152.183.132 port 80: Connection refused\n\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\n* Closing connection 0\ncurl: (7) Failed to connect to 10.152.183.132 port 80: Connection refused\n'}.get
 + where {'Code': '7', 'Stderr': '* Trying 10.152.183.132:80...\n* TCP_NODELAY set\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* connect to 10.152.183.132 port 80 failed: Connection refused\n* Failed to connect to 10.152.183.132 port 80: Connection refused\n\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\n* Closing connection 0\ncurl: (7) Failed to connect to 10.152.183.132 port 80: Connection refused\n'} = <Action entity_id="26">.results
------------------------------ Captured log call -------------------------------

```

The bundle we are using is https://oil-jenkins.canonical.com/artifacts/5a82c72a-b16d-449a-82ac-665c660aa4c2/generated/generated/kubernetes-maas/bundle.yaml and these are ARM64 machines. We do not see this problem on AMD64 machines, nor did it happen on ARM64 with k8s 1.23.
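
The node architecture can be confirmed directly from the cluster; a minimal sketch, assuming a working kubeconfig for the affected deployment:

```
# Print each node's reported architecture (status.nodeInfo.architecture);
# on this deployment the workers should report "arm64".
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
```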

In the logs we see that this IP belongs to the service `default/hello-world`:

```
3/lxd/1/var/log/syslog-May 18 21:48:19 juju-615a59-3-lxd-1 kube-apiserver.daemon[371087]: I0518 21:48:19.883484 371087 httplog.go:131] "HTTP" verb="LIST" URI="/api/v1/namespaces/default/resourcequotas" latency="2.927089ms" userAgent="kube-apiserver/v1.24.0 (linux/arm64) kubernetes/4ce5a89" audit-ID="cc8735d9-3e98-4635-9802-79c8355b75b6" srcIP="[::1]:35452" apf_pl="exempt" apf_fs="exempt" apf_execution_time="2.691968ms" resp=200
3/lxd/1/var/log/syslog:May 18 21:48:19 juju-615a59-3-lxd-1 kube-apiserver.daemon[371087]: I0518 21:48:19.887041 371087 alloc.go:327] "allocated clusterIPs" service="default/hello-world" clusterIPs=map[IPv4:10.152.183.132]
3/lxd/1/var/log/syslog-May 18 21:48:19 juju-615a59-3-lxd-1 kube-apiserver.daemon[371087]: I0518 21:48:19.887521 371087 httplog.go:131] "HTTP" verb="POST" URI="/api/v1/namespaces/default/services?fieldManager=kubectl-expose" latency="16.499693ms" userAgent="kubectl/v1.24.0 (linux/amd64) kubernetes/4ce5a89" audit-ID="2914d3d0-d685-421b-8d8b-58010b6b1d25" srcIP="10.246.200.115:41676" apf_pl="exempt" apf_fs="exempt" apf_execution_time="15.977931ms" resp=201
```

But there is no further indication of why the connection is refused.
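
A connection refused on a ClusterIP usually means the service has no ready endpoints behind it, so a useful check (not captured in these logs) would be to look at the endpoints directly. A minimal sketch, assuming kubectl access to the cluster; the `app=hello-world` selector is the label `kubectl create deployment` applies by default:

```
# An empty ENDPOINTS column means kube-proxy has no backend to forward
# to, and connections to the ClusterIP are refused.
kubectl get endpoints hello-world -n default

# Cross-check the backing pods; a pod that never becomes Ready is never
# added to the endpoints list.
kubectl get pods -n default -l app=hello-world
```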

Logs can be found here:
https://oil-jenkins.canonical.com/artifacts/5a82c72a-b16d-449a-82ac-665c660aa4c2/index.html

Marian Gasparovic (marosg) wrote:

Update:

The rest of the tests work. There is something wrong with hello-world:

```
$ KUBECONFIG=~/project/generated/kubernetes-maas/kube.conf kubectl create deployment hello-world --image=rocks.canonical.com/cdk/google-samples/node-hello:1.0
deployment.apps/hello-world created
```

get-pods

```
NAMESPACE   NAME                           READY   STATUS             RESTARTS      AGE
default     hello-world-64cb8546c5-29sj4   0/1     CrashLoopBackOff   5 (22s ago)   3m11s
```

describe pod

```
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  62s                default-scheduler  Successfully assigned default/hello-world-64cb8546c5-29sj4 to sqa-lab2-node-3-arm
  Normal   Pulled     17s (x4 over 61s)  kubelet            Container image "rocks.canonical.com/cdk/google-samples/node-hello:1.0" already present on machine
  Normal   Created    17s (x4 over 61s)  kubelet            Created container node-hello
  Normal   Started    17s (x4 over 60s)  kubelet            Started container node-hello
  Warning  BackOff    1s (x6 over 59s)   kubelet            Back-off restarting failed container
```

```
$ KUBECONFIG=~/project/generated/kubernetes-maas/kube.conf kubectl logs hello-world-64cb8546c5-29sj4
exec /bin/sh: exec format error
```
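
That `exec format error` is the kernel refusing to run a binary built for a different architecture, which points at an amd64-only image landing on an arm64 node. A minimal way to confirm the image architecture from the registry, assuming skopeo and jq are available on the host (docker works too once the image is pulled):

```
# Query the registry-side image config; an amd64-only image reports
# "amd64" regardless of the node it gets scheduled onto.
skopeo inspect docker://rocks.canonical.com/cdk/google-samples/node-hello:1.0 | jq '.Architecture'

# Equivalent check against a locally pulled copy:
docker image inspect rocks.canonical.com/cdk/google-samples/node-hello:1.0 --format '{{.Architecture}}'
```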

Marian Gasparovic (marosg) wrote:

The information about this working on 1.23 with arm64 may be incorrect (the wrong source is me), because I am not 100% sure now whether I ran the suite or stopped after k8s was deployed. I think(TM) I ran k8s-suite, but...

Adam Dyess (addyess) wrote:

It seems the image used here, "node-hello" (https://console.cloud.google.com/gcr/images/google-samples/global/node-hello), was last published in 2016.

I believe it has since been *improved* into gcr.io/google-samples/hello-app:1.0, which is still built only for amd64. There is an issue on its repo confirming that there are as yet no plans to cross-build these images for anything other than amd64:

https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/issues/179#issuecomment-971753436

The long-term fix is to change the test to use a multiarch image and update it to check the HTTP responses from that service.

Another solution would be to rebuild the image with multiarch support and push it to rocks for our testing.
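
For that second route, a minimal sketch of a multiarch rebuild with docker buildx, assuming the image sources are checked out locally; the tag mirrors the existing rocks path and is only illustrative:

```
# One-time setup: a builder instance that can emit multi-platform images.
docker buildx create --use --name multiarch

# Build for both architectures and push a single multi-arch manifest.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t rocks.canonical.com/cdk/google-samples/node-hello:1.0 \
  --push .
```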

Adam Dyess (addyess)
Changed in charmed-kubernetes-testing:
status: New → In Progress
assignee: nobody → Mateo Florido (mateoflorido)
importance: Undecided → Medium

Changed in charmed-kubernetes-testing:
status: In Progress → Fix Committed
milestone: none → 1.24+ck1

Adam Dyess (addyess)
Changed in charmed-kubernetes-testing:
status: Fix Committed → Fix Released