Magnum

[k8s][calico][fedora-atomic] Periodically lost connection from pod to apiserver

Bug #1757325 reported by Feilong Wang on 2018-03-21

This bug affects 1 person

Affects	Status	Importance	Assigned to
Magnum	Status tracked in Rocky
Queens	Fix Released	High	Feilong Wang
Rocky	Fix Released	High	Feilong Wang

Bug Description

In my local devstack environment, with k8s and Calico running on Fedora Atomic, the kubernetes-dashboard and coredns pods restart frequently. The kubernetes-dashboard log shows errors like this:

[fedora@k8scluster-onwg77d4b4qx-master-0 docker.service.d]$ kubectl logs kubernetes-dashboard-846b8b6844-4xmd4 -n kube-system
2018/03/21 01:42:25 Starting overwatch
2018/03/21 01:42:25 Using in-cluster config to connect to apiserver
2018/03/21 01:42:25 Using service account token for csrf signing
2018/03/21 01:42:25 No request provided. Skipping authorization
2018/03/21 01:42:35 Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.254.0.1:443/version: dial tcp 10.254.0.1:443: getsockopt: no route to host
Refer to our FAQ and wiki pages for more information: https://github.com/kubernetes/dashboard/wiki/FAQ

Running tcpdump -i <calico interface of k8s dashboard> shows that the k8s dashboard loses its connection about every 8 minutes.

I then ran ip monitor and got the following output:

9: cali2241f02b2c3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 5
Deleted ff00::/8 dev cali2241f02b2c3 table local metric 256 pref medium
ff00::/8 dev cali2241f02b2c3 table local metric 256 pref medium

10: cali7c86e9585c6 inet6 fe80::8496:10df:436c:e48/64 scope link
valid_lft forever preferred_lft forever
192.168.25.199 dev cali68b21e04cb8 scope link
192.168.25.200 dev cali80465958db0 scope link
192.168.25.201 dev caliae3fbe26c95 scope link
192.168.25.202 dev calif7cbaf34e8b scope link
192.168.25.203 dev calif2e3ad1ce01 scope link
192.168.25.204 dev cali2241f02b2c3 scope link
192.168.25.205 dev cali7c86e9585c6 scope link

More logs are available at http://paste.openstack.org/show/707024/
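
To make the route churn easier to spot in a long ip monitor stream, the output can be filtered down to deletion events on Calico interfaces. The following is only a sketch: the function name filter_cali_route_deletions is hypothetical, and the pattern assumes the interfaces use the cali prefix seen in the output above.

```shell
#!/bin/sh
# Sketch: keep only "Deleted ..." events that refer to a cali* interface.
# Reads ip monitor output on stdin. The function name is hypothetical.
filter_cali_route_deletions() {
    # --line-buffered makes matches appear immediately when reading a live pipe.
    grep --line-buffered -E '^Deleted .* dev cali[0-9a-f]+'
}
```

A possible use on the affected node is `ip monitor route | filter_cali_route_deletions`, left running for longer than the 8-minute interval so that at least one deletion cycle is captured.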

So the pod routes are clearly being deleted and recreated. After checking with a Calico developer, I learned that this is caused by NetworkManager, which performs some 'magic' dynamic interface reconfiguration intended for desktop environments and has side effects in server environments. I don't really understand why Fedora Atomic ships NetworkManager, though, because IIUC NetworkManager is only meant for desktop environments.

So the fix is to have NetworkManager stop managing the Calico interfaces until we can remove it from the Fedora Atomic image.
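
One way to tell NetworkManager to leave the Calico interfaces alone is a drop-in file that marks them as unmanaged. The sketch below is not necessarily what the Magnum patch does. The function name write_calico_nm_conf and the file name calico.conf are hypothetical, and including tunl* (Calico's IP-in-IP tunnel device) is an assumption beyond the cali* interfaces discussed in this report.

```shell
#!/bin/sh
# Sketch: write a NetworkManager drop-in that marks Calico interfaces as unmanaged.
# The function name and the file name calico.conf are hypothetical.
# $1: target directory; defaults to NetworkManager's conf.d drop-in directory.
write_calico_nm_conf() {
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    mkdir -p "$conf_dir"
    # cali* covers the per-pod veth interfaces; tunl* is an assumption (IP-in-IP tunnel).
    cat > "$conf_dir/calico.conf" <<'EOF'
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*
EOF
}
```

On a node this would be run as root with no argument, followed by a restart of NetworkManager so the setting takes effect. Passing a scratch directory as the argument allows the generated file to be inspected first.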

Feilong Wang (flwang) on 2018-03-21

Changed in magnum:
assignee:	nobody → Feilong Wang (flwang)

Spyros Trigazis (strigazi) wrote on 2018-03-30:

Is there a patch for this?

Feilong Wang (flwang) wrote on 2018-05-09:

Fixed by https://review.openstack.org/#/c/548139/