Kubernetes on OpenStack cannot successfully create load balancers when running as a non-cloud-admin

Bug #1840421 reported by Lorenzo Cavassa
Affects                         Status        Importance  Assigned to  Milestone
Kubernetes Control Plane Charm  Fix Released  High        Cory Johns
Openstack Integrator Charm      Fix Released  High        Cory Johns

Bug Description

We are using Kubernetes 1.14 deployed via CDK charms (master: cs:~containers/kubernetes-master-700, worker: cs:~containers/kubernetes-worker-552; openstack integrator: cs:~containers/openstack-integrator-8).
Our openstack integrator is configured with credentials (i.e. not using juju trust). Importantly, the project-name config value is not the OpenStack admin project but a management project (and in other scenarios will be a customer project).
OpenStack is Rocky with Octavia - latest charms.
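For context, the integrator is configured along these lines (values redacted; apart from project-name, which is the option mentioned above, the option names here are from memory and may differ slightly from the charm's actual config keys):

juju config openstack-integrator \
    auth-url=https://auth.example.internal:5000/v3 \
    username=<svc-user> password=<secret> \
    user-domain-name=<domain> project-domain-name=<domain> \
    project-name=<management-project>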
We can stand up load balancers in OpenStack via the CLI without issue; however, we get failures when standing them up from K8s. This is the error:
Every 2.0s: kubectl describe service ohc-mgmt-prometheus-grafana -n ohc-monitoring    juju-97d08c-0-lxd-1: Wed Aug 14 13:28:53 2019

Name:                     ohc-mgmt-prometheus-grafana
Namespace:                ohc-monitoring
Labels:                   app=grafana
                          chart=grafana-3.7.3
                          heritage=Tiller
                          release=ohc-mgmt-prometheus
Annotations:              <none>
Selector:                 app=grafana,release=ohc-mgmt-prometheus
Type:                     LoadBalancer
IP:                       10.152.183.187
Port:                     service  80/TCP
TargetPort:               3000/TCP
NodePort:                 service  32470/TCP
Endpoints:                10.1.44.5:3000
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                      Age                   From                Message
  ----     ------                      ---                   ----                -------
  Normal   EnsuringLoadBalancer        2m5s (x20 over 101m)  service-controller  Ensuring load balancer
  Warning  CreatingLoadBalancerFailed  38s (x11 over 64m)    service-controller  (combined from similar events): Error creating load balancer (will retry): failed to ensure load balancer for service ohc-monitoring/ohc-mgmt-prometheus-grafana: Error occurred updating port 10db85cd-811d-40fa-93ed-753db98d29a1 for loadbalancer service ohc-monitoring/ohc-mgmt-prometheus-grafana: Resource not found

We have hacked some of the neutron code to spit out better log messages, and we get the following when this happens:

2019-08-13 22:24:37.633 1863379 INFO neutron.pecan_wsgi.hooks.translation [req-140d3d7b-5a3e-438c-b5d8-b403e5af7c1f 3801cc85c85549a0ad21d5acc490a0fd 08d811a3068248acb1ce950d851def88 - 172176bb0ba34beca92b17508ce29803 172176bb0ba34beca92b17508ce29803] Security group ccb5f7b0-9eaf-4038-b804-54149d8800bf does not exist
2019-08-13 22:24:37.634 1863379 INFO neutron.pecan_wsgi.hooks.translation [req-140d3d7b-5a3e-438c-b5d8-b403e5af7c1f 3801cc85c85549a0ad21d5acc490a0fd 08d811a3068248acb1ce950d851def88 - 172176bb0ba34beca92b17508ce29803 172176bb0ba34beca92b17508ce29803] PUT failed (client error): The resource could not be found.

This security group is the Octavia-created security group for the FIP, which is created in the services_domain/services project and is already added to the port by Octavia. We believe the issue is that the old embedded OpenStack cloud provider code is being used by the charms. After creating the LB, the K8s OpenStack cloud provider creates its own security group and tries to add it to the port alongside the existing Octavia-generated one. The PUT therefore contains both security groups (the Octavia-created one from the services project and the new one from our management project) and fails because Neutron validates the security groups and the Octavia-created one cannot be accessed by the management-project-scoped token that K8s is using.
Kubernetes therefore aborts the load balancer, issues a delete and tears it back down again. See the code in the OpenStack cloud provider here - https://github.com/kubernetes/cloud-provider-openstack/blob/release-1.13/pkg/cloudprovider/providers/openstack/openstack_loadbalancer.go#L1231
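To illustrate the failure outside of Kubernetes, here is a rough openstacksdk sketch of the update we believe the provider performs (the clouds.yaml entry and security group name are placeholders; this is our reading of the provider's behaviour, not its actual code):

import openstack

# Connect scoped to the management project, the same scope the K8s provider uses.
conn = openstack.connect(cloud='management-project')  # hypothetical clouds.yaml entry

# The port already carries the Octavia-created SG from the services project.
port = conn.network.get_port('10db85cd-811d-40fa-93ed-753db98d29a1')

# The provider creates its own SG, appends it, and PUTs the combined list back.
new_sg = conn.network.find_security_group('k8s-lb-sg')  # placeholder name
sg_ids = list(port.security_group_ids) + [new_sg.id]

# Neutron validates every SG id against the caller's project; the Octavia SG is
# not visible from the management project, so this fails with "Resource not found"
# / "Security group ... does not exist", just like the service-controller event.
conn.network.update_port(port, security_group_ids=sg_ids)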

This looks very similar to this bug https://bugs.launchpad.net/octavia/+bug/1627780.

We have confirmed that the management project user cannot see the Octavia-generated security group: an "openstack security group show xxxx" of the lb-* security group Octavia created fails with an error similar to the one K8s gets ("Error while executing command: No SecurityGroup found for 9bede94a-2121-4d76-8bf4-8014751eaf39"), whilst the same user scoped into the admin project can see it.
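For reference, the check was along these lines (the ID is from our environment; using --os-project-name is just one way of switching the token scope):

# scoped to the management project - fails the same way K8s does
openstack security group show 9bede94a-2121-4d76-8bf4-8014751eaf39
Error while executing command: No SecurityGroup found for 9bede94a-2121-4d76-8bf4-8014751eaf39

# same user scoped to the admin project - the lb-* security group is returned
openstack --os-project-name admin security group show 9bede94a-2121-4d76-8bf4-8014751eaf39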

This *may* be fixed by the external cloud provider code, as we note the code is different and does not try to create a security group on the VIP if Octavia is detected; however, we have not tested this.

Tags: atos
Lorenzo Cavassa (lorenzo-cavassa) wrote :

From looking at the code, the edge charms set a flag in the cloud-provider config files that says Octavia is used as the LBaaS. The logic is then different for Octavia vs non-Octavia LBaaS and skips the security group issue we are having... or at least that's the theory, as I can't get them to work!

See: https://github.com/charmed-kubernetes/layer-kubernetes-common/blame/master/lib/charms/layer/kubernetes_common.py#L456

and how the logic in the openstack cloud provider is linked to this:

https://github.com/kubernetes/cloud-provider-openstack/blame/release-1.14/pkg/cloudprovider/providers/openstack/openstack_loadbalancer.go#L1441
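For reference, that flag ends up in the [LoadBalancer] section of the cloud-config the charm renders; something like this (the exact key is from my reading of the code rather than verified charm output, so treat it as an assumption):

[LoadBalancer]
use-octavia = true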

Lorenzo Cavassa (lorenzo-cavassa) wrote :

We spoke to the K8s team and they have suggested upgrading to the latest edge charms and the K8s edge channel. This switches the in-tree OpenStack provider for an external OpenStack provider.

But we hit other, completely unrelated issues when trying to stand up the K8s cluster. The cluster doesn't really form: the openstack-cloud-controller-manager pods all stay in CrashLoopBackOff, and I can't get the logs from them, as doing so gives me "Error from server: no preferred addresses found; known addresses: []". So I presume I'm hitting unrelated issues with the edge solution.

The issue seems to be that the cloud-config secret contains the following path to the CA cert:
[Global]
auth-url = https://auth.ohc01.customerb.internal:5000/v3
...
ca-file = /etc/kubernetes/openstack-ca.cer

whereas the pod spec mounts the cloud-config secret at /etc/config, so I think the CA file will actually be at /etc/config/endpoint-ca.cert (a different path and filename).
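If that's right, the minimal fix would be to point ca-file at the mounted location; a sketch, assuming the secret really is mounted at /etc/config with the key endpoint-ca.cert:

[Global]
auth-url = https://auth.ohc01.customerb.internal:5000/v3
...
ca-file = /etc/config/endpoint-ca.cert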

Lorenzo Cavassa (lorenzo-cavassa) wrote :

A change in Neutron Server to securitygroups.py between 13.0.2 and 13.0.4 effectively broke the mechanism that the non-Octavia-aware K8s cloud provider used to stand up load balancers, due to new security restrictions on the security group applied to the FIP. I haven't followed all the code, but I suspect it was this commit: https://github.com/openstack/neutron/commit/2eb31f84c9a6c9fc6340819f756a7a82cbf395f3
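Roughly, my reading is that port updates now re-validate every security group ID in the request against the caller's project, so a cross-project group that is already attached makes the whole PUT fail. An illustrative Python sketch of that behaviour (not the actual Neutron code; all names and IDs below are made up):

class SecurityGroupNotFound(Exception):
    pass

def validate_port_security_groups(requested_sg_ids, visible_sg_ids):
    # Every SG in the update must be visible to the caller's project,
    # even one that another project (Octavia) has already attached.
    for sg_id in requested_sg_ids:
        if sg_id not in visible_sg_ids:
            raise SecurityGroupNotFound(f"Security group {sg_id} does not exist")

# The Octavia-created SG lives in the services project, so it is not in the
# management project's visible set and the combined update is rejected:
validate_port_security_groups(
    requested_sg_ids={"octavia-lb-sg", "k8s-created-sg"},  # hypothetical ids
    visible_sg_ids={"k8s-created-sg"},
)  # raises SecurityGroupNotFound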

Cory Johns (johnsca) wrote :

Relevant PRs:

https://github.com/juju-solutions/interface-openstack-integration/pull/11
https://github.com/juju-solutions/charm-openstack-integrator/pull/18
https://github.com/charmed-kubernetes/layer-kubernetes-common/pull/4

Once the integrator charm change is in stable, the latter PR will also be back-ported to the K8s stable charms.

Note that while this will fix things for the Octavia case, presumably older OpenStack deployments with only Neutron-based LBaaS would still be broken, but I don't see a way around that beyond patching either upstream OpenStack or upstream Kubernetes. And I'm not aware that this is a blocker with older OpenStacks, at this point.

Changed in charm-openstack-integrator:
status: New → In Progress
Changed in charm-kubernetes-master:
status: New → In Progress
assignee: nobody → Cory Johns (johnsca)
Changed in charm-openstack-integrator:
assignee: nobody → Cory Johns (johnsca)
Cory Johns (johnsca)
Changed in charm-kubernetes-master:
importance: Undecided → High
Changed in charm-openstack-integrator:
importance: Undecided → High
Lorenzo Cavassa (lorenzo-cavassa) wrote :

The issue affects charm-kubernetes-worker as well.

Cory Johns (johnsca) wrote :

I'm still working on testing this on Octavia, but I have confirmed (via serverstack) that the fix properly avoids setting the flag when Octavia is not, in fact, available, and that in that case we still hit the upstream bug (which can't really be avoided, since using Octavia isn't an option).

Until I can complete my testing, this is available via the edge channel (cs:~containers/openstack-integrator-23) and can be used with the edge channel of the charmed-kubernetes bundle (cs:~containers/charmed-kubernetes-218).
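To try it before the stable release, something like the following should work (charm URLs as above; exact syntax depends on your Juju version):

juju upgrade-charm openstack-integrator --switch cs:~containers/openstack-integrator-23

or, for a fresh deployment, use the edge bundle:

juju deploy cs:~containers/charmed-kubernetes-218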

Cory Johns (johnsca) wrote :

All testing is finished and fix confirmed. The fix has also been backported to stable charms and will be available in the next bug-fix release.

Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-openstack-integrator:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
milestone: none → 1.15+ck2
Changed in charm-openstack-integrator:
milestone: none → 1.15+ck2
Cory Johns (johnsca) wrote :

The integrator charm portion has been released to its stable channel as cs:~containers/openstack-integrator-24

The corresponding K8s changes will become available with the next bug-fix release, as previously mentioned.

Changed in charm-openstack-integrator:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-master:
milestone: 1.15+ck2 → 1.16
Changed in charm-openstack-integrator:
milestone: 1.15+ck2 → 1.16
milestone: 1.16 → none
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released