After upgrading kubernetes-master, it is stuck at waiting for master components to start

Bug #1825819 reported by Seyeong Kim
This bug affects 2 people
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: High
Assigned to: Peter De Sousa
Milestone: 1.16

Bug Description

After upgrading kubernetes-master, it is stuck at "waiting for components to start".

I analyzed some code.

1. install_snaps() removes the kubernetes-master.components.started flag.
2. set_state('kubernetes-master.components.started') is never called again after that; see the sketch below.
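
A minimal sketch of the flag lifecycle I mean, using charms.reactive (the handler body here is a placeholder, not the actual charm code):

    from charms.reactive import remove_state

    def install_snaps():
        # ...install/refresh the kubernetes-master snaps...
        # On upgrade-charm this clears the flag so the components get
        # restarted afterwards:
        remove_state('kubernetes-master.components.started')

    # The problem: after the upgrade nothing calls
    # set_state('kubernetes-master.components.started') again, so the unit
    # stays at "waiting for components to start".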

It can be reproduced easily via the steps below:

1. juju deploy kubernetes.yaml (https://pastebin.canonical.com/p/PNJ5fDnVsK/)
2. juju upgrade-charm kubernetes-master

Thanks.

Tags: sts
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Changing remove_state('kubernetes-master.components.started') to set_state('..') fixes this issue, but I'm not sure this is the proper way to do it.
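
Roughly, the change looks like this (a sketch of the diff against reactive/kubernetes_master.py; I'm assuming the flag name stays the same):

    -    remove_state('kubernetes-master.components.started')
    +    set_state('kubernetes-master.components.started')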

I'll make a PR and get some advice.

tags: added: sts
Revision history for this message
George Kraft (cynerva) wrote :

> 1. install_snaps() delete kubernetes-master.components.started flag

This is normal, and correct. This flag is specifically deleted here to cause start_master[1] to run again.

> 2. never set_state kubernetes-master.components.started again after that.

This is not normal -- it's supposed to be set by start_master. It sounds like start_master isn't running.
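
For reference, the handler is gated roughly like this (a simplified sketch; the real start_master in reactive/kubernetes_master.py has a longer prerequisite list -- see [1]):

    from charms.reactive import set_state, when, when_not

    @when('etcd.available')  # plus several other prerequisite flags
    @when_not('kubernetes-master.components.started')
    def start_master(etcd):
        # ...configure and (re)start the control plane services...
        set_state('kubernetes-master.components.started')

    # If any prerequisite flag is missing, start_master never runs, so the
    # flag cleared by install_snaps is never set again.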

Output from these commands would help us narrow this down:

juju status --format yaml
juju run --application kubernetes-master -- charms.reactive get_flags
juju debug-log --replay

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/28610d903ac798b3f9df4897b8f5d8bf9652e9fe/reactive/kubernetes_master.py#L654

Changed in charm-kubernetes-master:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Yeah, I think start_master wasn't run.

I learned about the get_flags command here.

I've uploaded the requested command output.

Thanks.

Seyeong Kim (seyeongkim)
description: updated
Revision history for this message
Mike Wilson (knobby) wrote :

Thanks for the output, that really helped. Your problem is that you don't have the tls_client.certs.saved flag. I noticed that the latest easyrsa charm is revision 231 and you have revision 192. I believe there was an interface change between those revisions, which could result in the error you are seeing. Please upgrade easyrsa and let us know if the problem persists.

I would also note that there are upgrades available for etcd and kubernetes-worker as well.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :
Revision history for this message
Paul Goins (vultaire) wrote :

I'm not convinced that upgrading caused it. It could have been coincidental, as 5 days prior to the observations in that ticket, a kubernetes-master upgrade was done. It's quite possible that the problem described in that bug was actually due to the kubernetes-master upgrade rather than the easyrsa upgrade.

Peter De Sousa (pjds)
Changed in charm-kubernetes-master:
assignee: nobody → Peter De Sousa (pjds)
status: Triaged → In Progress
Revision history for this message
Peter De Sousa (pjds) wrote :

Having looked into this in more detail, the problem seems to be caused by the older version of easyrsa calling set_client_cert [1].

This then updates a dictionary, `to_publish_raw`, which is picked up by the deprecated get_client_cert [2].

This is fine, but recent versions of layer-tls-client use client_cert_maps to provide certificates on a per-host basis, and that path does not pick up `to_publish_raw` [3, 4].

I am proposing a fix where layer-tls-client checks for per-server certs and falls back to the global cert when no per-server cert is found; see the sketch after the references below.

[1] Note that the old method updates to_publish_raw:
    https://github.com/charmed-kubernetes/interface-tls-certificates/blob/2fc3f1ee969bad4431b18428993776e82e122309/provides.py#L80
[2] https://github.com/charmed-kubernetes/interface-tls-certificates/blame/2fc3f1ee969bad4431b18428993776e82e122309/requires.py#L164
[3] https://github.com/charmed-kubernetes/interface-tls-certificates/blame/2fc3f1ee969bad4431b18428993776e82e122309/requires.py#L244
[4] https://github.com/charmed-kubernetes/interface-tls-certificates/blame/2fc3f1ee969bad4431b18428993776e82e122309/requires.py#L89
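
A minimal sketch of that fallback, using the names mentioned above (client_cert_maps, get_client_cert); the helper name and return shape are illustrative, not the exact interface API:

    def resolve_client_cert(tls, common_name):
        """Prefer a per-server cert; fall back to the global client cert
        that older providers publish via to_publish_raw."""
        per_host_cert = tls.client_cert_maps.get(common_name)
        if per_host_cert is not None:
            return per_host_cert
        # Deprecated path still used by older easyrsa revisions.
        return tls.get_client_cert()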

Revision history for this message
Peter De Sousa (pjds) wrote :

Opened Draft PR: https://github.com/juju-solutions/layer-tls-client/pull/18

Leaving the cluster running over the weekend; will run upgrade/usage tests.

Revision history for this message
Peter De Sousa (pjds) wrote :
Revision history for this message
Peter De Sousa (pjds) wrote :

Created PR for Kubernetes master to address RBAC concerns: https://github.com/charmed-kubernetes/charm-kubernetes-master/pull/42

George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
milestone: none → 1.16
Revision history for this message
Jay Kuri (jk0ne) wrote :

FWIW: if you are caught by this during an upgrade, upgrading the easyrsa charm will resolve the issue. Once the easyrsa charm finishes updating, kubernetes-master will see it and get unstuck.

Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released