Comment 5 for bug 1929234

Revision history for this message
Gabriel Cocenza (gabrielcocenza) wrote :

I was able to reproduce the bug in a test environment, and the logs[1] show that pacemaker and corosync are not able to recover the failed service (kube-scheduler in my case). Rebooting the unit or restarting the service seems to resolve the issue, but that defeats the purpose of having the unit heal itself from an eventual failure.

Talking with the OpenStack team, the added template [2] that makes systemd always restart the services might cause this conflict. In an environment where pacemaker and corosync are not present, this template fulfills the need to restart a service if it crashes. On the other hand, in an environment where hacluster is present and running, I think it makes sense for the resources to be managed only by pacemaker.

The approach I'm considering is to remove the always-restart.conf drop-in while configuring the cluster, once the HA relation is connected.
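As a rough sketch of what that could look like (the function name, drop-in path layout, and return convention here are illustrative assumptions, not the charm's actual API):

```python
import os


def remove_always_restart(service, dropin_root="/etc/systemd/system"):
    """Remove the always-restart.conf drop-in for a service so that
    pacemaker, not systemd, decides when to restart it.

    Returns True if a drop-in was removed, in which case the caller
    should follow up with `systemctl daemon-reload`.
    """
    # Hypothetical layout: /etc/systemd/system/<svc>.service.d/always-restart.conf
    dropin = os.path.join(dropin_root, "%s.service.d" % service,
                          "always-restart.conf")
    if not os.path.exists(dropin):
        return False
    os.remove(dropin)
    # Clean up the .service.d directory too if it is now empty.
    dropin_dir = os.path.dirname(dropin)
    if not os.listdir(dropin_dir):
        os.rmdir(dropin_dir)
    return True
```

Something like this would be called for each control-plane service (kube-scheduler, kube-controller-manager, etc.) from the handler that configures the cluster when hacluster joins.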

[1] https://pastebin.canonical.com/p/p9T2tCdsNn/
[2] https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/master/reactive/kubernetes_master.py#L958-L982