OpenStack HA Cluster Charm

hacluster charm upgrade will not fix existing duplicate VIP issue

Bug #1866145 reported by Trent Lloyd on 2020-03-05

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack HA Cluster Charm	In Progress	Medium	Unassigned

Bug Description

* Bug Description *

In Bug #1838528 we fixed an issue where pacemaker resources were not being stopped before removal, so the removal failed. This led to duplicate VIP resources, with different names but the same IP address, being created and potentially allocated to different nodes.

The fix for that issue was to stop each resource before deleting it, as a resource won't delete unless it is stopped.
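
The ordering that fix relies on can be sketched as follows. This is a minimal illustration, not the charm's actual code: the helper names `stop_then_delete_commands` and `remove_resource` are hypothetical, and the sketch assumes crmsh's `crm resource stop` and `crm configure delete` subcommands are available on the unit.

```python
import subprocess


def stop_then_delete_commands(resource):
    """Return the crm commands, in order, needed to remove a resource.

    Hypothetical helper for illustration: the stop command comes first
    because a running resource will not be deleted.
    """
    return [
        ["crm", "resource", "stop", resource],
        ["crm", "configure", "delete", resource],
    ]


def remove_resource(resource, run=subprocess.check_call):
    """Stop and then delete a resource, running each command in order.

    `run` is injectable so the sequence can be inspected without a cluster.
    """
    for cmd in stop_then_delete_commands(resource):
        run(cmd)


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    remove_resource("res_nova_ens3_vip", run=lambda cmd: print(" ".join(cmd)))
```

A real implementation would probably also need to wait for the stop to complete before attempting the delete; that detail is left out of this sketch.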

However, this fix only works if hacluster is upgraded before the principal charm. If you upgrade the principal charm first and cause the problem, upgrading hacluster later will not fix it.

* Bug Cause *
This is because the code to reconfigure CRM only executes in the ha_relation_changed function. That function is called only on an actual ha-relation-{joined,changed} event, or from hanode_relation_changed, which itself is only called on a hanode-relation-changed hook.

This means that the CRM configuration is not re-applied on either upgrade_charm or config_changed, so an environment that upgrades its principal charm first and hacluster second will trigger the issue and never fix it.

Secondly, this means more generally that any config change handled by the ha_relation_changed code won't be applied when it is made, but may be applied later when a charm happens to trigger a relation change.

This has been hit in multiple production environments and is critical because the duplicate VIPs cause random problems in the environment.

This problem applies to any charm using hacluster; nova-cloud-controller is only an example.

* Suggested Fix *

We should call ha_relation_changed during upgrade_charm and probably also config_changed.

This is a heavy-weight function, though, so we should make sure it is actually needed by config-changed, but I don't see the harm in using it for upgrade-charm. We should also make sure it correctly respects and works with the logic used and recommended in the deployment upgrade guide to 'pause' hacluster, etc.
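
The shape of this suggestion might look roughly like the sketch below. It is a stand-alone model rather than the hacluster charm's code: `ha_relation_changed` here is only a stub that records that it ran, and the `paused` parameter is a hypothetical stand-in for however the charm tracks its paused state.

```python
reconfigured = []  # records each CRM reconfiguration, for illustration only


def ha_relation_changed():
    """Stub for the heavy-weight function that reconfigures CRM."""
    reconfigured.append("crm-reconfigured")


def upgrade_charm(paused=False):
    """Re-apply the CRM configuration on upgrade unless the unit is paused.

    Returns True if the reconfiguration ran, False if it was skipped.
    """
    if paused:
        # Assumption: a paused unit should not touch cluster resources.
        return False
    ha_relation_changed()
    return True
```

With this shape, upgrading hacluster after the principal charm would still reach the reconfiguration code, while a paused unit would skip it.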

We should also update the OpenStack charm deployment guide to actually mention upgrading hacluster; right now it is not mentioned:
https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-upgrade-openstack.html#upgrade-order

* Reproduction steps *

(1) deploy a xenial-queens openstack cloud using nova-cloud-controller-316 with hacluster-50 (both before the VIP changes were merged in November 2018) - and wait for complete deployment.
(2) juju upgrade-charm nova-cloud-controller # wait for completion
(3) #observe "crm_mon" on nova-cloud-controller/0 has both res_nova_ens3_vip and res_nova_4c2a33d_vip
(4) juju upgrade-charm hacluster #wait for completion
(5) #observe duplicate VIPs still exist
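
The check in steps (3) and (5) could be automated along these lines. This is a rough sketch under stated assumptions: it parses text in the style of `crm configure show` output rather than `crm_mon`, the `SAMPLE` text and its IP address are made up for illustration, and `find_duplicate_vips` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Hypothetical excerpt in the style of `crm configure show` output.
# The resource names come from the steps above; the address is invented.
SAMPLE = """\
primitive res_nova_ens3_vip IPaddr2 \\
        params ip=10.0.0.100 nic=ens3 cidr_netmask=24
primitive res_nova_4c2a33d_vip IPaddr2 \\
        params ip=10.0.0.100
primitive res_nova_haproxy lsb:haproxy
"""


def find_duplicate_vips(config_text):
    """Map each IP address to the primitives using it, keeping only duplicates."""
    by_ip = defaultdict(list)
    current = None
    for line in config_text.splitlines():
        primitive = re.match(r"\s*primitive\s+(\S+)", line)
        if primitive:
            current = primitive.group(1)
        # Match an ip= parameter, with or without surrounding quotes.
        ip = re.search(r'\bip="?([0-9a-fA-F.:]+)', line)
        if ip and current:
            by_ip[ip.group(1)].append(current)
    return {ip: names for ip, names in by_ip.items() if len(names) > 1}


if __name__ == "__main__":
    for ip, names in find_duplicate_vips(SAMPLE).items():
        print(ip, "is configured on:", ", ".join(names))
```

An empty result would indicate that no IP address is claimed by more than one primitive.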

* Workaround *

You can manually run the ha-relation-changed hook, since it iterates over all relations and does not use the context of the currently changed relation.

juju run --application nova-cloud-controller-hacluster ./hooks/ha-relation-changed

Tags:

Trent Lloyd (lathiat) on 2020-03-05

tags:	added: sts
tags:	added: seg

Andrew McLeod (admcleod) on 2020-03-05

tags:	added: charm-upgrade
Changed in charm-hacluster:
status:	New → Triaged
importance:	Undecided → Medium

James Page (james-page) wrote on 2020-09-28:

This is a bit of a side effect of the change in behaviour in Juju to not run the config-changed hook after upgrade-charm if configuration has not actually changed.

Iterating the ha relation is fine as part of a charm upgrade hook event:

for rid in hookenv.relation_ids('ha'):
    ha_joined(rid)

(this code is executing during config-changed)

James Page (james-page) wrote on 2020-09-28:

The above was for nova-cc; the same is true in the hacluster charm as well.

OpenStack Infra (hudson-openstack) wrote on 2021-10-27: Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-hacluster/+/815755