Okay, I've reproduced the issue. Essentially, I used a small deployment script:
Every 5.0s: timeout 4 juju status -m cinder --color          Wed May 11 10:20:12 2022

Model   Controller            Cloud/Region             Version  SLA          Timestamp
cinder  tinwood2-serverstack  serverstack/serverstack  2.9.27   unsupported  10:20:12Z

App              Version  Status  Scale  Charm            Channel  Rev  Exposed  Message
cinder           19.0.0   active      3  cinder           stable   530  no       Unit is ready
keystone         17.0.1   active      1  keystone         stable   539  no       Application Ready
percona-cluster  5.7.20   active      1  percona-cluster  stable   302  no       Unit is ready
rabbitmq-server  3.8.2    active      1  rabbitmq-server  stable   123  no       Unit is ready

Unit                Workload  Agent      Machine  Public address  Ports     Message
cinder/0*           active    executing  3        10.5.3.43       8776/tcp  Unit is ready
cinder/1            active    executing  4        10.5.2.251      8776/tcp  Unit is ready
cinder/2            active    executing  5        10.5.2.67       8776/tcp  Unit is ready
keystone/0*         active    idle       0        10.5.1.134      5000/tcp  Unit is ready
percona-cluster/0*  active    idle       1        10.5.3.32       3306/tcp  Unit is ready
rabbitmq-server/0*  active    idle       2        10.5.3.182      5672/tcp  Unit is ready
- I started it at focal/distro for cinder and keystone.
- I then forced a leadership election to move the leader to a different unit (e.g. 0 -> 1).
- I then did an upgrade from distro (ussuri) -> victoria on cinder.
- Then I forced another leadership election from 1 -> 0.
- I did another upgrade (victoria -> wallaby) and it was okay.
- I then forced another leadership election to move the leader to cinder/2.
- I then did an upgrade from wallaby -> xena and triggered the issue (the cycle is sketched just below).
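Roughly, the force-election-then-upgrade cycle looks like this (a sketch only, not the exact script I used; it assumes a machine deployment where stopping the leader's jujud unit agent is enough to make the leadership lease expire, and that the upgrades are driven by changing openstack-origin):

#!/usr/bin/env python3
# Rough sketch of the reproduction loop (assumptions: machine deployment,
# leadership surrendered by stopping the leader's unit agent, upgrades
# driven via the cinder charm's openstack-origin option).
import json
import subprocess
import time


def juju(*args):
    return subprocess.check_output(('juju',) + args).decode()


def cinder_leader():
    # The leader unit is flagged with "leader": true in juju status JSON.
    status = json.loads(juju('status', 'cinder', '--format=json'))
    units = status['applications']['cinder']['units']
    return [name for name, data in units.items() if data.get('leader')][0]


def force_leadership_election():
    # Stop the current leader's unit agent so its leadership lease expires
    # and another unit gets elected, then bring the agent back up.
    leader = cinder_leader()
    service = 'jujud-unit-%s' % leader.replace('/', '-')
    juju('ssh', leader, '--', 'sudo', 'systemctl', 'stop', service)
    time.sleep(120)  # leases last roughly a minute
    juju('ssh', leader, '--', 'sudo', 'systemctl', 'start', service)


for pocket in ('cloud:focal-victoria', 'cloud:focal-wallaby',
               'cloud:focal-xena'):
    force_leadership_election()
    juju('config', 'cinder', 'openstack-origin=%s' % pocket)
    # (wait for the model to settle before the next round)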
The show-unit output for the 3 units shows that each one has been the leader and 'done' the upgrade:

cinder/0:
  cinder-db-initialised: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
  cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
  cinder-db-initialised: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
  cinder-db-initialised-echo: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
  cinder-db-initialised: cinder/2-71063595-9742-4950-bad6-6a1a8a5a8ab1
  cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
  ...
cinder/1:
  ...
cinder/2:
  ...

i.e. cinder-db-initialised for each unit is that unit's own id with a UUID.
However, as Drew says in the comments, the cinder-db-initialised-echo keeps bouncing around the units. In the above case, two agree (but this will change with the next hook).
The code in question is:
def check_local_db_actions_complete():
    """Check if we have received db init'd notification and restart services
    if we have not already.

    NOTE: this must only be called from peer relation context.
    """
    if not is_db_initialised():
        return

    settings = relation_get() or {}
    if settings:
        init_id = settings.get(CINDER_DB_INIT_RKEY)
        echoed_init_id = relation_get(unit=local_unit(),
                                      attribute=CINDER_DB_INIT_ECHO_RKEY)

        # If we have received an init notification from a peer unit
        # (assumed to be the leader) then restart cinder-* and echo the
        # notification and don't restart again unless we receive a new
        # (different) notification.
        if is_new_dbinit_notification(init_id, echoed_init_id):
            if not is_unit_paused_set():
                log("Restarting cinder services following db "
                    "initialisation", level=DEBUG)
                for svc in enabled_services():
                    service_restart(svc)

            # Echo notification
            relation_set(**{CINDER_DB_INIT_ECHO_RKEY: init_id})
What I think is happening is that the "init_id = settings.get(CINDER_DB_INIT_RKEY)" assignment is getting a different "cinder-db-initialised" depending on the unit.
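In a peer relation-changed hook, relation_get() with no unit argument returns the settings of whichever remote unit triggered the hook, and after the leader has moved around each unit is carrying its own cinder-db-initialised key, so init_id can legitimately differ from hook to hook. Something like the following makes that visible by logging what every unit on the relation currently advertises (a debugging sketch using the charmhelpers hookenv helpers, assuming the peer relation is named 'cluster'):

# Debugging sketch: log the init key as advertised by every unit on the
# peer relation, to see how the value returned by relation_get() depends
# on which remote unit triggered the hook.
from charmhelpers.core.hookenv import (
    local_unit,
    log,
    related_units,
    relation_get,
    relation_ids,
)

CINDER_DB_INIT_RKEY = 'cinder-db-initialised'


def dump_peer_init_ids():
    for rid in relation_ids('cluster'):
        for unit in related_units(rid) + [local_unit()]:
            init_id = relation_get(rid=rid, unit=unit,
                                   attribute=CINDER_DB_INIT_RKEY)
            log('%s advertises %s=%s' % (unit, CINDER_DB_INIT_RKEY, init_id))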
I'll debug that and work out how to fix it.
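One possible direction, just as a sketch of an idea rather than the eventual fix: track the echo per notifying unit, so that a stale cinder-db-initialised left behind by a previous leader can trigger at most one restart instead of flip-flopping the echo between values (the JSON echo format below is hypothetical, and the leader-side check that reads the echoes would need the same treatment):

# Sketch only: hypothetical per-source echo format (a JSON dict keyed by
# the notifying unit) instead of a single echoed value.
import json

def check_local_db_actions_complete():
    if not is_db_initialised():
        return

    settings = relation_get() or {}
    init_id = settings.get(CINDER_DB_INIT_RKEY)
    if not init_id:
        return

    # init ids look like 'cinder/1-<uuid>'; key the echo on the unit part.
    source = init_id.split('-', 1)[0]
    echoes = json.loads(relation_get(unit=local_unit(),
                                     attribute=CINDER_DB_INIT_ECHO_RKEY)
                        or '{}')
    if echoes.get(source) == init_id:
        # Already acted on (and echoed) this notification.
        return

    if not is_unit_paused_set():
        log("Restarting cinder services following db initialisation",
            level=DEBUG)
        for svc in enabled_services():
            service_restart(svc)

    echoes[source] = init_id
    relation_set(**{CINDER_DB_INIT_ECHO_RKEY: json.dumps(echoes)})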