Okay, I've reproduced the issue. Essentially, I used a small deployment script:
Every 5.0s: timeout 4 juju status -m cinder --color          Wed May 11 10:20:12 2022

Model   Controller            Cloud/Region             Version  SLA          Timestamp
cinder  tinwood2-serverstack  serverstack/serverstack  2.9.27   unsupported  10:20:12Z

App              Version  Status  Scale  Charm            Channel  Rev  Exposed  Message
cinder           19.0.0   active      3  cinder           stable   530  no       Unit is ready
keystone         17.0.1   active      1  keystone         stable   539  no       Application Ready
percona-cluster  5.7.20   active      1  percona-cluster  stable   302  no       Unit is ready
rabbitmq-server  3.8.2    active      1  rabbitmq-server  stable   123  no       Unit is ready

Unit                Workload  Agent      Machine  Public address  Ports     Message
cinder/0*           active    executing  3        10.5.3.43       8776/tcp  Unit is ready
cinder/1            active    executing  4        10.5.2.251      8776/tcp  Unit is ready
cinder/2            active    executing  5        10.5.2.67       8776/tcp  Unit is ready
keystone/0*         active    idle       0        10.5.1.134      5000/tcp  Unit is ready
percona-cluster/0*  active    idle       1        10.5.3.32       3306/tcp  Unit is ready
rabbitmq-server/0*  active    idle       2        10.5.3.182      5672/tcp  Unit is ready
- I started it at focal/distro for cinder and keystone.
- I then forced a leadership election to move the leader to a different unit (e.g. 0 -> 1).
- I then did an upgrade from distro (ussuri) -> victoria on cinder.
- Then I forced another leadership election from 1 -> 0.
- I did another upgrade (victoria -> wallaby) and it was okay.
- I then forced another leadership election to move the leader to cinder/2.
- I then did an upgrade from wallaby -> xena and triggered the issue (the cycle is sketched just below).
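Roughly, the force-election-then-upgrade cycle looks like this (a sketch only, not the exact script I used; it assumes a machine deployment where stopping the leader's jujud unit agent is enough to make the leadership lease expire, and that the upgrades are driven by changing openstack-origin):

#!/usr/bin/env python3
# Rough sketch of the reproduction loop (assumptions: machine deployment,
# leadership surrendered by stopping the leader's unit agent, upgrades
# driven via the cinder charm's openstack-origin option).
import json
import subprocess
import time


def juju(*args):
    return subprocess.check_output(('juju',) + args).decode()


def cinder_leader():
    # The leader unit is flagged with "leader": true in juju status JSON.
    status = json.loads(juju('status', 'cinder', '--format=json'))
    units = status['applications']['cinder']['units']
    return [name for name, data in units.items() if data.get('leader')][0]


def force_leadership_election():
    # Stop the current leader's unit agent so its leadership lease expires
    # and another unit gets elected, then bring the agent back up.
    leader = cinder_leader()
    service = 'jujud-unit-%s' % leader.replace('/', '-')
    juju('ssh', leader, '--', 'sudo', 'systemctl', 'stop', service)
    time.sleep(120)  # leases last roughly a minute
    juju('ssh', leader, '--', 'sudo', 'systemctl', 'start', service)


for pocket in ('cloud:focal-victoria', 'cloud:focal-wallaby',
               'cloud:focal-xena'):
    force_leadership_election()
    juju('config', 'cinder', 'openstack-origin=%s' % pocket)
    # (wait for the model to settle before the next round)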
The show-unit output for the 3 units shows that each one has been the leader and 'done' the upgrade:

cinder/0:
  cinder-db-initialised: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
  cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
  cinder-db-initialised: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
  cinder-db-initialised-echo: cinder/0-c19dc67e-ee4c-4753-9868-be0e8efa36da
  cinder-db-initialised: cinder/2-71063595-9742-4950-bad6-6a1a8a5a8ab1
  cinder-db-initialised-echo: cinder/1-9717e388-8b09-4976-9f0f-4690ee1203f2
  ...
cinder/1:
  ...
cinder/2:
  ...

i.e. cinder-db-initialised for each unit is that unit's own id with a UUID.
However, as Drew says in the comments, the cinder-db-initialised-echo keeps bouncing around the units. In the above case, two agree (but this will change with the next hook).
The code in question is:
def check_local_db_actions_complete():
    """Check if we have received db init'd notification and restart services
    if we have not already.

    NOTE: this must only be called from peer relation context.
    """
    if not is_db_initialised():
        return

    settings = relation_get() or {}
    if settings:
        init_id = settings.get(CINDER_DB_INIT_RKEY)
        echoed_init_id = relation_get(unit=local_unit(),
                                      attribute=CINDER_DB_INIT_ECHO_RKEY)

        # If we have received an init notification from a peer unit
        # (assumed to be the leader) then restart cinder-* and echo the
        # notification and don't restart again unless we receive a new
        # (different) notification.
        if is_new_dbinit_notification(init_id, echoed_init_id):
            if not is_unit_paused_set():
                log("Restarting cinder services following db "
                    "initialisation", level=DEBUG)
                for svc in enabled_services():
                    service_restart(svc)

            # Echo notification
            relation_set(**{CINDER_DB_INIT_ECHO_RKEY: init_id})
What I think is happening is that the "init_id = settings.get(CINDER_DB_INIT_RKEY)" assignment is getting a different "cinder-db-initialised" depending on the unit.
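In a peer relation-changed hook, relation_get() with no unit argument returns the settings of whichever remote unit triggered the hook, and after the leader has moved around each unit is carrying its own cinder-db-initialised key, so init_id can legitimately differ from hook to hook. Something like the following makes that visible by logging what every unit on the relation currently advertises (a debugging sketch using the charmhelpers hookenv helpers, assuming the peer relation is named 'cluster'):

# Debugging sketch: log the init key as advertised by every unit on the
# peer relation, to see how the value returned by relation_get() depends
# on which remote unit triggered the hook.
from charmhelpers.core.hookenv import (
    local_unit,
    log,
    related_units,
    relation_get,
    relation_ids,
)

CINDER_DB_INIT_RKEY = 'cinder-db-initialised'


def dump_peer_init_ids():
    for rid in relation_ids('cluster'):
        for unit in related_units(rid) + [local_unit()]:
            init_id = relation_get(rid=rid, unit=unit,
                                   attribute=CINDER_DB_INIT_RKEY)
            log('%s advertises %s=%s' % (unit, CINDER_DB_INIT_RKEY, init_id))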
I'll debug that and work out how to fix it.
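One possible direction, just as a sketch of an idea rather than the eventual fix: track the echo per notifying unit, so that a stale cinder-db-initialised left behind by a previous leader can trigger at most one restart instead of flip-flopping the echo between values (the JSON echo format below is hypothetical, and the leader-side check that reads the echoes would need the same treatment):

# Sketch only: hypothetical per-source echo format (a JSON dict keyed by
# the notifying unit) instead of a single echoed value.
import json

def check_local_db_actions_complete():
    if not is_db_initialised():
        return

    settings = relation_get() or {}
    init_id = settings.get(CINDER_DB_INIT_RKEY)
    if not init_id:
        return

    # init ids look like 'cinder/1-<uuid>'; key the echo on the unit part.
    source = init_id.split('-', 1)[0]
    echoes = json.loads(relation_get(unit=local_unit(),
                                     attribute=CINDER_DB_INIT_ECHO_RKEY)
                        or '{}')
    if echoes.get(source) == init_id:
        # Already acted on (and echoed) this notification.
        return

    if not is_unit_paused_set():
        log("Restarting cinder services following db initialisation",
            level=DEBUG)
        for svc in enabled_services():
            service_restart(svc)

    echoes[source] = init_id
    relation_set(**{CINDER_DB_INIT_ECHO_RKEY: json.dumps(echoes)})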