Concurrent migration of VMs with the same multiattach volume fails

Bug #1968944 reported by rune32bit
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Low
Assigned to: Unassigned

Bug Description

Steps to reproduce (a scripted sketch follows):
1. Create multiple VMs.
2. Create a multiattach volume.
3. Attach the volume to all of the VMs.
4. Shut down all of the VMs, then migrate them all at the same time.
5. Observe that one of the VM migrations fails.
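A scripted version of these steps, for reference. This is only a sketch: it assumes two pre-existing VMs named vm1 and vm2 and a volume type named "multiattach" with multiattach support enabled, and it drives the standard openstack CLI through subprocess.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def openstack(*args):
    # Thin wrapper around the openstack CLI; raises if a command fails.
    subprocess.run(('openstack',) + args, check=True)

VMS = ('vm1', 'vm2')  # assumed pre-existing VMs

# Steps 2-3: create a multiattach volume and attach it to every VM.
# (In a real run, wait for the volume to become available before attaching.)
openstack('volume', 'create', '--size', '1', '--type', 'multiattach', 'vol1')
for vm in VMS:
    openstack('server', 'add', 'volume', vm, 'vol1')

# Step 4: shut the VMs down, then cold-migrate them all concurrently.
for vm in VMS:
    openstack('server', 'stop', vm)
with ThreadPoolExecutor() as pool:
    list(pool.map(lambda vm: openstack('server', 'migrate', vm), VMS))

# Step 5: one of the migrations may now fail with the error shown below.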

The nova-compute log is as follows:
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [req-95d6268a-95eb-4ea2-98e0-a9e973b8f19c cb6c975e503c4b1ca741f64a42d09d50 68dd5eeecb434da0aa5ebcdda19a8db6 - default default] [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] Setting instance vm_state to ERROR: nova.exception.InvalidInput: Invalid input received: Invalid volume: Volume e269257b-831e-4be0-a1e6-fbb2aac922a6 status must be available or in-use or downloading to reserve, but the current status is attaching. (HTTP 400) (Request-ID: req-3515d919-aee2-40f4-887e-d5abb34a9d2e)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] Traceback (most recent call last):
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/nova/volume/cinder.py", line 396, in wrapper
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] res = method(self, ctx, *args, **kwargs)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/nova/volume/cinder.py", line 432, in wrapper
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] res = method(self, ctx, volume_id, *args, **kwargs)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/nova/volume/cinder.py", line 807, in attachment_create
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] instance_uuid=instance_id)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] self.force_reraise()
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] raise self.value
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/nova/volume/cinder.py", line 795, in attachment_create
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] volume_id, _connector, instance_id)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/cinderclient/api_versions.py", line 423, in substitution
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] return method.func(obj, *args, **kwargs)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/cinderclient/v3/attachments.py", line 39, in create
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] retval = self._create('/attachments', body, 'attachment')
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/cinderclient/base.py", line 300, in _create
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] resp, body = self.api.client.post(url, body=body)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/cinderclient/client.py", line 217, in post
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] return self._cs_request(url, 'POST', **kwargs)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/cinderclient/client.py", line 205, in _cs_request
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] return self.request(url, method, **kwargs)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] File "/usr/local/lib/python3.6/site-packages/cinderclient/client.py", line 191, in request
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] raise exceptions.from_response(resp, body)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] cinderclient.exceptions.BadRequest: Invalid volume: Volume e269257b-831e-4be0-a1e6-fbb2aac922a6 status must be available or in-use or downloading to reserve, but the current status is attaching. (HTTP 400) (Request-ID: req-3515d919-aee2-40f4-887e-d5abb34a9d2e)
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d]
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d] During handling of the above exception, another exception occurred:
2022-04-11 16:49:46.685 23871 ERROR nova.compute.manager [instance: 17fc694e-284a-43f0-b6c6-c640a02db23d]

VM migration process (a condensed sketch of these calls follows):
1. On the source node, call Cinder's attachment_create; this sets the multiattach volume's status to "reserved".
2. The VM performs the migration operation.
3. On the destination node, call Cinder's attachment_update to fill in the connection_info of the attachment record; this sets the volume's status to "attaching".
4. On the destination node, call Cinder's attachment_complete; this sets the volume's status to "in-use".
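For reference, a condensed sketch of these calls made directly with python-cinderclient (attachment complete requires block storage API microversion 3.44 or later; the credentials and connector dict below are illustrative placeholders, and the IDs are taken from the log above):

from keystoneauth1 import identity, session
from cinderclient import client as cinder_client

auth = identity.Password(auth_url='http://controller:5000/v3',  # placeholder
                         username='admin', password='secret',
                         project_name='admin',
                         user_domain_id='default', project_domain_id='default')
cinder = cinder_client.Client('3.44', session=session.Session(auth=auth))

volume_id = 'e269257b-831e-4be0-a1e6-fbb2aac922a6'        # from the log above
instance_uuid = '17fc694e-284a-43f0-b6c6-c640a02db23d'    # from the log above
dest_connector = {'host': 'dest-node', 'ip': '10.0.0.2'}  # Nova builds this via os-brick

# Step 1 (source node): no connector yet -> volume status becomes 'reserved'.
attachment = cinder.attachments.create(volume_id, None, instance_uuid)
attachment_id = attachment['id']  # recent cinderclient returns the attachment as a dict

# Step 3 (destination node): supply the destination connector -> Cinder fills
# in connection_info and the volume status becomes 'attaching'.
cinder.attachments.update(attachment_id, dest_connector)

# Step 4 (destination node): finish -> volume status becomes 'in-use'.
cinder.attachments.complete(attachment_id)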

The migration fails because of a race between these two flows. After step 3, the multiattach volume's status is "attaching". If another VM's migration reaches step 1 in that window, the conditional update in Cinder's _attachment_reserve (from cinder/volume/api.py, quoted below) fails, and the override that normally allows re-reserving during a migration is only consulted when the status is "in-use" or "reserved", so attachment_create raises InvalidVolume.
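To make the interleaving concrete, here is a toy replay in plain Python. Volume.conditional_update stands in for Cinder's atomic compare-and-swap on the volume row, and attachment_reserve only mirrors the shape of the real _attachment_reserve quoted below; none of this is actual Cinder code.

class Volume:
    """Toy stand-in for a multiattach Cinder volume row."""
    def __init__(self, status, attached_instances):
        self.status = status
        self.attached_instances = attached_instances

    def conditional_update(self, new_status, expected):
        # Stand-in for Cinder's atomic compare-and-swap UPDATE.
        if self.status in expected:
            self.status = new_status
            return True
        return False

def attachment_reserve(vol, instance_uuid):
    if vol.conditional_update('reserved',
                              ('available', 'in-use', 'downloading')):
        return
    # The migration override (LP bug 1694530) is only consulted while the
    # volume is 'in-use' or 'reserved' -- never while it is 'attaching'.
    if (vol.status in ('in-use', 'reserved')
            and instance_uuid in vol.attached_instances):
        return
    raise RuntimeError('Volume status must be available or in-use or '
                       'downloading to reserve, but the current status '
                       'is %s.' % vol.status)

vol = Volume('in-use', attached_instances={'vm-a', 'vm-b'})
attachment_reserve(vol, 'vm-a')                      # VM A, step 1: -> 'reserved'
vol.conditional_update('attaching', ('reserved',))   # VM A, step 3: -> 'attaching'
attachment_reserve(vol, 'vm-b')                      # VM B, step 1: raises (the bug)

Had VM B reserved before VM A reached step 3, the override would have let it through; the failure only occurs in the window where the volume is "attaching".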

def _attachment_reserve(self, ctxt, vref, instance_uuid=None):
    # NOTE(jdg): Reserved is a special case, we're avoiding allowing
    # creation of other new reserves/attachments while in this state
    # so we avoid contention issues with shared connections

    # Multiattach of bootable volumes is a special case with it's own
    # policy, check that here right off the bat
    if (vref.get('multiattach', False) and
            vref.status == 'in-use' and
            vref.bootable):
        ctxt.authorize(
            attachment_policy.MULTIATTACH_BOOTABLE_VOLUME_POLICY,
            target_obj=vref)

    # FIXME(JDG): We want to be able to do things here like reserve a
    # volume for Nova to do BFV WHILE the volume may be in the process of
    # downloading image, we add downloading here; that's easy enough but
    # we've got a race between with the attaching/detaching that we do
    # locally on the Cinder node. Just come up with an easy way to
    # determine if we're attaching to the Cinder host for some work or if
    # we're being used by the outside world.
    expected = {'multiattach': vref.multiattach,
                'status': (('available', 'in-use', 'downloading')
                           if vref.multiattach
                           else ('available', 'downloading'))}

    result = vref.conditional_update({'status': 'reserved'}, expected)

    if not result:
        override = False
        if instance_uuid and vref.status in ('in-use', 'reserved'):
            # Refresh the volume reference in case multiple instances were
            # being concurrently attached to the same non-multiattach
            # volume.
            vref = objects.Volume.get_by_id(ctxt, vref.id)
            for attachment in vref.volume_attachment:
                # If we're attaching the same volume to the same instance,
                # we could be migrating the instance to another host in
                # which case we want to allow the reservation.
                # (LP BUG: 1694530)
                if attachment.instance_uuid == instance_uuid:
                    override = True
                    break

        if not override:
            msg = (_('Volume %(vol_id)s status must be %(statuses)s to '
                     'reserve, but the current status is %(current)s.') %
                   {'vol_id': vref.id,
                    'statuses': utils.build_or_str(expected['status']),
                    'current': vref.status})
            raise exception.InvalidVolume(reason=msg)

    values = {'volume_id': vref.id,
              'volume_host': vref.host,
              'attach_status': 'reserved',
              'instance_uuid': instance_uuid}
    db_ref = self.db.volume_attach(ctxt.elevated(), values)
    return objects.VolumeAttachment.get_by_id(ctxt, db_ref['id'])
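A fix has been proposed (see the review linked in the comments below). Purely to illustrate the shape of a possible change, and emphatically not the proposed patch, the migration override above could also be consulted while the volume is "attaching":

if not result:
    override = False
    # Hypothetical tweak for illustration only: include 'attaching' so a
    # concurrent migration of another instance that already has an
    # attachment on this multiattach volume can still reserve.
    if instance_uuid and vref.status in ('in-use', 'reserved', 'attaching'):
        vref = objects.Volume.get_by_id(ctxt, vref.id)
        for attachment in vref.volume_attachment:
            if attachment.instance_uuid == instance_uuid:
                override = True
                break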

rune32bit (none2021)
tags: added: migration multiattach
Sylvain Bauza (sylvain-bauza) wrote:

Seems valid indeed; we somehow need to ensure the concurrency mechanism verifies the correct attachment status.

tags: added: volumes
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/880921

Changed in nova:
status: Confirmed → In Progress