[OSSA-2023-003] Unauthorized volume access through deleted volume attachments (CVE-2023-2088)

Bug #2004555 reported by Jan Wasilewski
This bug affects 2 people
Affects                      Status         Importance  Assigned to     Milestone
Cinder                       Fix Released   Undecided   Unassigned
OpenStack Compute (nova)     Fix Released   Undecided   Unassigned
  Antelope                   Fix Released   Undecided   Unassigned
  Wallaby                    Fix Committed  Undecided   Unassigned
  Xena                       Fix Committed  Undecided   Unassigned
  Yoga                       Fix Released   Undecided   Unassigned
  Zed                        Fix Released   Undecided   Unassigned
OpenStack Security Advisory  Fix Released   High        Jeremy Stanley
OpenStack Security Notes     Fix Released   High        Jeremy Stanley
glance_store                 Fix Released   Undecided   Unassigned
kolla-ansible                In Progress    Undecided   Unassigned
  Zed                        Fix Released   Undecided   Unassigned
os-brick                     In Progress    Undecided   Unassigned

Bug Description

Hello OpenStack Security Team,

I’m writing to you because we faced a serious security breach in OpenStack functionality (correlated to some extent with libvirt, iSCSI and the Huawei driver). I went through the OSSA documents and related libvirt notes, but I couldn't find anything similar. It is not related to https://security.openstack.org/ossa/OSSA-2020-006.html

In short: we observed that a newly created Cinder volume (1GB in size) was attached to a compute node instance, but the instance recognized it as a 115GB volume, which (this 115GB volume) in fact was connected to another instance on the same compute node.

[1. Test environment]
Compute node: OpenStack Ussuri configured with Huawei Dorado as a storage backend (the driver configuration is available here: https://docs.openstack.org/cinder/rocky/configuration/block-storage/drivers/huawei-storage-driver.html)
Packages:
# dpkg -l | grep libvirt
ii libvirt-clients 6.0.0-0ubuntu8.16 amd64 Programs for the libvirt library
ii libvirt-daemon 6.0.0-0ubuntu8.16 amd64 Virtualization daemon
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.16 amd64 Virtualization daemon QEMU connection driver
ii libvirt-daemon-driver-storage-rbd 6.0.0-0ubuntu8.16 amd64 Virtualization daemon RBD storage driver
ii libvirt-daemon-system 6.0.0-0ubuntu8.16 amd64 Libvirt daemon configuration files
ii libvirt-daemon-system-systemd 6.0.0-0ubuntu8.16 amd64 Libvirt daemon configuration files (systemd)
ii libvirt0:amd64 6.0.0-0ubuntu8.16 amd64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node libvirt support
ii python3-libvirt 6.1.0-1 amd64 libvirt Python 3 bindings

# dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-20190109.133f4c4-0ubuntu3.2 all PXE boot firmware - ROM images for qemu
ii ipxe-qemu-256k-compat-efi-roms 1.0.0+git-20150424.a25a16d-0ubuntu4 all PXE boot firmware - Compat EFI ROM images for qemu
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.16 amd64 Virtualization daemon QEMU connection driver
ii qemu 1:4.2-3ubuntu6.23 amd64 fast processor emulator, dummy package
ii qemu-block-extra:amd64 1:4.2-3ubuntu6.23 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:4.2-3ubuntu6.23 amd64 QEMU Full virtualization on x86 hardware
ii qemu-system-common 1:4.2-3ubuntu6.23 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-data 1:4.2-3ubuntu6.23 all QEMU full system emulation (data files)
ii qemu-system-gui:amd64 1:4.2-3ubuntu6.23 amd64 QEMU full system emulation binaries (user interface and audio support)
ii qemu-system-x86 1:4.2-3ubuntu6.23 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:4.2-3ubuntu6.23 amd64 QEMU utilities

# dpkg -l | grep nova
ii nova-common 2:21.2.4-0ubuntu1 all OpenStack Compute - common files
ii nova-compute 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:21.2.4-0ubuntu1 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:17.0.0-0ubuntu1 all client library for OpenStack Compute API - 3.x

# dpkg -l | grep multipath
ii multipath-tools 0.8.3-1ubuntu2 amd64 maintain multipath block device access

# dpkg -l | grep iscsi
ii libiscsi7:amd64 1.18.0-2 amd64 iSCSI client shared library
ii open-iscsi 2.0.874-7.1ubuntu6.2 amd64 iSCSI initiator tools

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"

Instance OS: Debian-11-amd64

[2. Test scenario]
An instance was already created with two volumes attached: the first a 10GB root volume, the second a 115GB volume used as vdb. The compute node recognizes them as vda - dm-11 and vdb - dm-9:

# virsh domblklist 90fas439-fc0e-4e22-8d0b-6f2a18eee5c1
 Target Source
----------------------
 vda /dev/dm-11
 vdb /dev/dm-9

# multipath -ll
(...)
36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:4 sdl 8:176 active ready running
  `- 17:0:0:4 sdn 8:208 active ready running
(...)
36e00084100ee7e7ed6acaa2900002f6a dm-11 HUAWEI,XSG1
size=10G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:3 sdq 65:0 active ready running
  |- 15:0:0:3 sdr 65:16 active ready running
  |- 16:0:0:3 sdp 8:240 active ready running
  `- 17:0:0:3 sds 65:32 active ready running

Then we create a new instance with the same guest OS and a 10GB root volume. After successful deployment, we create a new 1GB volume and attach it to the newly created instance. Afterwards we can see:

# multipath -ll
(...)
36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:10 sdao 66:128 failed faulty running
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:10 sdap 66:144 failed faulty running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:10 sdan 66:112 failed faulty running
  |- 16:0:0:4 sdl 8:176 active ready running
  |- 17:0:0:10 sdaq 66:160 failed faulty running
  `- 17:0:0:4 sdn 8:208 active ready running

This way the instance saw the new drive not as 1GB but as 115GB, so it seems it was incorrectly attached, and this way we were able to destroy some data on that volume.

Additionally, we were able to see many errors like this in the compute node logs:

# dmesg -T | grep dm-9
[Fri Jan 27 13:37:42 2023] blk_update_request: critical target error, dev dm-9, sector 62918760 op 0x1:(WRITE) flags 0x8800 phys_seg 2 prio class 0
[Fri Jan 27 13:37:42 2023] blk_update_request: critical target error, dev dm-9, sector 33625152 op 0x1:(WRITE) flags 0x8800 phys_seg 6 prio class 0
[Fri Jan 27 13:37:46 2023] blk_update_request: critical target error, dev dm-9, sector 66663000 op 0x1:(WRITE) flags 0x8800 phys_seg 5 prio class 0
[Fri Jan 27 13:37:46 2023] blk_update_request: critical target error, dev dm-9, sector 66598120 op 0x1:(WRITE) flags 0x8800 phys_seg 5 prio class 0
[Fri Jan 27 13:37:51 2023] blk_update_request: critical target error, dev dm-9, sector 66638680 op 0x1:(WRITE) flags 0x8800 phys_seg 12 prio class 0
[Fri Jan 27 13:37:56 2023] blk_update_request: critical target error, dev dm-9, sector 66614344 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[Fri Jan 27 13:37:56 2023] blk_update_request: critical target error, dev dm-9, sector 66469296 op 0x1:(WRITE) flags 0x8800 phys_seg 24 prio class 0
[Fri Jan 27 13:37:56 2023] blk_update_request: critical target error, dev dm-9, sector 66586472 op 0x1:(WRITE) flags 0x8800 phys_seg 3 prio class 0
(...)

Unfortunately we do not know what the exact reproduction scenario is, as we hit this issue in fewer than 2% of our tries, but it looks like a serious security breach.

Additionally, we observed that the Linux kernel does not fully clear device allocations after a volume detach, so some of the drive names remain visible in output such as that of the lsblk command. Then, after a new volume attachment, we can see those names (e.g. sdao, sdap, sdan and so on) being reused by the new drive and wrongly mapped by multipath/iSCSI to another drive, and this is how we hit the issue.
Our question is: why does the Linux kernel on the compute node not remove the device allocations, leading to a scenario like this? Maybe a solution lies there.

Thanks in advance for your help and understanding. In case more details are needed, do not hesitate to contact me.

CVE References

CVE-2023-2088

Revision history for this message
Jeremy Stanley (fungi) wrote :

Since this report concerns a possible security risk, an incomplete
security advisory task has been added while the core security
reviewers for the affected project or projects confirm the bug and
discuss the scope of any vulnerability along with potential
solutions.

description: updated
Changed in ossa:
status: New → Incomplete
Revision history for this message
Dan Smith (danms) wrote :

I feel like this is almost certainly something that will require involvement from the cinder people. Nova's part in the volume attachment is pretty minimal, in that we get stuff from cinder, pass it to brick, and then configure the guest with the block device we're told (AFAIK). Unless we're messing up the last step, I think it's likely this is not just a Nova thing. Should we add cinder or brick as an affected project or just add some cinder people to the bug here?

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

> Should we add cinder or brick as an affected project or just add some cinder people to the bug here?

I'd be in favor of adding the cinder project which would pull the cinder coresec team, right?

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

In the meantime, could you please provide the block device mapping information that's stored in the DB and, ideally, the cinder-side attachment information?

Setting the bug report to Incomplete; please mark its status back to New when you reply.

Changed in nova:
status: New → Incomplete
Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

Hi,

below you can find the requested information from the OpenStack DB. There is no issue right now, but maybe the historical tracking could lead to some hint? Anyway, the issue was related to the /dev/vdb drive of instance 128f1398-a7c5-48f8-8bbc-a132e3e2d556 -> in the DB output you can observe that the volume size is 15GB, while from the instance it was reported as 115GB (i.e. the vdb of the second instance presented in this output).

mysql> select * from block_device_mapping where instance_uuid = '90fda439-fc0e-4e22-8d0b-6f2a18eeb9c1';
(output truncated; the row's columns include created_at, updated_at, deleted_at, id, device_name, delete_on_termination, snapshot_id, volume_id, volume_size, no_device, connection_info, instance_uuid, deleted, source_type, destination_type, guest_format, device_type, disk_bus, boot_index, image_id, ta...)

Changed in nova:
status: Incomplete → New
Revision history for this message
Jeremy Stanley (fungi) wrote :

I've added Cinder as an affected project (though maybe it should be os-brick?) and subscribed the Cinder security reviewers for additional input.

Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi,

Based on the given information, the strange part is that the same multipath device, 36e00084100ee7e7ed6ad25d900002f6b, is used for both the old and the new volume:

36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:4 sdl 8:176 active ready running
  `- 17:0:0:4 sdn 8:208 active ready running

36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:10 sdao 66:128 failed faulty running
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:10 sdap 66:144 failed faulty running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:10 sdan 66:112 failed faulty running
  |- 16:0:0:4 sdl 8:176 active ready running
  |- 17:0:0:10 sdaq 66:160 failed faulty running
  `- 17:0:0:4 sdn 8:208 active ready running

Also it's interesting to note that the paths under the multipath device (sdm, sdo, sdl, sdn) with LUN ID 4 are also used by the second multipath device, whereas it should use the LUN 10 paths (which are currently in failed faulty status).

This looks multipath related, but it would be helpful if we could get the os-brick logs for this 1GB volume attachment to understand whether os-brick is doing something that results in this.

I would also recommend cleaning up the system of any leftover devices from past failed detachments (i.e. flush and remove mpath devices not belonging to any instance) that might be interfering with this. Although I'm not certain that's the case, it's still good to clean up those devices.
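
For illustration, one way such a manual cleanup could be scripted is sketched below. This is not os-brick code; the helper name and the example arguments are hypothetical, it needs root, and it must only be run against devices confirmed to be leftovers:

    # Hedged sketch: flush a leftover multipath device and remove its member
    # SCSI devices via sysfs. Use only on devices confirmed to be leftovers.
    import os
    import subprocess

    def remove_leftover_multipath(wwid, member_devices):
        # Flush and remove the multipath device mapper entry.
        subprocess.run(['multipath', '-f', wwid], check=True)
        for dev in member_devices:
            delete_path = '/sys/block/%s/device/delete' % dev
            if os.path.exists(delete_path):
                # Ask the SCSI layer to drop the stale device node.
                with open(delete_path, 'w') as f:
                    f.write('1')

    # Example call with hypothetical values:
    # remove_leftover_multipath('<leftover-wwid>', ['sdao', 'sdap', 'sdan', 'sdaq'])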

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi,

I think I know what happened, but there are some things that don't match unless
somebody has manually changed some things in the host (like cleaning up
multipaths).

Bit of context:

- SCSI volumes (iSCSI and FC) on Linux are NEVER removed automatically by the
  kernel and must always be removed explicitly. This means that they will
  remain in the system even if the remote connection is severed, unless
  something in OpenStack removes it.

- The os-brick library has a strong policy of not removing devices from the
  system if flushing fails during detach, to prevent data loss.

  The `disconnect_volume` method in the os-brick library has an additional
  parameter called `force` that allows callers to ignore flushing errors and
  ensure that the devices are removed. This is useful when, after a failed
  detach, the volume is either going to be deleted or put into error status
  (see the sketch below).
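
For illustration, a minimal sketch of such a call follows (this is not code from any of the fixes on this bug; the connector protocol and the property/device values are placeholders):

    # Hedged sketch: obtain an iSCSI connector and force-disconnect a volume.
    from os_brick.initiator import connector

    conn = connector.InitiatorConnector.factory(
        'iscsi', 'sudo', use_multipath=True)

    # Placeholder values; in Nova these come from the Cinder attachment and
    # from the result of the earlier connect_volume() call.
    connection_properties = {'target_portal': '192.0.2.10:3260',
                             'target_iqn': 'iqn.2004-04.com.example:target',
                             'target_lun': 10}
    device_info = {'type': 'block', 'path': '/dev/dm-9'}

    # force=True ignores flush failures so that no leftover devices remain on
    # the host; ignore_errors=True keeps the cleanup going past other errors.
    conn.disconnect_volume(connection_properties, device_info,
                           force=True, ignore_errors=True)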

I don't have the logs, but from what you said my guess is that this is what has
happened:

- Volume with SCSI ID 36e00084100ee7e7ed6ad25d900002f6b was attached to that
  host on LUN 10 at some point since the last reboot (sdao, sdap, sdan, sdaq).

- When detaching the volume from the host using os-brick the operation failed
  and it wasn't removed, yet Nova still called Cinder to unexport and unmap the
  volume. At this point LUN 10 is free on the Huawei array and the volume is
  no longer attachable, but /dev/sda[o-q] are still present, and their SCSI IDs
  are still known to multipathd.

- Nova asked Cinder to attach the volume again, and the volume is mapped to LUN
  4 (which must have been available as well) and it successfully attaches (sdm,
  sdo, sdl, sdn), appears as a multipath, and is used by the VM.

- Nova asks Cinder to export and map the new 1GB volume, and Huawei maps it to
  LUN 10, at this point iSCSI detects that the remote LUNs are back and
  reconnects to them, which makes the multipathd path checker detect sdao,
  sdap, sdan, sdaq are alive on the compute host and they are added to the
  existing multipath device mapper using their known SCSI ID.

You should find out why the detach actually failed, but I think I see multiple
issues:

- Nova:

  - Should not call Cinder to unmap a volume if the os-brick call to disconnect the
    volume has failed, as we know this will leave leftover devices that can
    cause issues like this.

  - If it's not already doing it, Nova should call the disconnect_volume method
    from os-brick passing force=True when the volume is going to be deleted.

- os-brick:

  - Should try to detect when the newly added devices are being added to a
    multipath device mapper that has live paths to other LUNs and fail if that
    is the case.

  - As an improvement over the previous check, os-brick could forcefully remove
    those devices that are in the wrong device mapper, force a refresh of their
    SCSI IDs and add them back to multipathd to form a new device mapper.
    Though personally I think this is a non-trivial and potentially problematic
    feature.

In other words, the source of the problem is probably Nova, but os-brick should
try to prevent these possible data leaks.

Cheers,
Gorka.

[1]: https://github.com/opens...


Revision history for this message
Dan Smith (danms) wrote :

I don't see in the test scenario description that any instances had to be deleted or volumes disconnected for this to happen. Maybe the reporter can confirm with logs if this is the case?

I'm still chasing down the nova calls, but we don't ignore anything in the actual disconnect other than "volume not found". I need to follow that up to where we call cinder to see if we're ignoring a failure.

When you say "nova should call disconnect_volume with force=true if the volume is going to be deleted... I'm not sure what you mean by this. Do you mean if we're disconnecting because of *instance* delete and are sure that we don't want to let a failure hold us up? I would think this would be dangerous because just deleting an instance doesn't mean you don't care about the data in the volume.

It seems to me that if brick *has* the information available to it to avoid connecting a volume to the wrong location, that it's the thing that needs to guard against this. Nova has no knowledge of the things underneath brick, so we don't know that wires are going to get crossed. Obviously if we can do stuff to avoid even getting there, then we should.

Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

Hi,

I'm just wondering whether I should try to reproduce the issue again with all debug flags turned on. Should I enable debug on the controllers (cinder, nova), or would compute node logs (with debug enabled) be enough to troubleshoot this further? If so, please let me know which flags are needed here, just to speed up further troubleshooting. As I said, this case is not easy to reproduce - I can't even say what the trigger is - but we have faced it 3 or 4 times already.

Thanks in advance for your reply and your help so far.

Best regards,
Jan

Revision history for this message
Gorka Eguileor (gorka) wrote :

Apologies if I wasn't clear enough.

The disconnect call that I said is probably being ignored/swallowed is the one to os-brick, not Cinder. In other words, Nova first calls os-brick to disconnect the volume from the compute host and then always treats this as successful (at least in some scenarios, probably instance destruction). Since in those scenarios it always considers the local disconnect successful, it calls Cinder to unmap/unexport the volume.

The force=True parameter to os-brick's disconnect_volume should only be added when the BDM for the volume has the delete on disconnect flag thingy.

OS-Brick has the information, the problem is that multipathd is the one that is adding the leftover devices that have been reused to the multipath device mapper.

Revision history for this message
Gorka Eguileor (gorka) wrote :

A solution/workaround would be to change /etc/multipath.conf and set "recheck_wwid" to yes.

I haven't actually tested it myself, but the documentation explicitly calls out that it's used to solve this specific issue: "If set to yes, when a failed path is restored, the multipathd daemon rechecks the path WWID. If there is a change in the WWID, the path is removed from the current multipath device, and added again as a new path. The multipathd daemon also checks the path WWID again if it is manually re-added."
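
For reference, a minimal /etc/multipath.conf fragment enabling it could look like the following (illustrative only; the rest of the configuration is deployment specific, and per the discussion below the keyword is only accepted by sufficiently recent multipath-tools versions):

    defaults {
        # Re-check the WWID of a restored path before re-adding it to an
        # existing multipath device (needs a multipath-tools version that
        # accepts this keyword).
        recheck_wwid yes
    }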

I believe this is probably something that is best fixed at the deployment tool level. For example, extending the multipathing THT template code [1] to support "recheck_wwid" and defaulting it to yes instead of no as multipath.conf does.

[1]: https://opendev.org/openstack/tripleo-heat-templates/commit/906d03ea19a4446ed198c321f68791b7fa6e0c47

Revision history for this message
Dan Smith (danms) wrote :

Okay, thanks for the clarification.

Yeah, recheck_wwid seems like it should *always* be on to prevent potentially reconnecting to the wrong thing!

Revision history for this message
Jeremy Stanley (fungi) wrote :

If that configuration ends up being the recommended solution, we might want to consider drafting a brief security note with guidance for deployers and maintainers of deployment tooling.

Unless I misunderstand the conditions necessary, it sounds like it would be challenging for a malicious user to force this problem to occur. Is that the current thinking? If so, we could probably safely work on the actual text of the note in public.

Revision history for this message
melanie witt (melwitt) wrote :

> The disconnect call that I said is probably being ignored/swallowed is the one to os-brick, not Cinder. In other words, Nova first calls os-brick to disconnect the volume from the compute host and then always treats this as successful (at least in some scenarios, probably instance destruction). Since in those scenarios it always considers the local disconnect successful, it calls Cinder to unmap/unexport the volume.

I just checked and indeed Nova will ignore a volume disconnect error in the case of an instance being deleted [1]:

    try:
        self._disconnect_volume(context, connection_info, instance)
    except Exception as exc:
        with excutils.save_and_reraise_exception() as ctxt:
            if cleanup_instance_disks:
                # Don't block on Volume errors if we're trying to
                # delete the instance as we may be partially created
                # or deleted
                ctxt.reraise = False
                LOG.warning(
                    "Ignoring Volume Error on vol %(vol_id)s "
                    "during delete %(exc)s",
                    {'vol_id': vol.get('volume_id'),
                     'exc': encodeutils.exception_to_unicode(exc)},
                    instance=instance)

In all other scenarios, Nova will not proceed further if the disconnect was not successful.

If Nova does proceed past _disconnect_volume(), it will later call the Cinder API to delete the attachment [2]. I assume that is what does the unmap/unexport.

[1] https://github.com/openstack/nova/blob/1bf98f128710c374a0141720a7ccc21f5d1afae0/nova/virt/libvirt/driver.py#L1445-L1459 (ussuri)
[2] https://github.com/openstack/nova/blob/1bf98f128710c374a0141720a7ccc21f5d1afae0/nova/compute/manager.py#L2922 (ussuri)

Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

I believe it can be a bit challenging for Ubuntu users to introduce the recheck_wwid parameter. From what I have checked, the parameter is supported by multipath-tools, but only by the package shipped with Ubuntu 22.04 LTS. Older Ubuntu releases do not have this possibility and give an error:
/etc/multipath.conf line XX, invalid keyword: recheck_wwid

I made this assumption based on the release documentation:
- for ubuntu 20.04: https://manpages.ubuntu.com/manpages/focal/en/man5/multipath.conf.5.html
- for ubuntu 22.04: https://manpages.ubuntu.com/manpages/jammy/en/man5/multipath.conf.5.html

So it seems that the Zed OpenStack release (and partially Yoga) can take such a parameter directly, but older releases have to handle this change differently.

I know that OpenStack code is independent of Linux distros, but I just wanted to add this info here as something worth considering.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I don't know if my assumption is correct or not, because I can't reproduce the multipath device mapper situation from the report (some paths failed, some active) no matter how much I force things to break in different ways.

Since each iSCSI storage backend behaves differently, I don't know if I can't reproduce it because of the difference in behavior or because the way I'm trying to reproduce it is different. It may even be that multipathd is different on my system.

Unfortunately I don't know if the host where that happened had leftover devices before the leak happened, or what the SCSI IDs of the 2 volumes involved really are.

From os-brick's connect_volume perspective what it did is the right thing, because when it looked at the multipath device containing the newly connected devices it was dm-9, so that's the one that it should return.

How multipath ended up with 2 different volumes in the same device mapper, I don't know.

I don't think "recheck_wwid" would solve the issue because os-brick would be too fast in finding the multipath and it wouldn't give enough time for multipathd to activate the paths and form a new device mapper.

In any case I strongly believe that nova should never proceed to delete the cinder attachment if detaching with os-brick fails because that usually implies data loss.

The exception would be when the cinder volume is going to be deleted after disconnecting it, and in that case the disconnect call to os-brick should always be forced, since data loss is irrelevant.

That would ensure that compute nodes are not left with leftover devices that could cause problems.

I'll see if I can find a reasonable improvement in os-brick that would detect these issues and fail the connection, although it's probably going to be a bit of a mess.

Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

@Gorka Eguileor: I can try to reproduce this case with the recheck_wwid option set to yes once a multipath-tools package that supports it is available for Ubuntu 20.04.

What I can add is that it happened only on one compute node, but I've seen similar warnings in the dmesg -T output of other compute nodes, which look dangerous, although so far I haven't faced a similar issue there:

[Thu Feb 9 14:28:16 2023] scsi_io_completion: 42 callbacks suppressed
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 CDB: Read(10) 28 00 03 bf ff 00 00 00 08 00
[Thu Feb 9 14:28:16 2023] print_req_error: 42 callbacks suppressed
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914304
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 CDB: Read(10) 28 00 03 bf ff 00 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914304
[Thu Feb 9 14:28:16 2023] buffer_io_error: 30 callbacks suppressed
[Thu Feb 9 14:28:16 2023] Buffer I/O error on dev sdgr1, logical block 62686976, async page read
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 CDB: Read(10) 28 00 03 bf ff 01 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914305
[Thu Feb 9 14:28:16 2023] Buffer I/O error on dev sdgr1, logical block 62686977, async page read
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 CDB: Read(10) 28 00 03 bf ff 02 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914306
[Thu Feb 9 14:28:16 2023] Buffer I/O error on dev sdgr1, logical block 62686978, async page read
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 CDB: Read(10) 28 00 03 bf ff 03 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req...


Revision history for this message
Gorka Eguileor (gorka) wrote :

Don't bother trying with recheck_wwid, as it won't work due to the speed of os-brick.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I have finally been able to reproduce the issue.

So far I have been able to identify 3 different ways to create similar situations to the reported one, and it was what I thought, leftover devices from a 'nova delete' call.

Took me longer to figure it out because it requires an iSCSI Cinder driver that uses shared targets, and the one I use doesn't.

After I locally modified the cinder driver code to do target sharing and then force a disconnect error on specific Nova calls to os-brick I was able to work it out.

I have a local patch that detects these issues and fixes them the best it can, but I wouldn't like to backport that, because the fixing is a bit scary as a backport.

So I'll split the code into 2 patches:

- The backportable patch, which prevents the connection when a potential leak is detected. Fixing the situation will then require manual intervention.

- Another patch that extends the previous code to try to fix things when possible.

Revision history for this message
melanie witt (melwitt) wrote :

> In any case I strongly believe that nova should never proceed to delete the cinder attachment if detaching with os-brick fails because that usually implies data loss.

> The exception would be when the cinder volume is going to be delete after disconnecting it, and in that case the disconnect call to os-brick should be always forced, since data loss is irrelevant.

> That would ensure that compute nodes are not left with leftover devices that could cause problems.

Understood. I guess that must mean that the reported bug scenario is a volume that is *not* delete_on_termination=True attached to an instance that is being deleted.

I think we could probably propose a patch in nova to not delete the attachment if it's instance delete + not delete_on_termination.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Melanie,

In my opinion there should be 2 code changes to prevent leaving devices behind:

- The instance deletion operation should fail like a normal volume-detach call does when the disconnect_volume call fails; even if the instance is left in a "weird" state, manual intervention is usually necessary to fix things.
  This manual intervention does not necessarily mean doing something to the volume, it can be fixing the network.

- Any Cinder volume with delete_on_termination=True should have its os-brick disconnect_volume call made with the "force=True, ignore_errors=True" parameters (see the sketch after this list).
  The tricky part here is that not all os-brick connectors support the force parameter, so when the call fails we have to decide whether to halt the operation and wait for human intervention, or just log it and continue as we do today.
  We could make an effort in os-brick to increase coverage of the force parameter.
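
A rough sketch of what the second point could look like on the Nova side (hypothetical code, not the actual patch; the connector, bdm, connection_info and device_info names are assumed to be the usual objects available in the libvirt volume-detach path):

    def disconnect_for_instance_delete(connector, bdm, connection_info,
                                       device_info):
        """Hedged sketch of the proposal above, not the actual Nova change."""
        # Force (and ignore errors) only when the volume is going to be
        # deleted on instance termination; otherwise let a flush failure
        # propagate so Cinder is never asked to unmap a volume that is
        # still present on the host.
        force = bool(bdm.delete_on_termination)
        connector.disconnect_volume(connection_info['data'], device_info,
                                    force=force, ignore_errors=force)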

Thanks,
Gorka.

Revision history for this message
Dan Smith (danms) wrote :

Our policy is that instance delete should never fail, and I think that's the experience the users expect. Perhaps we need to still mark the instance deleted immediately and continue retrying the volume detach in a periodic until it succeeds, but that's the only thing I can see working.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Agree with Dan, we shouldn't raise an exception on instance delete but rather possibly make some status available for knowing whether the volume was eventually detached.

For example, we accept to delete an instance if the compute goes down (as the user may not know that the underlying compute is in a bad state) and we only delete the instance when the compute is back.

That being said, I don't really see how we can easily fix this in a patch, as we should discuss this properly. Would a LOG statement advertising that the volume connection is still present help?

Revision history for this message
melanie witt (melwitt) wrote :

We definitely should not allow a delete to fail from a user's perspective.

My suggestion of a patch to not delete an attachment when detach fails during instance delete if delete_on_termination=False is intended to be better than what we have today, not necessarily to be perfect.

We could consider doing a periodic like Dan mentions. We already do similar with our "cleanup running deleted instances" periodic. The volume attachment cleanup could be hooked into that if it doesn't already do it.

From what I can tell, our periodic is already capable of taking care of it, but it's not enabled [1][2]:

    elif action == 'reap':
        LOG.info("Destroying instance with name label "
                 "'%s' which is marked as "
                 "DELETED but still present on host.",
                 instance.name, instance=instance)
        bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
            context, instance.uuid, use_slave=True)
        self.instance_events.clear_events_for_instance(instance)
        try:
            self._shutdown_instance(context, instance, bdms,
                                    notify=False)
            self._cleanup_volumes(context, instance, bdms,
                                  detach=False)

    def _cleanup_volumes(self, context, instance, bdms, raise_exc=True,
                         detach=True):
        original_exception = None
        for bdm in bdms:
            if detach and bdm.volume_id:
                try:
                    LOG.debug("Detaching volume: %s", bdm.volume_id,
                              instance_uuid=instance.uuid)
                    destroy = bdm.delete_on_termination
                    self._detach_volume(context, bdm, instance,
                                        destroy_bdm=destroy)
                except Exception as exc:
                    original_exception = exc
                    LOG.warning('Failed to detach volume: %(volume_id)s '
                                'due to %(exc)s',
                                {'volume_id': bdm.volume_id, 'exc': exc})

            if bdm.volume_id and bdm.delete_on_termination:
                try:
                    LOG.debug("Deleting volume: %s", bdm.volume_id,
                              instance_uuid=instance.uuid)
                    self.volume_api.delete(context, bdm.volume_id)
                except Exception as exc:
                    original_exception = exc
                    LOG.warning('Failed to delete volume: %(volume_id)s '
                                'due to %(exc)s',
                                {'volume_id': bdm.volume_id, 'exc': exc})
        if original_exception is not None and raise_exc:
            raise original_exception

Currently we're calling _cleanup_volumes with detach=False. Not sure what the reason for that is but if we determine there should be no problems with it, we can change it to detach=True in combination with not deleting the attachment on instance delete if delete_on_termination=False.

[1] https://github.com/openstack/nova/blob/a2964417822bd1a4a83fa5c27282d2be1e18868a/nova/compute/manager.py#L10579
[2] https://github.com/openstack/nova/blob/a2964417822bd1a4a83f...


Revision history for this message
Gorka Eguileor (gorka) wrote :

What is the reason why Nova has the policy that deleting the instance should never fail?

I'm talking about the instance record, not the VM itself, because I agree that the VM should always be deleted to free resources.

From my perspective deleting the instance record would result in a very weird user experience and in users manually creating the same situation we are trying to avoid.

- User requests instance deletion
- The call to disconnect_volume fails
- Nova removes everything it can and at the end even the instance record, while it keeps trying to disconnect the device in the background.
- User wants to use the volume again but sees that it's in-use in Cinder
- Looks for the instance in Nova thinking that something may have gone wrong, but not seeing it there thinks it's a problem between cinder and nova.
- Runs the `cinder delete-attachment` command to return the volume to available state.

We end up in the same situation as we were before, with leftover devices.

Revision history for this message
Dan Smith (danms) wrote :

Because the user wants to delete a thing in our supposed "elastic infrastructure". They want their quota back, they want to stop being billed for it, they want the IP for use somewhere else, or whatever. They don't care that we can't delete it because of some backend failure - that's not their problem. That's why we have the ability to queue the delete even if the compute is down - that's how important it is.

It's also not at all about deleting the VM, it's about the instance going away from the perspective of the user (i.e. marking the instance record as deleted). The instance record is what determines if they're billed for it, if their quota is used, etc. We "charge" the user the same whether the VM is running or not. Further, even if we have stopped the VM, we cannot re-assign the resources committed to that VM until the deletion completes in the backend. Another scenario that infuriates operators is "I've deleted a thing, the compute node should be clear, but the scheduler tells me I can't boot something else there."

Your example workflow is exactly why I feel like the solution to this problem can't (entirely) be one of preventing a delete if we fail to detach. Because the admins will just force-delete/detach/reset-state/whatever until things free up (as I would expect to do myself). Especially if the user is demanding that they get their quota back, stop being billed, and/or attach the volume somewhere else.

It seems to me that there *must* be some way to ensure that we never attach a volume to the wrong place. Regardless of how we get there, there must be some positive affirmation that we're handing precious volume data to the right person.

Revision history for this message
Gorka Eguileor (gorka) wrote :

The quota/billing issue is a matter of Nova code. In cinder we resolve it by having a flag for resources (volume and snapshots) to reflect whether they consume quota or not.

The same thing could be done in Nova to reflect what resources are actually consumed by the instance (IPs, VMs, GPUs, etc) and therefore billable.

Users not caring about backend errors would be, in my opinion, naive thinking on their part, since they DO CARE about their persistent data being properly written and they want to avoid data loss, data corruption, and data leakage above all else.

I assume users would also want to have a consistent view of their resources, so if a volume says it's attached to an instance the instance should still exist, otherwise there is an invalid reference.

Data leak/corruption may be prevented in some cases with the code I'm working on for os-brick (although some drivers are missing the feature required), but that won't prevent data loss. For that Nova would need to do the sensible thing.

I'm going to do some additional testing today, because this report is about something that happens accidentally, but I believe there is a way to actually exploit this to gain access to other users' data. Though fixing that would require yet another bunch of code.

In other words, there are 3 different things to fix here:

- Nova doing the right thing to prevent data corruption/leak/loss.
- os-brick detection of the right volume to prevent data leak.
- Prevent intentional data leak.

Revision history for this message
Jeremy Stanley (fungi) wrote :

If there is indeed a way for a normal user (not an operator) of the environment to cause this information leak to happen and then take advantage of it, we should find a way to prevent at least that aspect before making this report public.

If it's not a condition that a normal user can intentionally cause to happen, then it's probably fine to fix this in public instead.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Gorka, Nova doesn't even really know about the Cinder backends; it just uses os-brick.

So, when Nova asks to attach a volume, only os-brick knows whether it's the right volume. That's why I think it's important for brick to be able to say 'no'.

Revision history for this message
Dan Smith (danms) wrote :

Right, we have to trust os-brick to give us a block device that is actually the thing we're supposed to attach to the guest.

I'm really concerned about what sounds like a very loose association between what we pass to brick from cinder and what we get back from brick in terms of a block device. Isn't there some way for brick to walk the multipath device and the backing iSCSI/FC devices to check WWNs or something to ensure that it's consistent and points to what we expect?
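
Conceptually something like that is possible from userspace, because device-mapper exposes its member devices in sysfs and each SCSI device's WWID can be read with scsi_id. A rough, hypothetical sketch of such a walk (not the eventual os-brick change, and the scsi_id path can differ by distribution):

    # Hedged sketch: check that every path under a multipath device reports
    # the WWID we expect before handing the device to the caller.
    import os
    import subprocess

    def paths_match_expected_wwid(dm_name, expected_wwid):
        """Return True if all slaves of /dev/<dm_name> report expected_wwid."""
        slaves_dir = '/sys/block/%s/slaves' % dm_name
        for dev in os.listdir(slaves_dir):
            result = subprocess.run(
                ['/lib/udev/scsi_id', '--whitelisted',
                 '--device=/dev/%s' % dev],
                capture_output=True, text=True, check=True)
            if result.stdout.strip() != expected_wwid:
                # A path in this device mapper belongs to a different volume.
                return False
        return True

    # Illustrative call using the values from this report:
    # paths_match_expected_wwid('dm-9', '36e00084100ee7e7ed6ad25d900002f6b')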

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

> If there is indeed a way for a normal user (not an operator) of the environment to cause this information leak to happen and then take advantage of it, we should find a way to prevent at least that aspect before making this report public.

Well, I'm trying hard to find a possible attack vector from a malicious user and I don't see any.
I don't disagree with the bug report as it can potentially leak data to any instance, but I don't know how someone could take advantage of this information.

Here, I'm just one voice and I leave others to chime in, but I'm in favor of making this report public so we can discuss the potential solutions with the stakeholders and any operator having concerns about it.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Let me summarize things:

1. The source of the problem reported in this bug is that Nova has been doing something wrong since forever. I've been bringing this up for the past 7 years, and every single time we end up in the same place, nova giving priority to instance deletion over everything else.

2. There are some things that os-brick can do to try to detect when Nova doesn't do its job right, but this is equivalent to a taxi driver asking passengers to learn to fall because the car is not going to stop when they want to get off. It's a lot harder to do and it doesn't sound all that reasonable.

3. There is an attack vector that can be exploited and it's pretty easy to do (I've done it locally), but it's separate from the issue reported here and it hasn't existed for as long as that one. I would resolve this in a different way than the workaround mentioned in #2.

Seeing as we are back to the same conversation of the past 7 years, we'll probably end up in the same place, so I'll just do my best to resolve the attack vector and also introduce code to resolve Nova's mistakes.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Oh, I failed to clarify something. The user exploit case can be made secure (as far as I can tell), but for the scenario in this bug's description the only secure solution is fixing nova; the os-brick code I'm working on will only reduce the window where the data is leaked or can be corrupted.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Gorka, I don't want to debate each project's responsibility, but I'd rather focus on the data leakage, which is the subject of this security report.

The fact that a volume detach can leave residue if a flush error occurs is certainly not ideal, but this isn't a security problem *UNTIL* the remaining devices are reused.
To me, it appears that the data leak occurs on the attach and not on the detach, and I'd prefer to see os-brick avoiding this situation.

That being said, I think Melanie, Dan and I agreed on trying to find a way to asynchronously clean up the devices (see comments #24 #25 and #27) and that can be discussed publicly, but again, this won't help with the data leakage that occurs on the attach command.

Revision history for this message
Dan Smith (danms) wrote :

Okay Gorka and I just had a nice long chat about things and I think we made some progress on understanding the (several) ways we can get into this situation and came up with some action items. I'll try to summarize here and I'll look for Gorka to correct me if I get anything wrong.

I think that we're now on the same page that delete of a running instance is much more of a forceful act than some might think, and that we expect to try to be graceful with that, but with a limited amount of patience before we kill it with fire. That maps to us actually always calling force=True when we do the detachment. Even with force=True, brick *tries* to flush and disconnect gracefully, but if it can't, will cut things off at the knees. Thus, if we did force=True now, we wouldn't get into the situation the bug describes because we would *definitely* have cleaned up at that point.

It sounds like there are some robustification steps that can be made in brick to do more validation of the full chain from instance->multipathd->iscsi->volume when we're doing attachments to try to avoid getting into the situation described by this bug, so Gorka is going to work on that.

Gorka also described another way to get into this situation, which is much more exploitable by the user, and I'll let him describe it in more detail. But the short story is that cinder should not let users delete attachments for instances that nova says are running (i.e. not deleted).

Multipathd, while well-intentioned, also has some behavior that is counterproductive when recovering from various situations where paths to a device get disconnected. The recheck_wwid option in multipathd should be a recommended flag to have enabled, to reduce the likelihood of that happening. Especially in the case where nova has allowed a blind delete due to a downed compute node, we need multipathd to not "help" by reattaching things without extra checks.

So, the action items roughly are:

1. Nova should start passing force=True in our call to brick detach for instance delete
2. Recommend the recheck_wwid flag for multipathd, and get deployment tools to enable it
3. Robustification of brick's attach workflow to do some extra sanity checks
4. Cinder should refuse to allow users to delete an attachment for an active volume

Based on the cinder user-exploitable attack vector, it sounds to me like we should keep this bug private on that basis until we have at least the cinder/nova validation step in place. We could create another one for just that scenario, but publicizing the accidental scenario and discussion we have in this bug now might be enough of a suggestion that more people would figure out the user-oriented attack.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Sylvain, the data leak/corruption presented in this bug report is caused by the detach on the nova side.

It may happen when we do the attach, but it is 100% caused by the detach problem, so just focusing on the attach part is not right considering the RCA is the leftover devices from the detach.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Gorka, I eventually understood all the problems we have, and what Dan wrote in comment #38 looks good to me as action items.

Yeah, we need to keep this bug private for a bit until we figure out a solid plan for fixing those 4 items, and yeah, we need to force-delete the attachment while we also try to solidify the attachment calls.

Jeremy Stanley (fungi)
description: updated
summary: - [ussuri] Wrong volume attachment - volumes overlapping when connected
- through iscsi on host
+ Unauthorized volume access through deleted volume attachments
+ (CVE-2023-2088)
Changed in ossa:
status: Incomplete → In Progress
importance: Undecided → High
assignee: nobody → Jeremy Stanley (fungi)
Jeremy Stanley (fungi)
description: updated
information type: Private Security → Public Security
Changed in ossn:
assignee: nobody → Jeremy Stanley (fungi)
importance: Undecided → High
status: New → In Progress
Changed in glance-store:
status: New → In Progress
Changed in cinder:
status: New → In Progress
Jeremy Stanley (fungi)
summary: - Unauthorized volume access through deleted volume attachments
- (CVE-2023-2088)
+ [OSSA-2023-003] Unauthorized volume access through deleted volume
+ attachments (CVE-2023-2088)
Changed in os-brick:
status: New → In Progress
Changed in nova:
status: New → In Progress
Changed in ossa:
status: In Progress → Fix Released
Jeremy Stanley (fungi)
Changed in ossn:
status: In Progress → Fix Released
Changed in glance-store:
status: In Progress → Fix Released
tags: added: in-stable-yoga
Changed in nova:
status: In Progress → Fix Released
tags: added: in-stable-zed
Changed in cinder:
status: In Progress → Fix Released
Changed in kolla-ansible:
status: New → In Progress
tags: added: in-stable-wallaby
tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to glance_store (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882908
Committed: https://opendev.org/openstack/glance_store/commit/712eb6df3b79009b49c0cf075675d75f14281914
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 712eb6df3b79009b49c0cf075675d75f14281914
Author: Brian Rosmaita <email address hidden>
Date: Wed May 10 20:17:36 2023 -0400

    Update 'extras' for cinder driver

    Raise the min version of os-brick to include the fix for
    CVE-2023-2088.

    Change-Id: Ic8bc4d7ae7e38eca65be01184add7ae1ca377a22
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882867
Committed: https://opendev.org/openstack/nova/commit/b574901500d936488cdedf9fda90c4d36eeddd97
Submitter: "Zuul (22348)"
Branch: stable/xena

commit b574901500d936488cdedf9fda90c4d36eeddd97
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host which
    could then potentially be reused to map to volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Conflicts:
        nova/tests/unit/virt/libvirt/volume/test_lightos.py
        nova/virt/libvirt/volume/lightos.py

    NOTE(melwitt): The conflicts are because change
    Ic314b26695d9681d31a18adcec0794c2ff41fe71 (Lightbits LightOS driver) is
    not in Xena.

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
    (cherry picked from commit db455548a12beac1153ce04eca5e728d7b773901)
    (cherry picked from commit efb01985db88d6333897018174649b425feaa1b4)
    (cherry picked from commit 8b4b99149a35663fc11d7d163082747b1b210b4d)
    (cherry picked from commit 4d8efa2d196f72fdde33136a0b50c4ee8da3c941)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882868
Committed: https://opendev.org/openstack/nova/commit/6cc4e7fb9ac49606c598e72fcd3d6cf02efac4f1
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 6cc4e7fb9ac49606c598e72fcd3d6cf02efac4f1
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    Currently, nova does not however have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
    (cherry picked from commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc)
    (cherry picked from commit 1f781423ee4224c0871ab4aafec191bb2f7ef0e4)
    (cherry picked from commit 0d6dd6c67f56c9d4ed36246d14f119da6bca0a5a)
    (cherry picked from commit 98c3e3707c08a07f7ca5996086b165512f604ad6)
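
For context, enabling that section in nova.conf looks roughly like the fragment below (illustrative placeholder values; check the Nova documentation for your release for the exact options):

    [service_user]
    send_service_user_token = true
    auth_type = password
    auth_url = https://keystone.example.com/v3
    username = nova
    password = <service user password>
    project_name = service
    user_domain_name = Default
    project_domain_name = Default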

Revision history for this message
Zakhar Kirpichenko (kzakhar) wrote :

I apologize for the late response. My volumes are Ceph RBD, not sure which driver Nova uses internally.

Thanks for your feedback and fixes, everyone!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store 4.3.1

This issue was fixed in the openstack/glance_store 4.3.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store 3.0.1

This issue was fixed in the openstack/glance_store 3.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store 4.1.1

This issue was fixed in the openstack/glance_store 4.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.2.0

This issue was fixed in the openstack/nova 25.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.2.0

This issue was fixed in the openstack/nova 26.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 22.1.0

This issue was fixed in the openstack/cinder 22.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 27.1.0

This issue was fixed in the openstack/nova 27.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 20.3.0

This issue was fixed in the openstack/cinder 20.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 21.3.0

This issue was fixed in the openstack/cinder 21.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/883360

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882869
Committed: https://opendev.org/openstack/nova/commit/5b4cb92aa8adab2bd3d7905e0b76eceab680ab28
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5b4cb92aa8adab2bd3d7905e0b76eceab680ab28
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host which
    could then potentially be reused to map to volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Conflicts:
        nova/tests/unit/virt/libvirt/volume/test_lightos.py
        nova/virt/libvirt/volume/lightos.py

    NOTE(melwitt): The conflicts are because change
    Ic314b26695d9681d31a18adcec0794c2ff41fe71 (Lightbits LightOS driver) is
    not in Xena.

    NOTE(melwitt): The difference from the cherry picked change is because
    of the following additional affected volume driver in Wallaby:
        * nova/virt/libvirt/volume/net.py

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
    (cherry picked from commit db455548a12beac1153ce04eca5e728d7b773901)
    (cherry picked from commit efb01985db88d6333897018174649b425feaa1b4)
    (cherry picked from commit 8b4b99149a35663fc11d7d163082747b1b210b4d)
    (cherry picked from commit 4d8efa2d196f72fdde33136a0b50c4ee8da3c941)
    (cherry picked from commit b574901500d936488cdedf9fda90c4d36eeddd97)
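
As an illustration of the force flag discussed above, a minimal os-brick
usage sketch follows. The connection_properties and device_info values are
placeholders; in a real deployment they come from Cinder's
initialize_connection response and the earlier connect_volume call.

from os_brick.initiator import connector

# Placeholder values for illustration only.
connection_properties = {'target_portal': '192.0.2.10:3260',
                         'target_iqn': 'iqn.2004-04.example:target',
                         'target_lun': 1}
device_info = {'path': '/dev/sdb'}

conn = connector.InitiatorConnector.factory(
    'ISCSI', root_helper='sudo', use_multipath=False)

# force=True still removes the devices from the host even if flushing
# fails; os-brick first attempts a graceful flush and disconnect.
conn.disconnect_volume(connection_properties, device_info,
                       force=True, ignore_errors=True)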

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ossa (master)

Reviewed: https://review.opendev.org/c/openstack/ossa/+/883202
Committed: https://opendev.org/openstack/ossa/commit/136b24c5ddfaff6f4957af9bc9b84fa1b7deb6e3
Submitter: "Zuul (22348)"
Branch: master

commit 136b24c5ddfaff6f4957af9bc9b84fa1b7deb6e3
Author: Jeremy Stanley <email address hidden>
Date: Mon May 15 18:52:55 2023 +0000

    Add errata 3 for OSSA-2023-003

    Since this only impacts the fix for stable/wallaby, which is not
    under normal maintenance, we'll dispense with the usual errata
    announcements.

    Change-Id: Ibd0d1d796012fb5d34d48925ce34f6f1c300b54e
    Related-Bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882870
Committed: https://opendev.org/openstack/nova/commit/48150a6fbab7e2a7b9fbeaa39110d0e6f7f37aaf
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 48150a6fbab7e2a7b9fbeaa39110d0e6f7f37aaf
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    Currently, however, nova does not have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
    (cherry picked from commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc)
    (cherry picked from commit 1f781423ee4224c0871ab4aafec191bb2f7ef0e4)
    (cherry picked from commit 0d6dd6c67f56c9d4ed36246d14f119da6bca0a5a)
    (cherry picked from commit 98c3e3707c08a07f7ca5996086b165512f604ad6)
    (cherry picked from commit 6cc4e7fb9ac49606c598e72fcd3d6cf02efac4f1)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/883951

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/882839
Committed: https://opendev.org/openstack/cinder/commit/68fdc323369943f494541a3510e71290b091359f
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 68fdc323369943f494541a3510e71290b091359f
Author: Gorka Eguileor <email address hidden>
Date: Thu Feb 16 15:57:15 2023 +0100

    Reject unsafe delete attachment calls

    Due to how the Linux SCSI kernel driver works, there are some storage
    systems, such as iSCSI with shared targets, where a normal user can
    access other projects' volume data connected to the same compute host
    using the attachments REST API.

    This affects both single and multi-pathed connections.

    To prevent users from doing this, unintentionally or maliciously,
    cinder-api will now reject some delete attachment requests that are
    deemed unsafe.

    Cinder will process the delete attachment request normally in the
    following cases:

    - The request comes from an OpenStack service that is sending the
      service token that has one of the roles in `service_token_roles`.
    - Attachment doesn't have an instance_uuid value
    - The instance for the attachment doesn't exist in Nova
    - According to Nova the volume is not connected to the instance
    - Nova is not using this attachment record

    There are 3 operations in the actions REST API endpoint that can be used
    for an attack:

    - `os-terminate_connection`: Terminate volume attachment
    - `os-detach`: Detach a volume
    - `os-force_detach`: Force detach a volume

    In this endpoint we simply reject most requests that do not come from
    a service. The rules applied are the same as for attachment delete,
    described earlier, but here the attachment id may not be available, so
    the checks are more restrictive. This should not be a problem for
    normal operations because:

    - Cinder backup doesn't use the REST API but RPC calls via RabbitMQ
    - Glance doesn't use this interface

    Whether the caller is a service is determined at the cinder-api level
    by checking that the service user that made the call has at least one of
    the roles in the `service_token_roles` configuration. These roles are
    retrieved from keystone by the keystone middleware using the value of
    the "X-Service-Token" header.

    If Cinder is configured with `service_token_roles_required = true` and
    an attacker provides valid non-service credentials, the service will
    return a 401 error; otherwise it will return a 409, as if a normal user
    had made the call without the service token.

    Closes-Bug: #2004555
    Change-Id: I612905a1bf4a1706cce913c0d8a6df7a240d599a
    (cherry picked from commit 6df1839bdf288107c600b3e53dff7593a6d4c161)
    Conflicts:
            cinder/exception.py
    (cherry picked from commit dd6010a9f7bf8cbe0189992f0848515321781747)
    (cherry picked from commit cb4682fb836912225c5da1536108a0d05fd5c46e)
    Conflicts:
            cinder/exception.py
    (cherry picked from commit a66f4afa22fc5a0a85d5224a6b63dd766fef47b1)
    Conflicts:
            cinder/compute/nova.py
            cinder/tests/unit/attach...

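To make the rejection rules listed in the commit message above concrete,
here is a hypothetical sketch of the decision logic. All names
(is_service_request, may_delete_attachment, the nova helper object and its
attributes) are illustrative and are not cinder's actual implementation.

def is_service_request(context, service_token_roles):
    # The keystone middleware validates the X-Service-Token header and
    # exposes the service user's roles on the request context (assumed
    # here as context.service_roles).
    return bool(set(context.service_roles or []) & set(service_token_roles))


def may_delete_attachment(context, attachment, nova, service_token_roles):
    if is_service_request(context, service_token_roles):
        return True   # trusted service-to-service call
    if not attachment.instance_uuid:
        return True   # attachment is not tied to any instance
    server = nova.get_server(attachment.instance_uuid)
    if server is None:
        return True   # the instance no longer exists in Nova
    if attachment.volume_id not in server.attached_volume_ids:
        return True   # Nova says the volume is not connected to the instance
    if server.attachment_id_for(attachment.volume_id) != attachment.id:
        return True   # Nova is not using this attachment record
    return False      # otherwise reject the delete (HTTP 409)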

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/883360
Committed: https://opendev.org/openstack/cinder/commit/1101402b8fda7423b41b2f2e078f8f5a1d2bb4bd
Submitter: "Zuul (22348)"
Branch: master

commit 1101402b8fda7423b41b2f2e078f8f5a1d2bb4bd
Author: Gorka Eguileor <email address hidden>
Date: Wed May 17 13:42:41 2023 +0200

    Doc: Improve service token

    This patch extends the documentation for the service token
    configuration a bit, since there have been complaints about its
    clarity and completeness.

    Related-Bug: #2004555
    Change-Id: Id89497d068c1644e4615fc0fb85c4d1a139ecc19

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/884571

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/882848
Committed: https://opendev.org/openstack/os-brick/commit/70493735d2f99523c4a23ecbeed15969b2e81f6b
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 70493735d2f99523c4a23ecbeed15969b2e81f6b
Author: Gorka Eguileor <email address hidden>
Date: Wed Mar 1 13:08:16 2023 +0100

    Support force disconnect for FC

    This patch adds support for the force and ignore_errors on the
    disconnect_volume of the FC connector like we have in the iSCSI
    connector.

    Related-Bug: #2004555
    Change-Id: Ia74ecfba03ba23de9d30eb33706245a7f85e1d66
    (cherry picked from commit 570df49db9de3030e658619138588b836c007f8c)
    Conflicts:
            os_brick/initiator/connectors/fibre_channel.py
    (cherry picked from commit 111b3931a2db1d5be4ebe704bf26c34fa9408483)
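
Analogous to the iSCSI sketch earlier, a minimal illustration of the FC
connector path with the new flags follows; the values are placeholders, not
the actual driver code.

from os_brick.initiator import connector

# Placeholder FC connection properties for illustration only.
connection_properties = {'target_wwn': ['500a098188dead01'], 'target_lun': 1}
device_info = {'path': '/dev/sdc'}

fc = connector.InitiatorConnector.factory(
    'FIBRE_CHANNEL', root_helper='sudo', use_multipath=True)

# With this backport the FC connector honours force/ignore_errors like the
# iSCSI connector, so flush failures no longer leave devices on the host.
fc.disconnect_volume(connection_properties, device_info,
                     force=True, ignore_errors=True)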

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/cinder/+/885553

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/cinder/+/885554

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/cinder/+/885555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/cinder/+/885556

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/os-brick/+/885558

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/os-brick/+/885559

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/os-brick/+/885560

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/os-brick/+/885561

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885558
Committed: https://opendev.org/openstack/os-brick/commit/5dcda6b961fa765c817f94a782a6fff48295c89a
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5dcda6b961fa765c817f94a782a6fff48295c89a
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/wallaby, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: I6345a5a3a7c08c88233b47806c28284fa2dd87d3
    Related-bug: #2004555

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885560
Committed: https://opendev.org/openstack/os-brick/commit/2845871c87fc4e6384bd16d81832cc71e2fb0d61
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 2845871c87fc4e6384bd16d81832cc71e2fb0d61
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/ussuri, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: Ie54cfc6697b4e54d37fd66dbad2ff20971399c00
    Related-bug: #2004555

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885559
Committed: https://opendev.org/openstack/os-brick/commit/78a0ea24a586139343c98821f9914901f1b5ec5b
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 78a0ea24a586139343c98821f9914901f1b5ec5b
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/victoria, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: I37da3be26c7099307b46ae6b6320a3de7658e106
    Related-bug: #2004555

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/train)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885561
Committed: https://opendev.org/openstack/os-brick/commit/0cc7019eec2b58f507905d52370a74eb80613b99
Submitter: "Zuul (22348)"
Branch: stable/train

commit 0cc7019eec2b58f507905d52370a74eb80613b99
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/train, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: I6d04c164521b72538665f53ab62250b14b2710fe
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/train)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885556
Committed: https://opendev.org/openstack/cinder/commit/299553a4fe281cde9b14da34a470dcdb3ed17cc0
Submitter: "Zuul (22348)"
Branch: stable/train

commit 299553a4fe281cde9b14da34a470dcdb3ed17cc0
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/train, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I1621e3d3d9272a7a25b2d9d9e6710efb6b637a89
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885554
Committed: https://opendev.org/openstack/cinder/commit/63d7848a9548180d283a833beb7c5718e0ad0bdb
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 63d7848a9548180d283a833beb7c5718e0ad0bdb
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/victoria, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I2866b0ca1511a53b096b73bbe51a74588cdd8947
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885555
Committed: https://opendev.org/openstack/cinder/commit/60f705d722fc6b7c434194a9f3b11595294d6aa0
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 60f705d722fc6b7c434194a9f3b11595294d6aa0
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/ussuri, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I5c55ab7ca6c85d23c5ab7d2d383a18226735aaf2
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885553
Committed: https://opendev.org/openstack/cinder/commit/2fef6c41fa8c5ea772cde227a119dcf22ce7a07d
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 2fef6c41fa8c5ea772cde227a119dcf22ce7a07d
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/wallaby, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I83b5232076250553650b8b97409cbf72e90c15b9
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 23.0.0.0rc1

This issue was fixed in the openstack/cinder 23.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 28.0.0.0rc1

This issue was fixed in the openstack/nova 28.0.0.0rc1 release candidate.
