OSSN-0090: Malicious image data modification can happen when using COW
- Series yoga
- Bug #1990157
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Glance | Status tracked in Zed | | |
Wallaby | New | Undecided | Unassigned |
Xena | New | Undecided | Unassigned |
Yoga | New | Undecided | Unassigned |
Zed | New | Critical | Unassigned |
OpenStack Security Notes | In Progress | Undecided | Brian Rosmaita |
Bug Description
The image location manipulation mentioned in https:/ was addressed at the time; the fix ensured that the hash stayed in the image record, causing validation to fail if the image data itself had changed.
Different efforts to speed up the boot and volume-creation processes by utilizing the Copy On Write behaviour of Ceph or various Cinder backends are creating two different scenarios in which malicious content can be sneaked into an image.
1) When Nova creates a snapshot directly in the Ceph store and creates the image through the API by adding the location via the locations API, rather than uploading the data to Glance, it omits two pieces of metadata that would allow alteration of the image data to be detected: the image has no multihash (or checksum) associated with it, making validation impossible, and it has no size metadata either.
2) When an image is consumed in a COW manner, even if it has a multihash (checksum) registered in its metadata, the data does not get validated, because the consumer does not read the whole image and calculate a checksum of it.
All current implementations of COW handling of images depend on the direct_url and locations API being exposed. Because the services access the image with user credentials, if Glance is deployed with a single API configuration serving both OpenStack services and end users, a malicious end user has all the tools needed to carry out this attack. The only real mitigation for this issue is to deploy an external API endpoint (for user access) and an internal API endpoint (for OpenStack services; note that this endpoint needs to be firewalled off from end-user access, as the credentials are the same). Additional hardening, creating the multihash metadata entries and validating them upon use, should be implemented. The dual API deployment should be highlighted clearly in the documentation.
These two behaviours mean that the image manipulation described in OSSA-2016-006 (CVE-2016-0757) is still possible.
If the image has a multihash (checksum) recorded for it, python-glanceclient will reject the image if the data does not match. But discovering the tampering requires manual verification (actually downloading the image), or a deep understanding of the technical implementation to match the location URI against the image ID (in the Ceph case). The COW consumers will not flag it to anyone and will just happily consume the modified data.
In the case that there is no multihash recorded for the image, the only indication of malicious activity would be comparing the location URI with the image ID (in the Ceph case); there are no other validation channels.
Once the location of the modified image data has been added to the image's locations table, Glance will allow deleting the original data, as it is no longer the last remaining location.
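For illustration, here is a minimal sketch of the location swap described above, using the Images v2 JSON-patch calls; the token, endpoint, image IDs, and the rbd URL are placeholders, and the calls only succeed against an endpoint that has show_multiple_locations enabled:

```python
# Hypothetical sketch of the location-swap attack, for a deployment that
# (incorrectly) exposes show_multiple_locations to end users.
# GLANCE, TOKEN, and the IDs below are placeholders.
import json
import requests

GLANCE = "http://glance.example.com:9292"
IMAGE_ID = "11111111-2222-3333-4444-555555555555"
HEADERS = {
    "X-Auth-Token": "TOKEN",
    # JSON-patch media type required by the Images v2 image-update call
    "Content-Type": "application/openstack-images-v2.1-json-patch",
}

# 1. Append a location pointing at attacker-controlled data.
add_patch = [{
    "op": "add",
    "path": "/locations/-",
    "value": {"url": "rbd://FSID/pool/MALICIOUS-IMAGE-ID/snap",
              "metadata": {}},
}]
requests.patch(f"{GLANCE}/v2/images/{IMAGE_ID}",
               data=json.dumps(add_patch), headers=HEADERS)

# 2. Remove the original location; Glance allows this because it is no
#    longer the image's last remaining location.
del_patch = [{"op": "remove", "path": "/locations/0"}]
requests.patch(f"{GLANCE}/v2/images/{IMAGE_ID}",
               data=json.dumps(del_patch), headers=HEADERS)
```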
Erno Kuvaja (jokke) wrote : | #1 |
Jeremy Stanley (fungi) wrote : | #2 |
Since this report concerns a possible security risk, an incomplete
security advisory task has been added while the core security
reviewers for the affected project or projects confirm the bug and
discuss the scope of any vulnerability along with potential
solutions.
description: | updated |
Changed in ossa: | |
status: | New → Incomplete |
Jeremy Stanley (fungi) wrote : | #3 |
Was this way of triggering the defective behavior present even at the time OSSA 2016-006 was published and nobody realized it, or did subsequent changes to Glance or other projects re-expose the vulnerability some time later?
Brian Rosmaita (brian-rosmaita) wrote : | #4 |
@Jeremy: I think it's fair to say that this issue was present at the time of OSSA 2016-006 but nobody realized it ... that's because after the initial bug was fixed, a subsequent bug was filed that did identify the issue. Here's the history of this problem:
1. "[OSSA 2016-006] Normal user can change image status if show_multiple_
https:/
what: create image, upload data (checksum is set), delete all locations, image goes to 'queued' status, can upload malicious data via the API in the normal way
- checksum can't be modified (although checksum is md5 until Rocky)
- can only do it with an image you own (or if you are admin)
- fix: don't allow the image location deletion if the result would be no locations
2. "Normal user can replace active image data if show_multiple_
https:/
what: create image, upload data (checksum is set), add a new location pointing to malicious data, delete all non-malicious locations
- checksum can't be modified (although checksum is md5 until Rocky)
- can only do it with an image you own (or if you are admin)
- fix: OSSN-0065 recommends not configuring glance with show_multiple_locations enabled
3. In Rocky (glance 17.0.0, Tagged on 2018-08-30 14:06:53 +0000)
- introduction of glance "multihash" ... like legacy "checksum", os_hash_value is set when image data is uploaded
- if the "multihash" is checked on download, data substitution from Bug #1549483 will be revealed
4. In Rocky (glance 17.0.1, Tagged on 2020-03-19 12:26:00 +0000)
- https:/
- Known Issue: "The workaround is to continue to use the show_multiple_locations option ..."
5. "Malicious image data modification can happen when using COW"
https:/
what: create an image, upload data ("multihash" is set), boot a server from the image, ask nova to create an image from the server. This image will not have "multihash" info set. Then do the attack from Bug #1549483, probably by uploading a malicious image to glance (to get the data into the backend); then add the location to the un-checksummed image and delete the original location from the un-checksummed image
- can't mitigate with a "multihash" check (because there isn't one)
- can only do it with an image you own (or if you are admin)
Erno Kuvaja (jokke) wrote (last edit ): | #5 |
@Jeremy & @Brian I'm not convinced, but can't say for sure, that this was existing when the original bug was handled in 2016. Lots of the vectors that were problematic at the time got plugged.
The problem really is that since then we've introduced "Community" visibility, the Cinder COW paths, and I think the Nova direct snapshotting, all of which greatly expand the old exposure. Especially as the COW operations do require 'show_multiple_locations' to be enabled.
Basically all deployments with Ceph are vulnerable, and users shrug the warnings off "as it's the default with Ceph", while these COW models never even tried to address the elephant in the room.
The multihash would help to identify the exploitation if it were present in all images, but like I said, that is not the case with direct snapshotting. (One of the reasons why I wanted to bring this up as a new bug.) The other part which is very worrying is that the COW-style consumers do not check the hash even when it is present, as mentioned in my bug description; this might have been overlooked during the original OSSA-2016-006, but I'm not sure it was even implemented yet at the time.
While the recommendation (Brian #4) landed in the Rocky release notes, it never made its way into any of the other documentation highlighting that these issues are actually present if the separation of gapi nodes has not been done. A good indicator of this confusion is that at least TripleO, DevStack, and I think OSA all deploy only one set of gapi to serve both end users and internal services.
So lots of the attack vectors did not exist when the mechanism was identified in 2016 and flagged as solved at the time.
Erno Kuvaja (jokke) wrote : | #6 |
There is a fairly simple solution for this too.
We could kick off an asynchronous task (all the piping is there, so only the taskflow and its triggering would need to be implemented) whenever a location is added to an image via the locations API, for Glance to go and read the data and calculate its multihash. If the image is new, like a Nova direct snapshot or an image created with the http store, we would add the multihash to the metadata (unlike what we do now). If the image already exists, we could validate the hash against the existing metadata.
It would not help in the COW cases where the hash is not verified upon consumption, but it would plug any easy way to replace the existing data with new data; one would need access to the actual storage.
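For illustration, a minimal sketch of the hash calculation such a task would perform, assuming sha512 (Glance's default os_hash_algo); the chunk source is a hypothetical stand-in for a glance_store read of the newly added location:

```python
# Sketch of the multihash computation the proposed asynchronous task
# would run after a location is added via the locations API.
import hashlib

CHUNK_SIZE = 64 * 1024  # read the image data in 64 KiB chunks

def compute_multihash(chunks, algo="sha512"):
    """Return (os_hash_algo, os_hash_value) for an iterable of byte chunks."""
    h = hashlib.new(algo)
    for chunk in chunks:
        h.update(chunk)
    return algo, h.hexdigest()

# If the image is new (e.g. a Nova direct snapshot), store the result as
# os_hash_algo/os_hash_value; if the image already has a multihash,
# compare against it and reject the new location on mismatch.
```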
Brian Rosmaita (brian-rosmaita) wrote : | #7 |
The point of my comment #4 was that this exploit is a fairly obvious implication of items 1-4 in that comment. Thus, I think there's no point in keeping this embargoed as a private security bug, and we should go ahead and make it public.
Jeremy Stanley (fungi) wrote : | #8 |
Thanks Brian. The other things to consider are:
1. If the bug is impractical for an attacker to successfully exploit, we should work on this in public.
2. If the solutions to the bug are not safely backportable to maintained stable branches, we should work on this in public.
3. If solutions to the bug are very likely to take longer than three months (our maximum embargo duration) to implement and/or would be very hard to develop in secret, we should work on this in public.
Dan Smith (danms) wrote : | #9 |
Erno, that's a good idea to re-hash the image when we add the location, but:
1. We'd probably want some way to not expose that location until it's verified, otherwise we've just shrunk the window. I don't think we have a "state" for each location today, do we?
2. It would probably be difficult to expose the result of a failure. Accidental tripping of the hash failure would result in locations just disappearing (I assume) which could be confusing.
3. We'd need some way to recover the process if we're interrupted in the middle of reading a very large image. If we had #1, then perhaps the location would never become usable, but we'd need to cover that case.
4. We'd end up with a very large additional load on glance-api that isn't there today, as all the (currently very lightweight) snapshotting of nova instances would cause a lot of additional CPU and network load to glance. Right now, you can snapshot your instances very frequently on ceph because it's so lightweight, and people do. On an edge deployment where you've got a local glance, that could end up overwhelming what is actually just a very lightweight set of glance workers for the (current) purpose of just finding images in RBD.
So, it might be worth doing this anyway, but I think we'd want to hide that behind a "verify_
Dan Smith (danms) wrote : | #10 |
Jeremy, the other thing to consider is that the real (current, immediate) fix to this is one of deployment and not so much a change to glance itself. So, OSA might want to make a change, which I guess could be covered under the embargo period, but I don't think we should expect a glance change to be released as a result of this.
Thus, publicizing it would let operators change their deployments (which many of them can probably do right away) to eliminate the concern sooner than later.
Brian Rosmaita (brian-rosmaita) wrote : | #11 |
The glance team discussed this extensively at a PTG a few years ago, and rejected background computation of the hash largely because the time/load are exactly what operators are trying to avoid by using a common ceph to back nova and glance (Dan's point 4).
Plus, if the hash isn't going to be checked by nova when the image is consumed, recomputing it for each location seems kind of pointless: all we would know is that the data at location L had hash H at the time glance checked and allowed L to be added to the image. If someone changes the data at L later (not the location URI, the actual data; it doesn't have to be malicious, it could just be some kind of failure in the backend), there's no way to know without downloading and hashing the image. So I don't think that a verify_
What could increase security would be to allow images that have the img_signature* properties on them to *always* go through the validation path. However, last time I looked, image signature verification is only available for the libvirt compute driver (not sure that's a big deal) and when NOT using the rbd image backend (which is the backend we're using here). But I think forcing the "normal" data path for signed images would not increase load too much (it's a PITA for users to set up an image that can be verified), and you would have the check done at the point of consumption, which is really where you want it. (This is the case for cinder, too; when using the cinder glance_store, the image-volume is cloned directly in the backend.)
Or, now that we have glance multi-store, an operator could use a second (non-rbd, non-cinder) store and tell end users to put all signed images into that store. (I think this would still require code changes in nova ... last I looked, if you turn verify_
In any case, I think we should offer some way to guarantee (for the users who want it) that a consumed image is verified at the point of use. But that's not directly related to this bug.
For this bug right now, I think the situation is:
- If you don't expose locations on any end-user-facing glance-api, end users cannot modify locations via the Images API.
- If you do expose locations to end users, end users can modify them, but only on their own images. So if you set the policies for publicize_image, communitize_image, and add_member to be admin-only, an end user cannot spread a malicious image outside of their own project.
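For illustration, the restriction described in the second bullet might look like this in a Glance policy file (a sketch only; policy file format and defaults vary by release):

```yaml
# Sketch of a glance policy file restricting image "broadcast"
# operations to admin, so a malicious image cannot spread beyond the
# attacker's own project.
"publicize_image": "role:admin"     # admin-only by default
"communitize_image": "role:admin"   # default allows any image owner
"add_member": "role:admin"          # default allows any image owner
```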
Erno Kuvaja (jokke) wrote : | #12 |
@Dan we do have status on the locations table and we do update it. I don't think we consume it anywhere currently, but it's there. Don't we also have the error metadata key where we already report errors from the import tasks? Maybe we can reuse that.
Honestly, I have not looked into the performance of the current hashing functions, so I cannot say yea or nay on whether the load would be significant at some snapshotting interval.
@Brian just to make clear, so we don't give a wrong exposure assumption here: I take it you are referring to the PTG discussion within the Glance security core at the time, which resulted in that release note in which we still advise the dual deployment model. This was not discussed in depth with any wider audience at the time.
IIRC the hashing operation at the time was expected to be synchronous, and we did not want to hold the client hostage on the location call while we do it.
In hindsight, our decision around (I think it was) OSSN-0065, when we decided on the blanket principle of "This covers all the future iterations of this location swap vulnerability", was wrong. It has become clear over the years that the release note gets ignored, we do not have the approach properly documented anywhere in the deployment guide, and the issue has clearly expanded in the field rather than narrowed. And at frequent intervals we get people asking why this issue is still not plugged (like this time, our RH Leadership Team).
I tend to agree with you that making yet another config option will not increase the security of the service. When the very first fix ("We've fixed this by not allowing deletion of the last image location") was released, we had no scenario in which an image would not have its hash in place, and that hash was supposed to catch this. And we cannot blame any consumer for not verifying the hash before we make sure that metadata is available. I think it is a key requirement for our promise of immutable images, whether the change is a malicious act, accidental corruption, etc.
I do think it is an unreasonable expectation for clouds to policy out community images because of this, when that was promoted as the user-friendly solution for not allowing everyone to publicize images. Just like it would be to turn off image sharing as well. Just because we can't be arsed to fix the underlying issues in our code.
@All Especially as any backend detail leakage is frowned upon by the community, I think we still need to reiterate the importance of deploying public and internal gapi services. But clearly that is not the right solution to the fact that it's fairly trivial to change the image payload, and hashless images make that impossible to detect. And the past has proven that deployments don't do the split anyway. When Ceph deployments are north of 60% of all OpenStack deployments, hashless images are by no means a corner case. If we plug that hole, the remaining decision is only "Do we care about exposing which backend stores are in use or not", not whether that risks the integrity of the images too.
Brian Rosmaita (brian-rosmaita) wrote : | #13 |
I guess that I'm not being clear about my position here. What I'm saying is:
1. This exploit is a straightforward implication of knowledge that has been discussed publicly (even if not a lot of people paid attention to it; my point is that it's "out there"). So I think it's important to publish an OSS{AN} (whatever the VMT thinks is appropriate) to clue in/remind operators that there is a known vulnerability that they can take action about immediately, to wit:
- deploy separate internal/external glance-api so that multiple locations are not shown to end users
- or, if that looks too destabilizing to do instantly in a deployment, restrict the methods of image broadcast (publicize, communitize, share) until ^^ is done
2. As soon as we get the OSS{AN} published (which I think could happen next week), open this bug and use the PTG to discuss a long-term solution in public with any operators who care to attend. Since Red Hat in particular has an interest in OpenStack + Ceph configurations, we can reach out to some RH product managers who can attend and provide input, or who will be able to get the word out to some large operators, who will hopefully provide direct input. There are some big tradeoffs here that we can't assess on our own. Right now, everything is aimed toward speed, and we need help assessing how much of a slowdown people are willing to accept (if any), and under what circumstances.
3. I personally have never liked this non-checksummed image creation and consumption, but it's what operators have been willing to accept for performance. What I particularly don't like is that the current situation makes it *impossible* under some configurations of nova/glance/cinder to guarantee a verification chain for an image. If you don't use Ceph or the cinder glance_store, you are guaranteed a hash check of sha512 (or stronger, if the operator has configured it) at the point when the image is consumed. (IMO, this is just as strong as image signature verification [0], with none of the hassle for end users.) But this isn't available for some configurations, and maybe that's OK; it's an operator (and their customers, who can vote with their feet) choice. But maybe not everyone is aware of this choice (which will hopefully be addressed by item 1 above). Note that this is independent of the exploit discussed by this bug, which is malicious image provision via manipulation of glance's location record. Even if we re-do locations so that only nova can set them, there's still the issue of image data substitution in the backend without modifying the location uri recorded in glance.
4. I think an acceptable compromise would be to rely on image signature verification for deployment configurations that allow non-checksummed images. This is not an immediate solution because signature verification is not supported in those configurations where it would be really useful. It imposes a speed penalty, but it's also right in your face that you are making a speed/security tradeoff, because you (the end user) are adding a bunch of image metadata specifically for this purpose.
Additionally, the image-signature
Jeremy Stanley (fungi) wrote : | #14 |
If there won't be any patches accompanying the publication, then it would be an OSSN, but yes that plan sounds fine to me.
My grasp of the topic is, unfortunately, not solid enough to draft the explanation and operator guidance, so I'd be looking to someone with that background to write the relevant bits of prose. It can just be pasted into a comment on this bug initially, if folks are worried about this topic becoming more public before they have a chance to review it for accuracy.
Dan Smith (danms) wrote : | #15 |
> @Dan we do have status on the locations table and we do update it. I don't think we consume it
> anywhere currently but it's there.
Okay, fair enough. I guess we could make nova look at that, but I'm not sure what we'd do really. Maybe only consider the non-hashed ones, and potentially download the image and duplicate it in RBD for boots that complete before a new location is available. Could work, but it's also probably confusing for people who wonder why some of their instances aren't doing COW from the base image. Some thinking on how best to handle that is probably required.
> Don't we also have the error metadata key where we already report errors from the import tasks,
> maybe we can reuse that?
Yup, and I suppose we could, or another similar one for locations specifically. But if we've got status on a location, that is probably enough.
> Plus, if the hash isn't going to be checked by nova when the image is consumed, recomputing it
> for each location seems kind of pointless
Making nova check the hash before the boot is definitely counterproductive to the goal of fast COW boots. However, I think (one of) the benefit(s) of what Erno is proposing is that glance checks it once for the benefit of everyone and then we can trust it going forward. I think it would generate an intolerable amount of glance CPU and network traffic, but it would certainly be less than every nova checking it every time.
I agree with Brian's #4 and Erno's @All above. The most important thing is getting the tribal knowledge of how glance should be deployed out to the wider audience. Other things we can do between glance, nova and cinder to make this better in the future will take time, and likely have drawbacks that make them undesirable and/or hard to backport. Even if we implement some better hashing and verification, I suspect a lot of people would prefer to deploy two glance APIs, rely on host-level security for who can add locations, and trust that images created from behind the firewall are what they say they are.
Erno Kuvaja (jokke) wrote : | #16 |
I think the asynchronous hash calculation can be approached 2 ways:
1) We don't let the image go to "active" before the multihash has been calculated. This would indeed break anything that currently depends on the snapshot being immediately available, and would slow things down significantly.
2) We let the new image behave just like it does now, but we also kick off the task to calculate the checksum and populate that to the image once it's done.
Then any additional location added via the API would be deferred until its checksum has been verified. This is no extra work when the 'copy-image' import method is used, as it does this already, and at least I do not know of any current usage where additional locations would be added to the image record "manually", apart from the said malicious activity.
As the COW consumers are not interested in checking the hash anyway, I'd be in favour of #2, which would not block the current legitimate usage patterns but would give both the user and Glance a way to verify that the data is intact. Should consumers ever change their attitude on this, they could have logic that starts calculating their own hash and does the comparison once Glance has updated the record in its database.
Brian Rosmaita (brian-rosmaita) wrote : | #17 |
@Jeremy: I'll post a draft OSSN later today as a comment in this bug.
affects: | ossa → ossn |
Brian Rosmaita (brian-rosmaita) wrote : | #18 |
Still working on the draft. In the meantime, I found one more place documenting this issue (leaving it here so we remember to revise it later):
The help text for 'show_multiple_locations' in the sample glance-api.conf:
# DEPRECATED:
# Show all image locations when returning an image.
#
# This configuration option indicates whether to show all the image
# locations when returning image details to the user. When multiple
# image locations exist for an image, the locations are ordered based
# on the location strategy indicated by the configuration opt
# ``location_strategy``. The image locations are shown under the
# image property ``locations``.
#
# NOTES:
# * Revealing image locations can present a GRAVE SECURITY RISK as
# image locations can sometimes include credentials. Hence, this
# is set to ``False`` by default. Set this to ``True`` with
# EXTREME CAUTION and ONLY IF you know what you are doing!
# * See https:/
# information.
# * If an operator wishes to avoid showing any image location(s)
# to the user, then both this option and
# ``show_image_direct_url`` need to be set to ``False``.
#
# Possible values:
# * True
# * False
#
# Related options:
# * show_image_direct_url
# * location_strategy
#
# (boolean value)
# This option is deprecated for removal since Newton.
# Its value may be silently ignored in the future.
# Reason: Use of this option, deprecated since Newton, is a security risk and
# will be removed once we figure out a way to satisfy those use cases that
# currently require it. An earlier announcement that the same functionality can
# be achieved with greater granularity by using policies is incorrect. You
# cannot work around this option via policy configuration at the present time,
# though that is the direction we believe the fix will take. Please keep an eye
# on the Glance release notes to stay up to date on progress in addressing this
# issue.
#show_multiple_locations = false
Also, here's the text for 'show_image_direct_url':
#
# Show direct image location when returning an image.
#
# This configuration option indicates whether to show the direct image
# location when returning image details to the user. The direct image
# location is where the image data is stored in backend storage. This
# image location is shown under the image property ``direct_url``.
#
# When multiple image locations exist for an image, the best location
# is displayed based on the location strategy indicated by the
# configuration option ``location_strategy``.
#
# NOTES:
# * Revealing image locations can present a GRAVE SECURITY RISK as
# image locations can sometimes include credentials. Hence, this
# is set to ``False`` by default. Set this to ``True`` with
# EXTREME CAUTION and ONLY IF you know what you are doing!
# * If an operator wishes to avoid showing any image location(s)
# to the user, then both this option and
# ``show_multiple_locations`` need to be set to ``False``.
#
# Possible values:
# * True
# * False
#
# Related options:
# * show_multiple_locations
# * location_stra...
Brian Rosmaita (brian-rosmaita) wrote : | #19 |
Here's a draft OSSN:
Best practices when configuring Glance with COW backends
---
### Summary ###
When deploying Glance in a popular configuration where Glance shares a common
storage backend with Nova and/or Cinder, it is possible to open some known
attack vectors by which malicious data modification can occur. This note
reviews the known issues and suggests a Glance deployment configuration that
can mitigate such attacks.
### Affected Services / Software ###
Glance, all releases
Nova, all releases
Cinder, all releases
### Discussion ###
This note applies to you if you are operating an end-user-facing
glance-api service with the 'show_multiple_locations' option set to True
(the default value is False) or if your end-user-facing glance-api has
the 'show_image_direct_url' option set to True.
Your exposure is less if you have *only* 'show_image_direct_url' enabled,
but the deployment configuration suggested below is recommended for your
case as well.
The attack vector was originally outlined in OSSN-0065 [0], though that
note was not clear about the attack surface or mitigation, and contained
some forward-looking statements that were not fulfilled. (Though it
does contain a useful discussion of image visibility and its associated
policy settings.)
The subject of OSSN-0065 is "Users of Glance may be able to replace
active image data", but it suggests that this is only an issue when
users do not checksum their image data. It neglects the fact that in
some popular deployment configurations in which Nova creates a root disk
snapshot, data is never uploaded to Glance, but instead a snapshot is
created directly in the backend and Nova creates a Glance image record
with size 0 and no os_hash_value [1], making it impossible to compare
the hash of downloaded image data to the value maintained by Glance.
Further, when Nova is so configured, Nova efficiently creates a root
disk directly in the backend without checksumming the image data (which
is not necessarily a flaw, it's the whole point of this configuration).
Similarly, when using a shared backend, or a cinder glance_store, Cinder
will efficiently clone a volume created from an image directly in the
backend without checksumming the image data.
The attack vector is the one outlined by OSSN-0065, namely:
[A] malicious user could create an image in Glance, set an additional
location on that image pointing to an altered image, then delete the
original location, so that consumers of the original image would
unwittingly be using the malicious image. Note, however, that this
attack vector cannot change the original image's checksum, and it is
limited to images that are owned by the attacker.
OSSN-0065 suggested that this attack vector could be addressed by using
policies, but that turned out not to be the case. The only way currently
to close this vector is to deploy an internal-only glance-api, used
by Nova and Cinder, with show_multiple_locations enabled, and an
end-user-facing glance-api with show_multiple_locations disabled.
This was suggested in "Known Issues" in Glance release notes in the
Rocky [2] through Ussuri releases, but it seems tha...
Dan Smith (danms) wrote : | #20 |
> but instead a snapshot is
> created directly in the backend and Nova creates a Glance image record
> with size 0 and no os_hash_value [1
I think it's important to call out that even if you have an image that was uploaded and a hash was calculated, someone could *later* change the data in the backend. Since nova doesn't (and can't really without a lot of extra work) know that the hash doesn't match the image it's about to fast clone, the hash might look like it's there, you know it *was* correct, but nova will not check it to see that it no longer matches.
> A glance-api service with 'show_multiple_locations' enabled should
> *never* be exposed directly to end users. This setting should only
> be enabled on an internal-only glance-api service used by other
> OpenStack services that require access to image locations.
I wonder if we should be more specific about "run two glance-apis with different config and use the public/internal endpoint types in keystone to differentiate; also make sure the internal one is not accessible to the users (i.e. firewalled)". You imply it with "never exposed to users" but...
Brian Rosmaita (brian-rosmaita) wrote : | #21 |
Revisions to comment #19:
### Affected Services / Software ###
-Glance, all releases
-Nova, all releases
-Cinder, all releases
+Glance, all supported and extended-maintenance releases
### Discussion ###
This note applies to you if you are operating an end-user-facing
@@ -67,19 +67,35 @@ is disabled in Glance, it is not possible to manipulate the locations
via the OpenStack Images API. It is worth mentioning, however, that
enabling 'show_image_direct_url' (which is needed to allow OpenStack
services to consume images directly from the storage backend) leaks
-information about the backend to end users, which is never a good thing
-from a security point of view. We therefore recommend that OpenStack
-services that require exposure of the 'direct_url' image property
-be similarly configured to use an internal-only glance-api endpoint.
-(End users who wish to download an image do not require access to the
-direct_url image property because they can simply use the image data
-download API call [3].)
+information about the backend to end users. What exactly that
+information consists of depends upon the backend in use and how it is
+configured, but in general, the safest course of action is not to expose
+it at all. Keep in mind that in any Glance/Nova/Cinder configuration
+where Nova and/or Cinder do copy-on-write directly in the image store,
+image data transfer takes place outside Glance's image data download
+path, and hence the os_hash_value is *not* checked. Thus, if the
+backend store is compromised, and image data is replaced directly in the
+backend, the substitution will *not* be detected. That's why it is
+important not to give malicious actors unnecessary hints about the image
+storage backend.
+
+We therefore recommend that OpenStack services that require exposure of
+the 'direct_url' image property be similarly configured to use an
+internal-only glance-api endpoint. (End users who wish to download an
+image do not require access to the direct_url image property because
+they can simply use the image-data-download API call [3].)
### Recommended Actions ###
A glance-api service with 'show_multiple_locations' enabled should
-*never* be exposed directly to end users. This setting should only
-be enabled on an internal-only glance-api service used by other
-OpenStack services that require access to image locations.
+*never* be exposed directly to end users. This setting should only be
+enabled on an internal-only glance-api service used by other OpenStack
+services that require access to image locations. This could be done,
+for example, by running two glance-api services with different
+configuration files and using the appropriate configuration options for
+each service to specify the Image API endpoint to access, and making
+sure the special internal endpoint is firewalled in such a way that only
+the appropriate OpenStack services can contact it.
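For illustration, a sketch of the two-service layout described in the Recommended Actions; the glance options are the ones discussed in this note, while the file paths and nova's [glance] endpoint_override (a keystoneauth adapter option) are illustrative and should be verified against your release:

```ini
# External, end-user-facing glance-api
# (e.g. /etc/glance/glance-api-external.conf):
[DEFAULT]
show_multiple_locations = False
show_image_direct_url = False

# Internal glance-api, firewalled so only OpenStack services reach it
# (e.g. /etc/glance/glance-api-internal.conf):
[DEFAULT]
show_multiple_locations = True
show_image_direct_url = True

# nova.conf: point Nova at the internal Image API endpoint; Cinder has
# an analogous [glance] section.
[glance]
endpoint_override = http://glance-internal.example.com:9292
```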
Brian Rosmaita (brian-rosmaita) wrote : | #22 |
About the Recommended Actions change: It's a bit more specific, but doesn't mention using the Keystone public/internal endpoint type for this purpose because the keystone docs [0] describe the 'internal' type as accessible to end users, and operators may already be using it in that way. I think the way to go is to use the Nova [glance]
[0] https:/
Erno Kuvaja (jokke) wrote : | #23 |
@Brian ok, let me make my stance on this very clear so it's on file and we can agree to disagree. I do think we have the possibility to mitigate the issue on the Glance side as well as on the deployment front, and thus I do not agree with your rushed exposure of the details on this.
Obviously, should you keep discussing this outside of the embargo and "agitate" it to public like you put it, we will probably be on the wrong side of that. If we decide to ignore this issue on the Glance side, we should at least give the courtesy of a heads up to TripleO and OSA (are there other deployment projects under the umbrella?) before throwing them and all their users under the bus.
Brian Rosmaita (brian-rosmaita) wrote : | #24 |
@Erno: I am fine with holding this if you think we can have a resolution before 19 December, I guess. My concern is that this is an obvious attack vector -- all the code is available, and anyone scanning the config file sees "GRAVE SECURITY RISK" associated with the settings for show_multiple_locations and show_image_direct_url.
I'm not clear on what holding this gets us. The COW glance configuration is popular for space and time optimization, and I'm not sure what operators will accept. I really don't see the point of computing missing hash values if they're not being checked at the point of image data consumption, and that's exactly what operators don't want.
Anyway, let's continue to discuss this, being specific about the glance-side changes that would mitigate this. If we can fix and backport a good solution, I'm all for keeping this private while we get that done, though I really don't see the point of the privacy, because I think the exploit is already known.
Erno Kuvaja (jokke) wrote : | #25 |
Like I explained before in my comments, it would provide us 2 things:
No-one (say, even someone with the OpenStack admin role who has access to the internal endpoint) could swap the data through the locations API; they would need write access to the actual storage to do so. With Ceph, which is clearly our biggest worry here, the location points to a snapshot that is read-only, and modifying it would again require a location update on the image, closing even that vector.
The users would have a mechanism to verify the image data (by downloading it, for example with glanceclient), regardless of whether their preferred consumption method does that as part of the deployment workflow. This would be available for all images available to them, be they public, community, or shared.
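For illustration, a minimal sketch of that user-side verification against the recorded multihash (endpoint, token, and image ID are placeholders):

```python
# Sketch: verify downloaded image data against the recorded multihash.
# GLANCE, TOKEN, and IMAGE_ID are placeholders.
import hashlib
import requests

GLANCE = "http://glance.example.com:9292"
IMAGE_ID = "11111111-2222-3333-4444-555555555555"
HEADERS = {"X-Auth-Token": "TOKEN"}

# Fetch the image record, which carries os_hash_algo/os_hash_value.
meta = requests.get(f"{GLANCE}/v2/images/{IMAGE_ID}", headers=HEADERS).json()
algo, expected = meta["os_hash_algo"], meta["os_hash_value"]

# Stream the image data and hash it as it downloads.
h = hashlib.new(algo)
with requests.get(f"{GLANCE}/v2/images/{IMAGE_ID}/file",
                  headers=HEADERS, stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        h.update(chunk)

print("data intact" if h.hexdigest() == expected else "DATA MODIFIED")
```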
Like I said earlier, we made very poor assumptions when we expected this issue to just go away through deployment-model advice during OSSN-0065, and lots of those assumptions relied on the image having a checksum and that checksum being verified. We could at least make sure the hash is calculated, and verify it during all of our own operations; we can do that without breaking the current legitimate use cases.
My main concern is that we effectively reissue OSSN-0065 and nothing changes, as we sweep the security holes under the carpet again because it's convenient to push the responsibility onto the operators.
Erno Kuvaja (jokke) wrote : | #26 |
Oh, and just to be clear, I reported this as private as it's trivial to turn the discussion public once we are sure the discussion is not exposing vulnerabilities.
So I'm all for making this public if we are 100% sure there is nothing publicly new in this. Which I'm not convinced of, at least not yet.
Brian Rosmaita (brian-rosmaita) wrote : | #27 |
OK, I think I am confusing two issues here:
1. The image-location-manipulation exploit (adding a malicious location to an image and deleting the original).
2. Backend data substitution, i.e. the data being modified in the storage backend itself:
time t: end user requests nova createImage action
time t+1: glance posts os_hash_value
time t+2: end user downloads image and computes hash, OK
time t+3: end user requests nova to boot an instance from the image
If nova doesn't check the hash before booting the image, which it doesn't in the COW configuration, then how does the end user know that the image data hasn't been modified between t+2 and t+3? This exploit is facilitated by exposing the direct_url or locations on images to end users, so having an internal-only glance-api deployment helps with issue #1.
But it doesn't solve the larger issue of backend data substitution.
So, if we can get #1 done quickly in a way that doesn't kill the performance of hyperconverged infrastructure, then I am OK with keeping this private. I think that #2 is a real problem, however, that could use some discussion at the PTG. My question is whether we think the backend-data-substitution issue needs to be kept private as well.
Erno Kuvaja (jokke) wrote : | #28 |
@Brian, for your #2 I think cinder is still an issue. The image data in Ceph consists of the image "object/file", if you wish, named with the image ID, and a read-only snapshot of it called 'snap'. The location of an RBD image in glance points to that snap of the object. So a malicious user would need to replace that snap to be able to change the image data, which is not possible if there are other references to that snap (say, an already running COW VM of it). One can modify the image data object and create a new snapshot of it, but that would require an update to the database, which solving #1 would prevent.
Not bullet-proof for every corner case, but heavily resistant compared to our present situation.
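For reference, a Glance RBD location URI has the general form below; the final component is the snapshot name ('snap' by Glance convention), and this is what allows a location to be matched against an image ID:

```
rbd://<ceph-fsid>/<pool>/<image-id>/snap
```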
Dan Smith (danms) wrote : | #29 |
I think we really need to have anything like #1 available for all the supported branches if we're going to hold this up for that. I share Brian's concern on that being available in a timely manner. But I also think that it's not a reasonable resolution to the core problem because people using COW boots and snapshots are doing so specifically to *avoid* the need to do long and expensive operations there.
I think that the original OSSN did not clearly prescribe the way out of the box for this and as such we shouldn't use the lack of deployments using two endpoints as a gauge for whether or not people or deployment tools are aware of it. This originally got raised downstream when we were talking to deployment people and specifically asking about a split API horizon for this reason. They had no idea it was needed.
So again I'd say I think the far greater good is getting the information on how to mitigate this for all deployments out to the people. Changes to allow for tighter hashing controls in glance are good, but they're not going to be an acceptable solution for most of the affected users, I think. Deploying a second set of glance workers trades a little memory, which is a lot less expensive than the time and CPU load required for the hashing option.
Just MHO!
Pranali Deore (pranali-deore) wrote : | #30 |
Let's have a call to discuss this in terms of when to make it public and what to cover in the OSSN.
I'm adding some time slots options below, please let me know your availability,
Monday, 10th OCT - 14:00 UTC - 14:45 UTC?
Tuesday, 11th OCT - 14:00 UTC - 14:45 UTC?
I think 45 mins would be enough but we can stretch it if required or we can conclude early as well.
Jeremy Stanley (fungi) wrote : | #31 |
I'm free for all of those except 15z on Tuesday, thanks!
Dan Smith (danms) wrote : | #32 |
I can make any of those work.
Abhishek Kekane (abhishek-kekane) wrote : | #33 |
Monday 10th will be good for me; in case needed, I can adjust for the 11th. Thanks!
Brian Rosmaita (brian-rosmaita) wrote : | #34 |
Monday is best for me.
Pranali Deore (pranali-deore) wrote : | #35 |
Thanks everyone !! I've scheduled the meeting today, 10th OCT at 1400 UTC.
Erno Kuvaja (jokke) wrote : | #36 |
Jeremy Stanley (fungi) wrote : | #37 |
To summarize my takeaway from the call, the risk of exploit in basically all cases boils down to some trusted account "going rogue" and substituting a malicious image (perhaps after validation by the consumer), with their actions going entirely unnoticed. The currently proposed patch represents a new feature in Glance of the level that would normally require a formal specification and trigger broad discussion around API behavior changes and potential performance regressions. I don't think the risks presented outweigh the need for public design process around the proposed feature, so I'm recommending we switch this bug to public once the participants here are comfortable with the drafted guidance to operators, and then proceed with the code changes in public review where it can be better scrutinized and more thoroughly tested.
Brian Rosmaita (brian-rosmaita) wrote : | #38 |
- Attachment: Latest version of the OSSN (6 Oct 2022) (8.2 KiB, text/plain)
Erno Kuvaja (jokke) wrote : | #39 |
+As this addresses a known issue, it is not an embargoed note concerning
+a zero-day exploit. If, however, you are learning about this for the
+first time, and you are exposing image locations to end users, it is
+possible to limit the scope of the exploit described herein immediately
+by restricting Glance policies related to image sharing:
+
+- "publicize_image" governs the ability to make an image available
+ to all users in a cloud, and such images appear in the default
+ image-list response for all users. It is restricted by default
+ to be admin-only.
+
+- "communitize_image" governs the ability to make an image available
+ to all users, though it does not appear in the default image-list
+ response for all users. The default configuration allows any
+ image owner to do this.
+
+- "add_member" governs the ability to share an image with particular
+ other projects. The default configuration allows any image owner
+ to do this.
+
+Restricting these to admin-only would limit the exploit to a single
+project, but given that it still allows for a disgruntled user to
+maliciously modify images within that project, it is not recommended
+as a long term solution.
I would not include this section. It gives a false sense of security, while it does not prevent using already shared, community, or public images through this vector.
Erno Kuvaja (jokke) wrote : | #40 |
+OSSN-0065 suggested that this attack vector could be addressed by using
+policies, but that turned out not to be the case. The only way currently
+to close this vector is to deploy an internal-only glance-api, used
+by Nova and Cinder, with show_multiple_locations enabled, and an
+end-user-facing glance-api with show_multiple_locations disabled.
"The only way currently mitigate this vector is to deploy" The dual deployment does not close the attack vector, just limits it from external users. Without patching the gapi service code the only way to close this vector is to not enable "show_image_
Brian Rosmaita (brian-rosmaita) wrote : | #41 |
@Erno #39:
I see how this could be misleading. Instead of removing it completely, since this is a "best practices" doc, how about I rephrase it as a reminder of how malicious images can be spread to other users (independently of this exploit) ... or do you think that's already clear from our current documentation? (I don't have a problem with removing it completely.)
Also, is it worth reminding operators about image deactivation?
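For reference, image deactivation can take a suspect image out of circulation while it is inspected; with python-openstackclient, for example:

```console
$ openstack image set --deactivate <image-id>   # blocks non-admin download/use
$ openstack image set --activate <image-id>     # restore once verified
```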
Brian Rosmaita (brian-rosmaita) wrote : | #42 |
@Erno #40:
I have no problem rephrasing as you suggest.
Erno Kuvaja (jokke) wrote : | #43 |
@Brian
I think repeating the same info/mistakes we made with OSSN-0065 is not beneficial. Thus I'd like to avoid going that route and just refer to OSSN-0065 for the previous conversation. Hence my take to simply remove the section copied from there.
I also realized that the start of the Discussion section gives the impression that this is vulnerable only when the nodes that have show_multiple_locations enabled are public, but that's not the case. Fixing the deployments does not fix the issue, it just limits its accessibility from the public.
+This note applies to you if you are operating an end-user-facing
+glance-api service with the 'show_multiple_locations' option set to True
+(the default value is False) or if your end-user-facing glance-api has
+the 'show_image_direct_url' option set to True.
+Your exposure is less if you have *only* 'show_image_direct_url' enabled,
+but the deployment configuration suggested below is recommended for your
+case as well.
I'd change the first paragraph to something like:
This note applies to you if you are operating a glance-api service with
the 'show_multiple_locations' option set to True (the default value is
False) or if your end-user-facing glance-api has the
'show_image_direct_url' option set to True. Your exposure is less if you
have *only* 'show_image_direct_url' enabled, or if the glance-api that
has 'show_multiple_locations' enabled is internal service facing only,
but the deployment configuration suggested below is recommended for your
case as well.
Brian Rosmaita (brian-rosmaita) wrote : | #44 |
@Erno #43
I agree that the visibility stuff just confuses the issue. Based on your comments, I think I should restructure the entire note along these lines:
1. If you're using a COW backend configuration, you should deploy dual glances (I probably won't use that term, but you know what I mean).
2. The COW backend efficiency/security tradeoff.
3. What we mean by "dual glances" <-- with reference to the nova/cinder config options.
4. Why: show_multiple_locations
5. Why: show_image_direct_url
@Everyone:
I won't get started on this until around 1800 UTC today, so if you have comments before then, please leave them!
Erno Kuvaja (jokke) wrote : | #45 |
@Brian ref #44
That sounds like a plan.
Brian Rosmaita (brian-rosmaita) wrote : | #46 |
- Attachment: Rewritten version of the OSSN (12 Oct 2022) (8.6 KiB, text/plain)
Brian Rosmaita (brian-rosmaita) wrote : | #47 |
Added Jay and Julia to the bug since we've already decided it should be worked on in public, and just in case there's any ironic-specific info that should be added to the OSSN.
Dan Smith (danms) wrote : | #48 |
Brian, the latest version (12 Oct 2022) looks great to me, thanks!
Julia Kreger (juliaashleykreger) wrote : | #49 |
The latest also looks good to me. Thanks!
Jay Faulkner (jason-oldos) wrote : | #50 |
+1 to the doc, thanks Brian!
Abhishek Kekane (abhishek-kekane) wrote : | #51 |
+1 from me as well, thank you Brian!!
Pranali Deore (pranali-deore) wrote : | #52 |
+1 to the doc from me as well, Thanks !
I think we should go ahead and make it public as everyone agrees.
Jeremy Stanley (fungi) wrote : | #53 |
It looks like the revision attached to comment #46 addresses the points Erno raised, and has received consensus among other reviewers subscribed. In order not to further delay publication and make discussion of forward progress at the PTG harder, let's proceed with publication (even though I wouldn't normally recommend that on a Friday, the impact for this shouldn't pose a significant problem for our community).
Brian: Please push the final draft to https:/
description: | updated |
Changed in ossn: | |
status: | Incomplete → In Progress |
assignee: | nobody → Brian Rosmaita (brian-rosmaita) |
information type: | Private Security → Public |
tags: | added: security |
summary: | Malicious image data modification can happen when using COW → OSSN-090: Malicious image data modification can happen when using COW |
summary: | OSSN-090: Malicious image data modification can happen when using COW → OSSN-0090: Malicious image data modification can happen when using COW |
Brian Rosmaita (brian-rosmaita) wrote : | #54 |
Pushed the OSSN as:
https:/
Jeremy Stanley (fungi) wrote : | #55 |
Please also remember to send an OpenPGP-signed copy to the openstack-discuss and openstack-announce mailing lists (I'll expedite moderator approval through the latter).
information type: | Public → Public Security |
Nick Tait (nickthetait) wrote : | #56 |
I have a usability improvement idea: as OSSNs are designed for operators (as opposed to OpenStack developers) I would recommend replacing the acronym COW with the full name "Copy On Write." But I am not sure if such a tweak would be possible since it has already been released.
Jeremy Stanley (fungi) wrote : | #57 |
It's already been revised once since publication by https:/
Nick Tait (nickthetait) wrote : | #58 |
I agree a re-announcement is not needed. I've got a local commit ready to publish, but it's been ages since I've submitted a change to OpenDev... could someone point me toward some docs that tell me what my next step is?
Brian Rosmaita (brian-rosmaita) wrote : | #59 |
@Nick: this is probably more basic than you need, but it contains some links that may be helpful:
https:/
Nick Tait (nickthetait) wrote : | #60 |
Brian, that was a useful reminder, but I ultimately gave up on trying to submit it. Anyway, I did reserve CVE-2022-4134 to track this issue.
I've added Pranali to the bug as the new Glance PTL and she is aware of the issue.