OSSN-0090: Malicious image data modification can happen when using COW
- Series yoga
- Bug #1990157
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Glance | Status tracked in Zed | | |
Wallaby | New | Undecided | Unassigned |
Xena | New | Undecided | Unassigned |
Yoga | New | Undecided | Unassigned |
Zed | New | Critical | Unassigned |
OpenStack Security Notes | In Progress | Undecided | Brian Rosmaita |
Bug Description
The image location manipulation mentioned in https:/ was addressed at the time; the fix ensured that the hash stayed in the image record, causing validation to fail if the image data itself had changed.
Different efforts to speed up the boot and volume-creation processes by utilizing the Copy On Write behaviour of Ceph or various Cinder backends are creating two different scenarios in which malicious content can be sneaked into an image.
1) When Nova creates a snapshot directly in the Ceph store and creates the image through the API by adding the location via the locations API, rather than uploading the data to Glance, it omits two pieces of metadata that would allow alteration of the image data to be detected: the image has no multihash (or checksum) associated with it, making validation impossible, and it has no size metadata either.
2) When an image is consumed in a COW manner, even if it has a multihash (checksum) registered in its metadata, the data does not get validated, because the consumer does not read the whole image and calculate a checksum of it.
All current implementations of COW handling of images depend on the direct_url and locations API being exposed. Because the services access the image with user credentials, if Glance is deployed with a single API configuration serving both OpenStack services and end users, a malicious end user has all the tools needed to carry out this attack. The only real mitigation for this issue is to deploy an external API endpoint (for user access) and an internal API endpoint (for OpenStack services; note that this endpoint needs to be firewalled off from end-user access, as the credentials are the same). Additional hardening, creating the multihash metadata entries and validating them upon use, should be implemented. The dual API deployment should be highlighted clearly in the documentation.
These two behaviours mean that the image manipulation described in OSSA-2016-006 (CVE-2016-0757) is still possible.
If the image has a multihash (checksum) recorded for it, python-glanceclient will reject the image if the data does not match. But discovering the tampering requires manual verification (actually downloading the image), or a deep understanding of the technical implementation to match the location URI against the image ID (in the Ceph case). The COW consumers will not flag it to anyone and will just happily consume the modified data.
In the case that there is no multihash recorded for the image, the only indication of malicious activity would be comparing the location URI with the image ID (in the Ceph case); there are no other validation channels.
Once the location of the modified image data has been added to the image's locations table, Glance will allow deleting the original data, as it is no longer the last remaining location.
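For illustration, here is a minimal sketch of the location swap described above, using the Images v2 JSON-patch calls; the token, endpoint, image IDs, and the rbd URL are placeholders, and the calls only succeed against an endpoint that has show_multiple_locations enabled:

```python
# Hypothetical sketch of the location-swap attack, for a deployment that
# (incorrectly) exposes show_multiple_locations to end users.
# GLANCE, TOKEN, and the IDs below are placeholders.
import json
import requests

GLANCE = "http://glance.example.com:9292"
IMAGE_ID = "11111111-2222-3333-4444-555555555555"
HEADERS = {
    "X-Auth-Token": "TOKEN",
    # JSON-patch media type required by the Images v2 image-update call
    "Content-Type": "application/openstack-images-v2.1-json-patch",
}

# 1. Append a location pointing at attacker-controlled data.
add_patch = [{
    "op": "add",
    "path": "/locations/-",
    "value": {"url": "rbd://FSID/pool/MALICIOUS-IMAGE-ID/snap",
              "metadata": {}},
}]
requests.patch(f"{GLANCE}/v2/images/{IMAGE_ID}",
               data=json.dumps(add_patch), headers=HEADERS)

# 2. Remove the original location; Glance allows this because it is no
#    longer the image's last remaining location.
del_patch = [{"op": "remove", "path": "/locations/0"}]
requests.patch(f"{GLANCE}/v2/images/{IMAGE_ID}",
               data=json.dumps(del_patch), headers=HEADERS)
```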
Erno Kuvaja (jokke) wrote : | #1 |
Jeremy Stanley (fungi) wrote : | #2 |
Since this report concerns a possible security risk, an incomplete
security advisory task has been added while the core security
reviewers for the affected project or projects confirm the bug and
discuss the scope of any vulnerability along with potential
solutions.
description: | updated |
Changed in ossa: | |
status: | New → Incomplete |
Jeremy Stanley (fungi) wrote : | #3 |
Was this way of triggering the defective behavior present even at the time OSSA 2016-006 was published and nobody realized it, or did subsequent changes to Glance or other projects re-expose the vulnerability some time later?
Brian Rosmaita (brian-rosmaita) wrote : | #4 |
@Jeremy: I think it's fair to say that this issue was present at the time of OSSA 2016-006 but nobody realized it ... that's because after the initial bug was fixed, a subsequent bug was filed that did identify the issue. Here's the history of this problem:
1. "[OSSA 2016-006] Normal user can change image status if show_multiple_
https:/
what: create image, upload data (checksum is set), delete all locations, image goes to 'queued' status, can upload malicious data via the API in the normal way
- checksum can't be modified (although checksum is md5 until Rocky)
- can only do it with an image you own (or if you are admin)
- fix: don't allow the image location deletion if the result would be no locations
2. "Normal user can replace active image data if show_multiple_
https:/
what: create image, upload data (checksum is set), add a new location pointing to malicious data, delete all non-malicious locations
- checksum can't be modified (although checksum is md5 until Rocky)
- can only do it with an image you own (or if you are admin)
- fix: OSSN-0065 recommends not configuring glance with show_multiple_locations enabled
3. In Rocky (glance 17.0.0, Tagged on 2018-08-30 14:06:53 +0000)
- introduction of glance "multihash" ... like legacy "checksum", os_hash_value is set when image data is uploaded
- if the "multihash" is checked on download, data substitution from Bug #1549483 will be revealed
4. In Rocky (glance 17.0.1, Tagged on 2020-03-19 12:26:00 +0000)
- https:/
- Known Issue: "The workaround is to continue to use the show_multiple_locations option ..."
5. "Malicious image data modification can happen when using COW"
https:/
what: create an image, upload data ("multihash" is set), boot a server from the image, ask nova to create an image from the server. This image will not have "multihash" info set. Then do the attack from Bug #1549483, probably by uploading a malicious image to glance (to get the data into the backend); then add the location to the un-checksummed image and delete the original location from the un-checksummed image
- can't mitigate with a "multihash" check (because there isn't one)
- can only do it with an image you own (or if you are admin)
Erno Kuvaja (jokke) wrote (last edit ): | #5 |
@Jeremy & @Brian I'm not convinced, but can't say for sure, that this was existing when the original bug was handled in 2016. Lots of the vectors that were problematic at the time got plugged.
The problem really is that since then we've introduced "Community" visibility, the Cinder COW paths, and I think the Nova direct snapshotting, all of which greatly expand the old exposure. Especially as the COW operations do require 'show_multiple_locations' to be enabled.
Basically all deployments with Ceph are vulnerable, and users shrug the warnings off "as it's the default with Ceph", while these COW models never even tried to address the elephant in the room.
The multihash would help to identify the exploitation if it were present in all images, but like I said, that is not the case with direct snapshotting. (One of the reasons why I wanted to bring this up as a new bug.) The other part which is very worrying is that the COW-style consumers do not check the hash even when it is present, as mentioned in my bug description; this might have been overlooked during the original OSSA-2016-006, but I'm not sure it was even implemented yet at the time.
While the recommendation (Brian #4) landed in the Rocky release notes, it never made its way into any of the other documentation highlighting that these issues are actually present if the separation of gapi nodes has not been done. A good indicator of this confusion is that at least TripleO, DevStack, and I think OSA all deploy only one set of gapi to serve both end users and internal services.
So lots of the attack vectors did not exist when the mechanism was identified in 2016 and flagged as solved at the time.
Erno Kuvaja (jokke) wrote : | #6 |
There is a fairly simple solution for this too.
We could kick off an asynchronous task (all the piping is there, so only the taskflow and its triggering would need to be implemented) whenever a location is added to an image via the locations API, for Glance to go and read the data and calculate its multihash. If the image is new, like a Nova direct snapshot or an image created with the http store, we would add the multihash to the metadata (unlike what we do now). If the image already exists, we could validate the hash against the existing metadata.
It would not help in the COW cases where the hash is not verified upon consumption, but it would plug any easy way to replace the existing data with new data; one would need access to the actual storage.
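For illustration, a minimal sketch of the hash calculation such a task would perform, assuming sha512 (Glance's default os_hash_algo); the chunk source is a hypothetical stand-in for a glance_store read of the newly added location:

```python
# Sketch of the multihash computation the proposed asynchronous task
# would run after a location is added via the locations API.
import hashlib

CHUNK_SIZE = 64 * 1024  # read the image data in 64 KiB chunks

def compute_multihash(chunks, algo="sha512"):
    """Return (os_hash_algo, os_hash_value) for an iterable of byte chunks."""
    h = hashlib.new(algo)
    for chunk in chunks:
        h.update(chunk)
    return algo, h.hexdigest()

# If the image is new (e.g. a Nova direct snapshot), store the result as
# os_hash_algo/os_hash_value; if the image already has a multihash,
# compare against it and reject the new location on mismatch.
```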
Brian Rosmaita (brian-rosmaita) wrote : | #7 |
The point of my comment #4 was that this exploit is a fairly obvious implication of items 1-4 in that comment. Thus, I think there's no point in keeping this embargoed as a private security bug, and we should go ahead and make it public.
Jeremy Stanley (fungi) wrote : | #8 |
Thanks Brian. The other things to consider are:
1. If the bug is impractical for an attacker to successfully exploit, we should work on this in public.
2. If the solutions to the bug are not safely backportable to maintained stable branches, we should work on this in public.
3. If solutions to the bug are very likely to take longer than three months (our maximum embargo duration) to implement and/or would be very hard to develop in secret, we should work on this in public.
Dan Smith (danms) wrote : | #9 |
Erno, that's a good idea to re-hash the image when we add the location, but:
1. We'd probably want some way to not expose that location until it's verified, otherwise we've just shrunk the window. I don't think we have a "state" for each location today, do we?
2. It would probably be difficult to expose the result of a failure. Accidental tripping of the hash failure would result in locations just disappearing (I assume) which could be confusing.
3. We'd need some way to recover the process if we're interrupted in the middle of reading a very large image. If we had #1, then perhaps the location would never become usable, but we'd need to cover that case.
4. We'd end up with a very large additional load on glance-api that isn't there today, as all the (currently very lightweight) snapshotting of nova instances would cause a lot of additional CPU and network load to glance. Right now, you can snapshot your instances very frequently on ceph because it's so lightweight, and people do. On an edge deployment where you've got a local glance, that could end up overwhelming what is actually just a very lightweight set of glance workers for the (current) purpose of just finding images in RBD.
So, it might be worth doing this anyway, but I think we'd want to hide that behind a "verify_
Dan Smith (danms) wrote : | #10 |
Jeremy, the other thing to consider is that the real (current, immediate) fix to this is one of deployment and not so much a change to glance itself. So, OSA might want to make a change, which I guess could be covered under the embargo period, but I don't think we should expect a glance change to be released as a result of this.
Thus, publicizing it would let operators change their deployments (which many of them can probably do right away) to eliminate the concern sooner than later.
Brian Rosmaita (brian-rosmaita) wrote : | #11 |
The glance team discussed this extensively at a PTG a few years ago, and rejected background computation of the hash largely because the time/load are exactly what operators are trying to avoid by using a common ceph to back nova and glance (Dan's point 4).
Plus, if the hash isn't going to be checked by nova when the image is consumed, recomputing it for each location seems kind of pointless: all we would know is that the data at location L had hash H at the time glance checked and allowed L to be added to the image. If someone changes the data at L later (not the location URI, the actual data; it doesn't have to be malicious, it could just be some kind of failure in the backend), there's no way to know without downloading and hashing the image. So I don't think that a verify_
What could increase security would be to allow images that have the img_signature* properties on them to *always* go through the validation path. However, last time I looked, image signature verification is only available for the libvirt compute driver (not sure that's a big deal) and when NOT using the rbd image backend (which is the backend we're using here). But I think forcing the "normal" data path for signed images would not increase load too much (it's a PITA for users to set up an image that can be verified), and you would have the check done at the point of consumption, which is really where you want it. (This is the case for cinder, too; when using the cinder glance_store, the image-volume is cloned directly in the backend.)
Or, now that we have glance multi-store, an operator could use a second (non-rbd, non-cinder) store and tell end users to put all signed images into that store. (I think this would still require code changes in nova ... last I looked, if you turn verify_
In any case, I think we should offer some way to guarantee (for the users who want it) that a consumed image is verified at the point of use. But that's not directly related to this bug.
For this bug right now, I think the situation is:
- If you don't expose locations on any end-user-facing glance-api, end users cannot modify locations via the Images API.
- If you do expose locations to end users, end users can modify them, but only on their own images. So if you set the policies for publicize_image, communitize_image, and add_member to be admin-only, an end user cannot spread a malicious image outside of their own project.
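For illustration, the restriction described in the second bullet might look like this in a Glance policy file (a sketch only; policy file format and defaults vary by release):

```yaml
# Sketch of a glance policy file restricting image "broadcast"
# operations to admin, so a malicious image cannot spread beyond the
# attacker's own project.
"publicize_image": "role:admin"     # admin-only by default
"communitize_image": "role:admin"   # default allows any image owner
"add_member": "role:admin"          # default allows any image owner
```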
Erno Kuvaja (jokke) wrote : | #12 |
@Dan we do have status on the locations table and we do update it. I don't think we consume it anywhere currently, but it's there. Don't we also have the error metadata key where we already report errors from the import tasks? Maybe we can reuse that.
Honestly, I have not looked into the performance of the current hashing functions, so I cannot say yea or nay on whether the load would be significant at some snapshotting interval.
@Brian just to make clear, so we don't give a wrong exposure assumption here: I take it you are referring to the PTG discussion within the Glance security core at the time, which resulted in that release note in which we still advise the dual deployment model. This was not discussed in depth with any wider audience at the time.
IIRC the hashing operation at the time was expected to be synchronous, and we did not want to hold the client hostage on the location call while we do it.
In hindsight, our decision around (I think it was) OSSN-0065, when we decided on the blanket principle of "This covers all the future iterations of this location swap vulnerability", was wrong. It has become clear over the years that the release note gets ignored, we do not have the approach properly documented anywhere in the deployment guide, and the issue has clearly expanded in the field rather than narrowed. And at frequent intervals we get people asking why this issue is still not plugged (like this time, our RH Leadership Team).
I tend to agree with you that making yet another config option will not increase the security of the service. When the very first fix ("We've fixed this by not allowing deletion of the last image location") was released, we had no scenario in which an image would not have its hash in place, and that hash was supposed to catch this. And we cannot blame any consumer for not verifying the hash before we make sure that metadata is available. I think it is a key requirement for our promise of immutable images, whether the change is a malicious act, accidental corruption, etc.
I do think it is an unreasonable expectation for clouds to policy out community images because of this, when that was promoted as the user-friendly solution for not allowing everyone to publicize images. Just like it would be to turn off image sharing as well. Just because we can't be arsed to fix the underlying issues in our code.
@All Especially as any backend detail leakage is frowned upon by the community, I think we still need to reiterate the importance of deploying public and internal gapi services. But clearly that is not the right solution to the fact that it's fairly trivial to change the image payload, and hashless images make that impossible to detect. And the past has proven that deployments don't do the split anyway. When Ceph deployments are north of 60% of all OpenStack deployments, hashless images are by no means a corner case. If we plug that hole, the remaining decision is only "Do we care about exposing which backend stores are in use or not", not whether that risks the integrity of the images too.
Brian Rosmaita (brian-rosmaita) wrote : | #13 |
I guess that I'm not being clear about my position here. What I'm saying is:
1. This exploit is a straightforward implication of knowledge that has been discussed publicly (even if not a lot of people paid attention to it; my point is that it's "out there"). So I think it's important to publish an OSS{AN} (whatever the VMT thinks is appropriate) to clue in/remind operators that there is a known vulnerability that they can take action about immediately, to wit:
- deploy separate internal/external glance-api so that multiple locations are not shown to end users
- or, if that looks too destabilizing to do instantly in a deployment, restrict the methods of image broadcast (publicize, communitize, share) until ^^ is done
2. As soon as we get the OSS{AN} published (which I think could happen next week), open this bug and use the PTG to discuss a long-term solution in public with any operators who care to attend. Since Red Hat in particular has an interest in OpenStack + Ceph configurations, we can reach out to some RH product managers who can attend and provide input, or who will be able to get the word out to some large operators, who will hopefully provide direct input. There are some big tradeoffs here that we can't assess on our own. Right now, everything is aimed toward speed, and we need help assessing how much of a slowdown people are willing to accept (if any), and under what circumstances.
3. I personally have never liked this non-checksummed image creation and consumption, but it's what operators have been willing to accept for performance. What I particularly don't like is that the current situation makes it *impossible* under some configurations of nova/glance/cinder to guarantee a verification chain for an image. If you don't use Ceph or the cinder glance_store, you are guaranteed a hash check of sha512 (or stronger, if the operator has configured it) at the point when the image is consumed. (IMO, this is just as strong as image signature verification [0], with none of the hassle for end users.) But this isn't available for some configurations, and maybe that's OK; it's an operator (and their customers, who can vote with their feet) choice. But maybe not everyone is aware of this choice (which will hopefully be addressed by item 1 above). Note that this is independent of the exploit discussed by this bug, which is malicious image provision via manipulation of glance's location record. Even if we re-do locations so that only nova can set them, there's still the issue of image data substitution in the backend without modifying the location uri recorded in glance.
4. I think an acceptable compromise would be to rely on image signature verification for deployment configurations that allow non-checksummed images. This is not an immediate solution because signature verification is not supported in those configurations where it would be really useful. It imposes a speed penalty, but it's also right in your face that you are making a speed/security tradeoff, because you (the end user) are adding a bunch of image metadata specifically for this purpose.
Additionally, the image-signature
Jeremy Stanley (fungi) wrote : | #14 |
If there won't be any patches accompanying the publication, then it would be an OSSN, but yes that plan sounds fine to me.
My grasp of the topic is, unfortunately, not solid enough to draft the explanation and operator guidance, so I'd be looking to someone with that background to write the relevant bits of prose. It can just be pasted into a comment on this bug initially, if folks are worried about this topic becoming more public before they have a chance to review it for accuracy.
Dan Smith (danms) wrote : | #15 |
> @Dan we do have status on the locations table and we do update it. I don't think we consume it
> anywhere currently but it's there.
Okay, fair enough. I guess we could make nova look at that, but I'm not sure what we'd do really. Maybe only consider the non-hashed ones, and potentially download the image and duplicate it in RBD for boots that complete before a new location is available. Could work, but it's also probably confusing for people who wonder why some of their instances aren't doing COW from the base image. Some thinking on how best to handle that is probably required.
> Don't we also have the error metadata key where we already report errors from the import tasks,
> maybe we can reuse that?
Yup, and I suppose we could, or another similar one for locations specifically. But if we've got status on a location, that is probably enough.
> Plus, if the hash isn't going to be checked by nova when the image is consumed, recomputing it
> for each location seems kind of pointless
Making nova check the hash before the boot is definitely counterproductive to the goal of fast COW boots. However, I think (one of) the benefit(s) of what Erno is proposing is that glance checks it once for the benefit of everyone and then we can trust it going forward. I think it would generate an intolerable amount of glance CPU and network traffic, but it would certainly be less than every nova checking it every time.
I agree with Brian's #4 and Erno's @All above. The most important thing is getting the tribal knowledge of how glance should be deployed out to the wider audience. Other things we can do between glance, nova and cinder to make this better in the future will take time, and likely have drawbacks that make them undesirable and/or hard to backport. Even if we implement some better hashing and verification, I suspect a lot of people would prefer to deploy two glance APIs, rely on host-level security for who can add locations, and trust that images created from behind the firewall are what they say they are.
Erno Kuvaja (jokke) wrote : | #16 |
I think the asynchronous hash calculation can be approached 2 ways:
1) We don't let the image go to "active" before the multihash has been calculated. This would indeed break anything that currently depends on the snapshot being immediately available, and would slow things down significantly.
2) We let the new image behave just like it does now, but we also kick off the task to calculate the checksum and populate that to the image once it's done.
Then any additional location added via the API would be deferred until its checksum has been verified. This is no extra work when the 'copy-image' import method is used, as it does this already, and at least I do not know of any current usage where additional locations would be added to the image record "manually", apart from the said malicious activity.
As the COW consumers are not interested in checking the hash anyway, I'd be in favour of #2, which would not block the current legitimate usage patterns but would give both the user and Glance a way to verify that the data is intact. Should consumers ever change their attitude on this, they could have logic that starts calculating their own hash and does the comparison once Glance has updated the record in its database.
Brian Rosmaita (brian-rosmaita) wrote : | #17 |
@Jeremy: I'll post a draft OSSN later today as a comment in this bug.
affects: | ossa → ossn |
Brian Rosmaita (brian-rosmaita) wrote : | #18 |
Still working on the draft. In the meantime, I found one more place documenting this issue (leaving it here so we remember to revise it later):
The help text for 'show_multiple_locations' in the sample glance-api.conf:
# DEPRECATED:
# Show all image locations when returning an image.
#
# This configuration option indicates whether to show all the image
# locations when returning image details to the user. When multiple
# image locations exist for an image, the locations are ordered based
# on the location strategy indicated by the configuration opt
# ``location_strategy``. The image locations are shown under the
# image property ``locations``.
#
# NOTES:
# * Revealing image locations can present a GRAVE SECURITY RISK as
# image locations can sometimes include credentials. Hence, this
# is set to ``False`` by default. Set this to ``True`` with
# EXTREME CAUTION and ONLY IF you know what you are doing!
# * See https:/
# information.
# * If an operator wishes to avoid showing any image location(s)
# to the user, then both this option and
# ``show_image_direct_url`` need to be set to ``False``.
#
# Possible values:
# * True
# * False
#
# Related options:
# * show_image_direct_url
# * location_strategy
#
# (boolean value)
# This option is deprecated for removal since Newton.
# Its value may be silently ignored in the future.
# Reason: Use of this option, deprecated since Newton, is a security risk and
# will be removed once we figure out a way to satisfy those use cases that
# currently require it. An earlier announcement that the same functionality can
# be achieved with greater granularity by using policies is incorrect. You
# cannot work around this option via policy configuration at the present time,
# though that is the direction we believe the fix will take. Please keep an eye
# on the Glance release notes to stay up to date on progress in addressing this
# issue.
#show_multiple_locations = false
Also, here's the text for 'show_image_direct_url':
#
# Show direct image location when returning an image.
#
# This configuration option indicates whether to show the direct image
# location when returning image details to the user. The direct image
# location is where the image data is stored in backend storage. This
# image location is shown under the image property ``direct_url``.
#
# When multiple image locations exist for an image, the best location
# is displayed based on the location strategy indicated by the
# configuration option ``location_strategy``.
#
# NOTES:
# * Revealing image locations can present a GRAVE SECURITY RISK as
# image locations can sometimes include credentials. Hence, this
# is set to ``False`` by default. Set this to ``True`` with
# EXTREME CAUTION and ONLY IF you know what you are doing!
# * If an operator wishes to avoid showing any image location(s)
# to the user, then both this option and
# ``show_multiple_locations`` need to be set to ``False``.
#
# Possible values:
# * True
# * False
#
# Related options:
# * show_multiple_locations
# * location_stra...
Brian Rosmaita (brian-rosmaita) wrote : | #19 |
Here's a draft OSSN:
Best practices when configuring Glance with COW backends
---
### Summary ###
When deploying Glance in a popular configuration where Glance shares a common
storage backend with Nova and/or Cinder, it is possible to open some known
attack vectors by which malicious data modification can occur. This note
reviews the known issues and suggests a Glance deployment configuration that
can mitigate such attacks.
### Affected Services / Software ###
Glance, all releases
Nova, all releases
Cinder, all releases
### Discussion ###
This note applies to you if you are operating an end-user-facing
glance-api service with the 'show_multiple_locations' option set to True
(the default value is False) or if your end-user-facing glance-api has
the 'show_image_direct_url' option set to True.
Your exposure is less if you have *only* 'show_image_direct_url' enabled,
but the deployment configuration suggested below is recommended for your
case as well.
The attack vector was originally outlined in OSSN-0065 [0], though that
note was not clear about the attack surface or mitigation, and contained
some forward-looking statements that were not fulfilled. (Though it
does contain a useful discussion of image visibility and its associated
policy settings.)
The subject of OSSN-0065 is "Users of Glance may be able to replace
active image data", but it suggests that this is only an issue when
users do not checksum their image data. It neglects the fact that in
some popular deployment configurations in which Nova creates a root disk
snapshot, data is never uploaded to Glance, but instead a snapshot is
created directly in the backend and Nova creates a Glance image record
with size 0 and no os_hash_value [1], making it impossible to compare
the hash of downloaded image data to the value maintained by Glance.
Further, when Nova is so configured, Nova efficiently creates a root
disk directly in the backend without checksumming the image data (which
is not necessarily a flaw, it's the whole point of this configuration).
Similarly, when using a shared backend, or a cinder glance_store, Cinder
will efficiently clone a volume created from an image directly in the
backend without checksumming the image data.
The attack vector is the one outlined by OSSN-0065, namely:
[A] malicious user could create an image in Glance, set an additional
location on that image pointing to an altered image, then delete the
original location, so that consumers of the original image would
unwittingly be using the malicious image. Note, however, that this
attack vector cannot change the original image's checksum, and it is
limited to images that are owned by the attacker.
OSSN-0065 suggested that this attack vector could be addressed by using
policies, but that turned out not to be the case. The only way currently
to close this vector is to deploy an internal-only glance-api, used
by Nova and Cinder, with show_multiple_locations enabled, and an
end-user-facing glance-api with show_multiple_locations disabled.
This was suggested in "Known Issues" in Glance release notes in the
Rocky [2] through Ussuri releases, but it seems tha...
Dan Smith (danms) wrote : | #20 |
> but instead a snapshot is
> created directly in the backend and Nova creates a Glance image record
> with size 0 and no os_hash_value [1
I think it's important to call out that even if you have an image that was uploaded and a hash was calculated, someone could *later* change the data in the backend. Since nova doesn't (and can't really without a lot of extra work) know that the hash doesn't match the image it's about to fast clone, the hash might look like it's there, you know it *was* correct, but nova will not check it to see that it no longer matches.
> A glance-api service with 'show_multiple_locations' enabled should
> *never* be exposed directly to end users. This setting should only
> be enabled on an internal-only glance-api service used by other
> OpenStack services that require access to image locations.
I wonder if we should be more specific about "run two glance-apis with different config and use the public/internal endpoint types in keystone to differentiate; also make sure the internal one is not accessible to the users (i.e. firewalled)". You imply it with "never exposed to users" but...
Brian Rosmaita (brian-rosmaita) wrote : | #21 |
Revisions to comment #19:
### Affected Services / Software ###
-Glance, all releases
-Nova, all releases
-Cinder, all releases
+Glance, all supported and extended-maintenance releases
### Discussion ###
This note applies to you if you are operating an end-user-facing
@@ -67,19 +67,35 @@ is disabled in Glance, it is not possible to manipulate the locations
via the OpenStack Images API. It is worth mentioning, however, that
enabling 'show_image_direct_url' (which is needed to allow OpenStack
services to consume images directly from the storage backend) leaks
-information about the backend to end users, which is never a good thing
-from a security point of view. We therefore recommend that OpenStack
-services that require exposure of the 'direct_url' image property
-be similarly configured to use an internal-only glance-api endpoint.
-(End users who wish to download an image do not require access to the
-direct_url image property because they can simply use the image data
-download API call [3].)
+information about the backend to end users. What exactly that
+information consists of depends upon the backend in use and how it is
+configured, but in general, the safest course of action is not to expose
+it at all. Keep in mind that in any Glance/Nova/Cinder configuration
+where Nova and/or Cinder do copy-on-write directly in the image store,
+image data transfer takes place outside Glance's image data download
+path, and hence the os_hash_value is *not* checked. Thus, if the
+backend store is compromised, and image data is replaced directly in the
+backend, the substitution will *not* be detected. That's why it is
+important not to give malicious actors unnecessary hints about the image
+storage backend.
+
+We therefore recommend that OpenStack services that require exposure of
+the 'direct_url' image property be similarly configured to use an
+internal-only glance-api endpoint. (End users who wish to download an
+image do not require access to the direct_url image property because
+they can simply use the image-data-download API call [3].)
### Recommended Actions ###
A glance-api service with 'show_multiple_locations' enabled should
-*never* be exposed directly to end users. This setting should only
-be enabled on an internal-only glance-api service used by other
-OpenStack services that require access to image locations.
+*never* be exposed directly to end users. This setting should only be
+enabled on an internal-only glance-api service used by other OpenStack
+services that require access to image locations. This could be done,
+for example, by running two glance-api services with different
+configuration files and using the appropriate configuration options for
+each service to specify the Image API endpoint to access, and making
+sure the special internal endpoint is firewalled in such a way that only
+the appropriate OpenStack services can contact it.
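For illustration, a sketch of the two-service layout described in the Recommended Actions; the glance options are the ones discussed in this note, while the file paths and nova's [glance] endpoint_override (a keystoneauth adapter option) are illustrative and should be verified against your release:

```ini
# External, end-user-facing glance-api
# (e.g. /etc/glance/glance-api-external.conf):
[DEFAULT]
show_multiple_locations = False
show_image_direct_url = False

# Internal glance-api, firewalled so only OpenStack services reach it
# (e.g. /etc/glance/glance-api-internal.conf):
[DEFAULT]
show_multiple_locations = True
show_image_direct_url = True

# nova.conf: point Nova at the internal Image API endpoint; Cinder has
# an analogous [glance] section.
[glance]
endpoint_override = http://glance-internal.example.com:9292
```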
Brian Rosmaita (brian-rosmaita) wrote : | #22 |
About the Recommended Actions change: It's a bit more specific, but doesn't mention using the Keystone public/internal endpoint type for this purpose because the keystone docs [0] describe the 'internal' type as accessible to end users, and operators may already be using it in that way. I think the way to go is to use the Nova [glance]
[0] https:/
Erno Kuvaja (jokke) wrote : | #23 |
@Brian ok, let me make my stance on this very clear so it's on file and we can agree to disagree. I do think we have the possibility to mitigate the issue on the Glance side as well as on the deployment front, and thus I do not agree with your rushed exposure of the details on this.
Obviously, should you keep discussing this outside of the embargo and "agitate" it to public like you put it, we will probably be on the wrong side of that. If we decide to ignore this issue on the Glance side, we should at least give the courtesy of a heads up to TripleO and OSA (are there other deployment projects under the umbrella?) before throwing them and all their users under the bus.
Brian Rosmaita (brian-rosmaita) wrote : | #24 |
@Erno: I am fine with holding this if you think we can have a resolution before 19 December, I guess. My concern is that this is an obvious attack vector -- all the code is available, and anyone scanning the config file sees "GRAVE SECURITY RISK" associated with the settings for show_multiple_locations and show_image_direct_url.
I'm not clear on what holding this gets us. The COW glance configuration is popular for space and time optimization, and I'm not sure what operators will accept. I really don't see the point of computing missing hash values if they're not being checked at the point of image data consumption, and that's exactly what operators don't want.
Anyway, let's continue to discuss this, being specific about the glance-side changes that would mitigate this. If we can fix and backport a good solution, I'm all for keeping this private while we get that done, though I really don't see the point of the privacy, because I think the exploit is already known.
Erno Kuvaja (jokke) wrote : | #25 |
Like I explained before in my comments, it would provide us 2 things:
No-one (say, even someone with the OpenStack admin role who has access to the internal endpoint) could swap the data through the locations API; they would need write access to the actual storage to do so. With Ceph, which is clearly our biggest worry here, the location points to a snapshot that is read-only, and modifying it would again require a location update on the image, closing even that vector.
The users would have a mechanism to verify the image data (by downloading it, for example with glanceclient), regardless of whether their preferred consumption method does that as part of the deployment workflow. This would be available for all images available to them, be they public, community, or shared.
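For illustration, a minimal sketch of that user-side verification against the recorded multihash (endpoint, token, and image ID are placeholders):

```python
# Sketch: verify downloaded image data against the recorded multihash.
# GLANCE, TOKEN, and IMAGE_ID are placeholders.
import hashlib
import requests

GLANCE = "http://glance.example.com:9292"
IMAGE_ID = "11111111-2222-3333-4444-555555555555"
HEADERS = {"X-Auth-Token": "TOKEN"}

# Fetch the image record, which carries os_hash_algo/os_hash_value.
meta = requests.get(f"{GLANCE}/v2/images/{IMAGE_ID}", headers=HEADERS).json()
algo, expected = meta["os_hash_algo"], meta["os_hash_value"]

# Stream the image data and hash it as it downloads.
h = hashlib.new(algo)
with requests.get(f"{GLANCE}/v2/images/{IMAGE_ID}/file",
                  headers=HEADERS, stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        h.update(chunk)

print("data intact" if h.hexdigest() == expected else "DATA MODIFIED")
```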
Like I said earlier, we made very poor assumptions when we expected this issue to just go away through deployment-model advice during OSSN-0065, and lots of those assumptions relied on the image having a checksum and that checksum being verified. We could at least make sure the hash is calculated, and verify it during all of our own operations; we can do that without breaking the current legitimate use cases.
My main concern is that we effectively reissue OSSN-0065 and nothing changes, as we sweep the security holes under the carpet again because it's convenient to push the responsibility onto the operators.
Erno Kuvaja (jokke) wrote : | #26 |
Oh, and just to be clear, I reported this as private as it's trivial to turn the discussion public once we are sure the discussion is not exposing vulnerabilities.
So I'm all for making this public if we are 100% sure there is nothing publicly new in this. Which I'm not convinced of, at least not yet.
Brian Rosmaita (brian-rosmaita) wrote : | #27 |
OK, I think I am confusing two issues here:
1. The image-location-manipulation exploit (adding a malicious location to an image and deleting the original).
2. Backend data substitution, i.e. the data being modified in the storage backend itself:
time t: end user requests nova createImage action
time t+1: glance posts os_hash_value
time t+2: end user downloads image and computes hash, OK
time t+3: end user requests nova to boot an instance from the image
If nova doesn't check the hash before booting the image, which it doesn't in the COW configuration, then how does the end user know that the image data hasn't been modified between t+2 and t+3? This exploit is facilitated by exposing the direct_url or locations on images to end users, so having an internal-only glance-api deployment helps with issue #1.
But it doesn't solve the larger issue of backend data substitution.
So, if we can get #1 done quickly in a way that doesn't kill the performance of hyperconverged infrastructure, then I am OK with keeping this private. I think that #2 is a real problem, however, that could use some discussion at the PTG. My question is whether we think the backend-data-substitution issue needs to be kept private as well.
Erno Kuvaja (jokke) wrote : | #28 |
@Brian, for your #2 I think cinder is still an issue. The image data in Ceph consists of the image "object/file", if you wish, named with the image ID, and a read-only snapshot of it called 'snap'. The location of an RBD image in glance points to that snap of the object. So a malicious user would need to replace that snap to be able to change the image data, which is not possible if there are other references to that snap (say, an already running COW VM of it). One can modify the image data object and create a new snapshot of it, but that would require an update to the database, which solving #1 would prevent.
Not bullet-proof for every corner case, but heavily resistant compared to our present situation.
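For reference, a Glance RBD location URI has the general form below; the final component is the snapshot name ('snap' by Glance convention), and this is what allows a location to be matched against an image ID:

```
rbd://<ceph-fsid>/<pool>/<image-id>/snap
```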
Dan Smith (danms) wrote : | #29 |
I think we really need to have anything like #1 available for all the supported branches if we're going to hold this up for that. I share Brian's concern on that being available in a timely manner. But I also think that it's not a reasonable resolution to the core problem because people using COW boots and snapshots are doing so specifically to *avoid* the need to do long and expensive operations there.
I think that the original OSSN did not clearly prescribe the way out of the box for this and as such we shouldn't use the lack of deployments using two endpoints as a gauge for whether or not people or deployment tools are aware of it. This originally got raised downstream when we were talking to deployment people and specifically asking about a split API horizon for this reason. They had no idea it was needed.
So again I'd say I think the far greater good is getting the information on how to mitigate this for all deployments out to the people. Changes to allow for tighter hashing controls in glance are good, but they're not going to be an acceptable solution for most of the affected users, I think. Deploying a second set of glance workers trades a little memory, which is a lot less expensive than the time and CPU load required for the hashing option.
Just MHO!
Pranali Deore (pranali-deore) wrote : | #30 |
Let's have a call to discuss this in terms of when to make it public and what to cover in the OSSN.
I'm adding some time slots options below, please let me know your availability,
Monday, 10th OCT - 14:00 UTC - 14:45 UTC?
Tuesday, 11th OCT - 14:00 UTC - 14:45 UTC?
I think 45 mins would be enough but we can stretch it if required or we can conclude early as well.
Jeremy Stanley (fungi) wrote : | #31 |
I'm free for all of those except 15z on Tuesday, thanks!
Dan Smith (danms) wrote : | #32 |
I can make any of those work.
Abhishek Kekane (abhishek-kekane) wrote : | #33 |
Monday 10th will be good for me; in case needed, I can adjust for the 11th. Thanks!
Brian Rosmaita (brian-rosmaita) wrote : | #34 |
Monday is best for me.
Pranali Deore (pranali-deore) wrote : | #35 |
Thanks everyone !! I've scheduled the meeting today, 10th OCT at 1400 UTC.
Erno Kuvaja (jokke) wrote : | #36 |
Jeremy Stanley (fungi) wrote : | #37 |
To summarize my takeaway from the call, the risk of exploit in basically all cases boils down to some trusted account "going rogue" and substituting a malicious image (perhaps after validation by the consumer), with their actions going entirely unnoticed. The currently proposed patch represents a new feature in Glance of the level that would normally require a formal specification and trigger broad discussion around API behavior changes and potential performance regressions. I don't think the risks presented outweigh the need for public design process around the proposed feature, so I'm recommending we switch this bug to public once the participants here are comfortable with the drafted guidance to operators, and then proceed with the code changes in public review where it can be better scrutinized and more thoroughly tested.
Brian Rosmaita (brian-rosmaita) wrote : | #38 |
- Attachment: Latest version of the OSSN (6 Oct 2022) (8.2 KiB, text/plain)
Erno Kuvaja (jokke) wrote : | #39 |
+As this addresses a known issue, it is not an embargoed note concerning
+a zero-day exploit. If, however, you are learning about this for the
+first time, and you are exposing image locations to end users, it is
+possible to limit the scope of the exploit described herein immediately
+by restricting Glance policies related to image sharing:
+
+- "publicize_image" governs the ability to make an image available
+ to all users in a cloud, and such images appear in the default
+ image-list response for all users. It is restricted by default
+ to be admin-only.
+
+- "communitize_image" governs the ability to make an image available
+ to all users, though it does not appear in the default image-list
+ response for all users. The default configuration allows any
+ image owner to do this.
+
+- "add_member" governs the ability to share an image with particular
+ other projects. The default configuration allows any image owner
+ to do this.
+
+Restricting these to admin-only would limit the exploit to a single
+project, but given that it still allows for a disgruntled user to
+maliciously modify images within that project, it is not recommended
+as a long term solution.
I would not include this section. It gives a false sense of security, while it does not prevent using already shared, community, or public images through this vector.
Erno Kuvaja (jokke) wrote : | #40 |
+OSSN-0065 suggested that this attack vector could be addressed by using
+policies, but that turned out not to be the case. The only way currently
+to close this vector is to deploy an internal-only glance-api, used
+by Nova and Cinder, with show_multiple_locations enabled, and an
+end-user-facing glance-api with show_multiple_locations disabled.
"The only way currently mitigate this vector is to deploy" The dual deployment does not close the attack vector, just limits it from external users. Without patching the gapi service code the only way to close this vector is to not enable "show_image_
Brian Rosmaita (brian-rosmaita) wrote : | #41 |
@Erno #39:
I see how this could be misleading. Instead of removing it completely, since this is a "best practices" doc, how about I rephrase it as a reminder of how malicious images can be spread to other users (independently of this exploit) ... or do you think that's already clear from our current documentation? (I don't have a problem with removing it completely.)
Also, is it worth reminding operators about image deactivation?
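For reference, image deactivation can take a suspect image out of circulation while it is inspected; with python-openstackclient, for example:

```console
$ openstack image set --deactivate <image-id>   # blocks non-admin download/use
$ openstack image set --activate <image-id>     # restore once verified
```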
Brian Rosmaita (brian-rosmaita) wrote : | #42 |
@Erno #40:
I have no problem rephrasing as you suggest.
Erno Kuvaja (jokke) wrote : | #43 |
@Brian
I think repeating the same info/mistakes we made with OSSN-0065 is not beneficial. Thus I'd like to avoid going that route and just refer to OSSN-0065 for the previous conversation. Hence my take to simply remove the section copied from there.
I also realized that the start of the Discussion section gives the impression that this is vulnerable only when the nodes that have show_multiple_locations enabled are public, but that's not the case. Fixing the deployments does not fix the issue, it just limits its accessibility from the public.
+This note applies to you if you are operating an end-user-facing
+glance-api service with the 'show_multiple_locations' option set to True
+(the default value is False) or if your end-user-facing glance-api has
+the 'show_image_direct_url' option set to True.
+Your exposure is less if you have *only* 'show_image_direct_url' enabled,
+but the deployment configuration suggested below is recommended for your
+case as well.
I'd change the first paragraph to something like:
This note applies to you if you are operating a glance-api service with
the 'show_multiple_locations' option set to True (the default value is
False) or if your end-user-facing glance-api has the
'show_image_direct_url' option set to True. Your exposure is less if you
have *only* 'show_image_direct_url' enabled, or if the glance-api that
has 'show_multiple_locations' enabled is internal service facing only,
but the deployment configuration suggested below is recommended for your
case as well.
Brian Rosmaita (brian-rosmaita) wrote : | #44 |
@Erno #43
I agree that the visibility stuff just confuses the issue. Based on your comments, I think I should restructure the entire note along these lines:
1. If you're using a COW backend configuration, you should deploy dual glances (I probably won't use that term, but you know what I mean).
2. The COW backend efficiency/security tradeoff.
3. What we mean by "dual glances" <-- with reference to the nova/cinder config options.
4. Why: show_multiple_locations
5. Why: show_image_direct_url
@Everyone:
I won't get started on this until around 1800 UTC today, so if you have comments before then, please leave them!
Erno Kuvaja (jokke) wrote : | #45 |
@Brian ref #44
That sounds like a plan.
Brian Rosmaita (brian-rosmaita) wrote : | #46 |
- Attachment: Rewritten version of the OSSN (12 Oct 2022) (8.6 KiB, text/plain)
Brian Rosmaita (brian-rosmaita) wrote : | #47 |
Added Jay and Julia to the bug since we've already decided it should be worked on in public, and just in case there's any ironic-specific info that should be added to the OSSN.
Dan Smith (danms) wrote : | #48 |
Brian, the latest version (12 Oct 2022) looks great to me, thanks!
Julia Kreger (juliaashleykreger) wrote : | #49 |
The latest also looks good to me. Thanks!
Jay Faulkner (jason-oldos) wrote : | #50 |
+1 to the doc, thanks Brian!
Abhishek Kekane (abhishek-kekane) wrote : | #51 |
+1 from me as well, thank you Brian!!
Pranali Deore (pranali-deore) wrote : | #52 |
+1 to the doc from me as well, Thanks !
I think we should go ahead and make it public as everyone agrees.
Jeremy Stanley (fungi) wrote : | #53 |
It looks like the revision attached to comment #46 addresses the points Erno raised, and has received consensus among other reviewers subscribed. In order not to further delay publication and make discussion of forward progress at the PTG harder, let's proceed with publication (even though I wouldn't normally recommend that on a Friday, the impact for this shouldn't pose a significant problem for our community).
Brian: Please push the final draft to https:/
description: | updated |
Changed in ossn: | |
status: | Incomplete → In Progress |
assignee: | nobody → Brian Rosmaita (brian-rosmaita) |
information type: | Private Security → Public |
tags: | added: security |
summary: | Malicious image data modification can happen when using COW → OSSN-090: Malicious image data modification can happen when using COW |
summary: | OSSN-090: Malicious image data modification can happen when using COW → OSSN-0090: Malicious image data modification can happen when using COW |
Brian Rosmaita (brian-rosmaita) wrote : | #54 |
Pushed the OSSN as:
https:/
Jeremy Stanley (fungi) wrote : | #55 |
Please also remember to send an OpenPGP-signed copy to the openstack-discuss and openstack-announce mailing lists (I'll expedite moderator approval through the latter).
information type: | Public → Public Security |
Nick Tait (nickthetait) wrote : | #56 |
I have a usability improvement idea: as OSSNs are designed for operators (as opposed to OpenStack developers) I would recommend replacing the acronym COW with the full name "Copy On Write." But I am not sure if such a tweak would be possible since it has already been released.
Jeremy Stanley (fungi) wrote : | #57 |
It's already been revised once since publication by https:/
Nick Tait (nickthetait) wrote : | #58 |
I agree a re-announcement is not needed. I've got a local commit ready to publish, but it's been ages since I've submitted a change to OpenDev... could someone point me toward some docs that tell me what my next step is?
Brian Rosmaita (brian-rosmaita) wrote : | #59 |
@Nick: this is probably more basic than you need, but it contains some links that may be helpful:
https:/
Nick Tait (nickthetait) wrote : | #60 |
Brian, that was a useful reminder, but I ultimately gave up on trying to submit it. Anyway, I did reserve CVE-2022-4134 to track this issue.
I've added Pranali to the bug as the new Glance PTL and she is aware of the issue.