2017-11-17 19:38:54 |
Matt Riedemann |
bug |
|
|
added bug |
2017-11-17 19:39:32 |
Matt Riedemann |
nominated for series |
|
nova/ocata |
|
2017-11-17 19:39:32 |
Matt Riedemann |
bug task added |
|
nova/ocata |
|
2017-11-17 19:39:32 |
Matt Riedemann |
nominated for series |
|
nova/newton |
|
2017-11-17 19:39:32 |
Matt Riedemann |
bug task added |
|
nova/newton |
|
2017-11-17 19:39:32 |
Matt Riedemann |
nominated for series |
|
nova/pike |
|
2017-11-17 19:39:32 |
Matt Riedemann |
bug task added |
|
nova/pike |
|
2017-11-17 19:39:43 |
Matt Riedemann |
nova/pike: status |
New |
Confirmed |
|
2017-11-17 19:47:24 |
Jeremy Stanley |
description |
The fix for bug 1664931 (OSSA-2017-005, CVE-2017-16239) introduced a regression which allows a potential denial of service.
Once all computes are upgraded to >=Pike and are using the (default) FilterScheduler, a rebuild with a new image will go through the scheduler. The FilterScheduler doesn't know that this is a rebuild on the same host, so it creates VCPU/MEMORY_MB/DISK_GB allocations in Placement against the compute node that the instance is running on. The ResourceTracker in the nova-compute service will not adjust the allocations after the rebuild, so over multiple rebuilds of the same instance with a new image, the Placement service will eventually report the compute node as having no capacity left and will take it out of scheduling consideration.
Eventually the rebuild would fail once the compute node is at capacity, but an attacker could then simply create a new instance (on a new host) and start the process all over again.
I have a recreation of the bug here: https://review.openstack.org/#/c/521153/
This would not be a problem for anyone using another scheduler driver since only FilterScheduler uses Placement, and it wouldn't be a problem for any deployment that still has at least one compute service running Ocata code, because the ResourceTracker in the nova-compute service will adjust the allocations every 60 seconds.
Beyond this issue, however, there are other problems with the fix for bug 1664931:
1. Even if you're not using the FilterScheduler (e.g. you're using the CachingScheduler) but have the RamFilter, DiskFilter or CoreFilter enabled, a rebuild with a new image may now fail if the compute node that the instance is running on is at capacity, whereas before it wouldn't. This is a regression in behavior, and the user would have to delete and recreate the instance with the new image.
2. Before the fix for bug 1664931, one could rebuild an instance on a disabled compute service, but now one cannot if the ComputeFilter is enabled (which it is by default and is presumably enabled in all deployments).
3. Because of the way instance.image_ref is used with volume-backed instances, we are now *always* going through the scheduler during rebuild of a volume-backed instance, regardless of whether or not the image ref provided to the rebuild API is the same as the original in the root disk. I've already reported bug 1732947 for this.
--
The nova team has looked at some potential solutions, but at this point none of them are straightforward, and some involve using scheduler hints which are tied to filters that are not enabled by default (e.g. using the same_host scheduler hint, which requires that the SameHostFilter is enabled). Hacking in a fix would likely result in more bugs in subtle or unforeseen ways that would not be caught during testing.
Long-term, we think a better way to fix the rebuild-with-new-image validation is to categorize each scheduler filter as either a 'resource' or a 'policy' filter; for a rebuild with a new image, we would only run the filters for policy constraints (like ImagePropertiesFilter) and would not run RamFilter/DiskFilter/CoreFilter (or Placement, for that matter). This would likely require an internal RPC API version change on the nova-scheduler interface, which is something we wouldn't want to backport to stable branches because of the upgrade implications of the RPC API version bump.
At this point it might be best to just revert the fix for bug 1664931. We can still revert it through all of the upstream branches that the fix was applied to (newton is not EOL yet). This is obviously a pain for downstream consumers that have already picked up and shipped fixes for the CVE. It would also mean publishing an erratum for CVE-2017-16239 (we probably have to do that anyway) and saying it is now no longer fixed but is a publicly known issue.
Another possible alternative is shipping a new policy rule in nova that allows operators to disable rebuilding an instance with a new image, so they could decide, based on the types of images and the scheduler configuration they have, whether rebuilding with a new image is safe. Public and private cloud providers might find that rule useful in different ways, e.g. disabling rebuild with a new image if you allow tenants to upload their own images to your cloud. |
This issue is being treated as a potential security risk under embargo. Please do not make any public mention of embargoed (private) security vulnerabilities before their coordinated publication by the OpenStack Vulnerability Management Team in the form of an official OpenStack Security Advisory. This includes discussion of the bug or associated fixes in public forums such as mailing lists, code review systems and bug trackers. Please also avoid private disclosure to other individuals not already approved for access to this information, and provide this same reminder to those who are made aware of the issue prior to publication. All discussion should remain confined to this private bug report, and any proposed fixes should be added to the bug as attachments.
The fix for bug 1664931 (OSSA-2017-005, CVE-2017-16239) introduced a regression which allows a potential denial of service.
Once all computes are upgraded to >=Pike and are using the (default) FilterScheduler, a rebuild with a new image will go through the scheduler. The FilterScheduler doesn't know that this is a rebuild on the same host, so it creates VCPU/MEMORY_MB/DISK_GB allocations in Placement against the compute node that the instance is running on. The ResourceTracker in the nova-compute service will not adjust the allocations after the rebuild, so over multiple rebuilds of the same instance with a new image, the Placement service will eventually report the compute node as having no capacity left and will take it out of scheduling consideration.
Eventually the rebuild would fail once the compute node is at capacity, but an attacker could then simply create a new instance (on a new host) and start the process all over again.
I have a recreation of the bug here: https://review.openstack.org/#/c/521153/
This would not be a problem for anyone using another scheduler driver since only FilterScheduler uses Placement, and it wouldn't be a problem for any deployment that still has at least one compute service running Ocata code, because the ResourceTracker in the nova-compute service will adjust the allocations every 60 seconds.
Beyond this issue, however, there are other problems with the fix for bug 1664931:
1. Even if you're not using the FilterScheduler (e.g. you're using the CachingScheduler) but have the RamFilter, DiskFilter or CoreFilter enabled, a rebuild with a new image may now fail if the compute node that the instance is running on is at capacity, whereas before it wouldn't. This is a regression in behavior, and the user would have to delete and recreate the instance with the new image.
2. Before the fix for bug 1664931, one could rebuild an instance on a disabled compute service, but now one cannot if the ComputeFilter is enabled (which it is by default and is presumably enabled in all deployments).
3. Because of the way instance.image_ref is used with volume-backed instances, we are now *always* going through the scheduler during rebuild of a volume-backed instance, regardless of whether or not the image ref provided to the rebuild API is the same as the original in the root disk. I've already reported bug 1732947 for this.
--
The nova team has looked at some potential solutions, but at this point none of them are straightforward, and some involve using scheduler hints which are tied to filters that are not enabled by default (e.g. using the same_host scheduler hint, which requires that the SameHostFilter is enabled). Hacking in a fix would likely result in more bugs in subtle or unforeseen ways that would not be caught during testing.
Long-term, we think a better way to fix the rebuild-with-new-image validation is to categorize each scheduler filter as either a 'resource' or a 'policy' filter; for a rebuild with a new image, we would only run the filters for policy constraints (like ImagePropertiesFilter) and would not run RamFilter/DiskFilter/CoreFilter (or Placement, for that matter). This would likely require an internal RPC API version change on the nova-scheduler interface, which is something we wouldn't want to backport to stable branches because of the upgrade implications of the RPC API version bump.
At this point it might be best to just revert the fix for bug 1664931. We can still revert it through all of the upstream branches that the fix was applied to (newton is not EOL yet). This is obviously a pain for downstream consumers that have already picked up and shipped fixes for the CVE. It would also mean publishing an erratum for CVE-2017-16239 (we probably have to do that anyway) and saying it is now no longer fixed but is a publicly known issue.
Another possible alternative is shipping a new policy rule in nova that allows operators to disable rebuilding an instance with a new image, so they could decide, based on the types of images and the scheduler configuration they have, whether rebuilding with a new image is safe. Public and private cloud providers might find that rule useful in different ways, e.g. disabling rebuild with a new image if you allow tenants to upload their own images to your cloud. |
|
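For reference, the repeated-rebuild behavior described above can be driven with nothing more than the compute API's rebuild action. The following is a minimal sketch (separate from the recreation linked above), assuming the tenant already has an ACTIVE instance and two usable images; the endpoint, token, server UUID and image UUIDs are placeholders:

# Sketch only: rebuild one instance with alternating image refs so each
# pass goes through the FilterScheduler and leaves another set of
# VCPU/MEMORY_MB/DISK_GB allocations against the host in Placement.
import time
import requests

COMPUTE_URL = "http://controller:8774/v2.1"    # placeholder endpoint
TOKEN = "<keystone-token>"                     # placeholder token
SERVER_ID = "<server-uuid>"
IMAGES = ["<image-uuid-a>", "<image-uuid-b>"]  # alternate so imageRef changes

headers = {"X-Auth-Token": TOKEN, "Content-Type": "application/json"}

for i in range(20):
    body = {"rebuild": {"imageRef": IMAGES[i % 2]}}
    requests.post("%s/servers/%s/action" % (COMPUTE_URL, SERVER_ID),
                  json=body, headers=headers).raise_for_status()
    time.sleep(5)  # give the rebuild a moment to start
    # Wait for the rebuild to finish before issuing the next one.
    while True:
        server = requests.get("%s/servers/%s" % (COMPUTE_URL, SERVER_ID),
                              headers=headers).json()["server"]
        if server["status"] in ("ACTIVE", "ERROR"):
            break
        time.sleep(5)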
2017-11-17 19:47:41 |
Jeremy Stanley |
bug task added |
|
ossa |
|
2017-11-17 19:47:59 |
Jeremy Stanley |
ossa: status |
New |
Incomplete |
|
2017-11-17 19:48:14 |
Jeremy Stanley |
bug |
|
|
added subscriber Nova Core security contacts |
2017-11-17 19:48:43 |
Matt Riedemann |
bug |
|
|
added subscriber Dan Smith |
2017-11-17 19:48:49 |
Matt Riedemann |
bug |
|
|
added subscriber Sylvain Bauza |
2017-11-18 21:36:11 |
Matt Riedemann |
bug task deleted |
nova/newton |
|
|
2017-11-18 21:36:17 |
Matt Riedemann |
bug task deleted |
nova/ocata |
|
|
2017-11-19 23:36:47 |
Dan Smith |
bug |
|
|
added subscriber Joshua Padman |
2017-11-20 03:14:57 |
Tristan Cacqueray |
cve linked |
|
2017-16239 |
|
2017-11-20 15:37:50 |
Jeremy Stanley |
bug |
|
|
added subscriber Matt Van Winkle |
2017-11-20 15:38:23 |
Jeremy Stanley |
bug |
|
|
added subscriber George Shuklin |
2017-11-20 15:38:51 |
Jeremy Stanley |
bug |
|
|
added subscriber Mohammed Naser |
2017-11-20 15:39:18 |
Jeremy Stanley |
bug |
|
|
added subscriber Nolwenn Cauchois |
2017-11-20 15:39:44 |
Jeremy Stanley |
bug |
|
|
added subscriber OSSG CoreSec |
2017-11-20 15:44:43 |
Tristan Cacqueray |
bug |
|
|
added subscriber Thomas Goirand |
2017-11-21 00:56:02 |
Tristan Cacqueray |
bug |
|
|
added subscriber Matthew Thode |
2017-11-21 18:37:07 |
Jeremy Stanley |
ossa: status |
Incomplete |
Confirmed |
|
2017-11-29 12:38:48 |
Jeremy Stanley |
summary |
Potential DoS by rebuilding the same instance with a new image multiple times |
Potential DoS by rebuilding the same instance with a new image multiple times (CVE-2017-17051) |
|
2017-11-29 12:39:08 |
Jeremy Stanley |
cve linked |
|
2017-17051 |
|
2017-11-30 00:37:22 |
Jeremy Stanley |
ossa: status |
Confirmed |
Fix Committed |
|
2017-11-30 00:37:27 |
Jeremy Stanley |
ossa: importance |
Undecided |
High |
|
2017-11-30 00:37:32 |
Jeremy Stanley |
ossa: assignee |
|
Jeremy Stanley (fungi) |
|
2017-12-05 15:02:30 |
Jeremy Stanley |
description |
This issue is being treated as a potential security risk under embargo. Please do not make any public mention of embargoed (private) security vulnerabilities before their coordinated publication by the OpenStack Vulnerability Management Team in the form of an official OpenStack Security Advisory. This includes discussion of the bug or associated fixes in public forums such as mailing lists, code review systems and bug trackers. Please also avoid private disclosure to other individuals not already approved for access to this information, and provide this same reminder to those who are made aware of the issue prior to publication. All discussion should remain confined to this private bug report, and any proposed fixes should be added to the bug as attachments.
The fix for bug 1664931 (OSSA-2017-005, CVE-2017-16239) introduced a regression which allows a potential denial of service.
Once all computes are upgraded to >=Pike and are using the (default) FilterScheduler, a rebuild with a new image will go through the scheduler. The FilterScheduler doesn't know that this is a rebuild on the same host, so it creates VCPU/MEMORY_MB/DISK_GB allocations in Placement against the compute node that the instance is running on. The ResourceTracker in the nova-compute service will not adjust the allocations after the rebuild, so over multiple rebuilds of the same instance with a new image, the Placement service will eventually report the compute node as having no capacity left and will take it out of scheduling consideration.
Eventually the rebuild would fail once the compute node is at capacity, but an attacker could then simply create a new instance (on a new host) and start the process all over again.
I have a recreation of the bug here: https://review.openstack.org/#/c/521153/
This would not be a problem for anyone using another scheduler driver since only FilterScheduler uses Placement, and it wouldn't be a problem for any deployment that still has at least one compute service running Ocata code, because the ResourceTracker in the nova-compute service will adjust the allocations every 60 seconds.
Beyond this issue, however, there are other problems with the fix for bug 1664931:
1. Even if you're not using the FilterScheduler (e.g. you're using the CachingScheduler) but have the RamFilter, DiskFilter or CoreFilter enabled, a rebuild with a new image may now fail if the compute node that the instance is running on is at capacity, whereas before it wouldn't. This is a regression in behavior, and the user would have to delete and recreate the instance with the new image.
2. Before the fix for bug 1664931, one could rebuild an instance on a disabled compute service, but now one cannot if the ComputeFilter is enabled (which it is by default and is presumably enabled in all deployments).
3. Because of the way instance.image_ref is used with volume-backed instances, we are now *always* going through the scheduler during rebuild of a volume-backed instance, regardless of whether or not the image ref provided to the rebuild API is the same as the original in the root disk. I've already reported bug 1732947 for this.
--
The nova team has looked at some potential solutions, but at this point none of them are straightforward, and some involve using scheduler hints which are tied to filters that are not enabled by default (e.g. using the same_host scheduler hint, which requires that the SameHostFilter is enabled). Hacking in a fix would likely result in more bugs in subtle or unforeseen ways that would not be caught during testing.
Long-term, we think a better way to fix the rebuild-with-new-image validation is to categorize each scheduler filter as either a 'resource' or a 'policy' filter; for a rebuild with a new image, we would only run the filters for policy constraints (like ImagePropertiesFilter) and would not run RamFilter/DiskFilter/CoreFilter (or Placement, for that matter). This would likely require an internal RPC API version change on the nova-scheduler interface, which is something we wouldn't want to backport to stable branches because of the upgrade implications of the RPC API version bump.
At this point it might be best to just revert the fix for bug 1664931. We can still revert it through all of the upstream branches that the fix was applied to (newton is not EOL yet). This is obviously a pain for downstream consumers that have already picked up and shipped fixes for the CVE. It would also mean publishing an erratum for CVE-2017-16239 (we probably have to do that anyway) and saying it is now no longer fixed but is a publicly known issue.
Another possible alternative is shipping a new policy rule in nova that allows operators to disable rebuilding an instance with a new image, so they could decide, based on the types of images and the scheduler configuration they have, whether rebuilding with a new image is safe. Public and private cloud providers might find that rule useful in different ways, e.g. disabling rebuild with a new image if you allow tenants to upload their own images to your cloud. |
The fix for bug 1664931 (OSSA-2017-005, CVE-2017-16239) introduced a regression which allows a potential denial of service.
Once all computes are upgraded to >=Pike and are using the (default) FilterScheduler, a rebuild with a new image will go through the scheduler. The FilterScheduler doesn't know that this is a rebuild on the same host, so it creates VCPU/MEMORY_MB/DISK_GB allocations in Placement against the compute node that the instance is running on. The ResourceTracker in the nova-compute service will not adjust the allocations after the rebuild, so over multiple rebuilds of the same instance with a new image, the Placement service will eventually report the compute node as having no capacity left and will take it out of scheduling consideration.
Eventually the rebuild would fail once the compute node is at capacity, but an attacker could then simply create a new instance (on a new host) and start the process all over again.
I have a recreation of the bug here: https://review.openstack.org/#/c/521153/
This would not be a problem for anyone using another scheduler driver since only FilterScheduler uses Placement, and it wouldn't be a problem for any deployment that still has at least one compute service running Ocata code, because the ResourceTracker in the nova-compute service will adjust the allocations every 60 seconds.
Beyond this issue, however, there are other problems with the fix for bug 1664931:
1. Even if you're not using the FilterScheduler (e.g. you're using the CachingScheduler) but have the RamFilter, DiskFilter or CoreFilter enabled, a rebuild with a new image may now fail if the compute node that the instance is running on is at capacity, whereas before it wouldn't. This is a regression in behavior, and the user would have to delete and recreate the instance with the new image.
2. Before the fix for bug 1664931, one could rebuild an instance on a disabled compute service, but now one cannot if the ComputeFilter is enabled (which it is by default and is presumably enabled in all deployments).
3. Because of the way instance.image_ref is used with volume-backed instances, we are now *always* going through the scheduler during rebuild of a volume-backed instance, regardless of whether or not the image ref provided to the rebuild API is the same as the original in the root disk. I've already reported bug 1732947 for this.
--
The nova team has looked at some potential solutions, but at this point none of them are straightforward, and some involve using scheduler hints which are tied to filters that are not enabled by default (e.g. using the same_host scheduler hint, which requires that the SameHostFilter is enabled). Hacking in a fix would likely result in more bugs in subtle or unforeseen ways that would not be caught during testing.
Long-term, we think a better way to fix the rebuild-with-new-image validation is to categorize each scheduler filter as either a 'resource' or a 'policy' filter; for a rebuild with a new image, we would only run the filters for policy constraints (like ImagePropertiesFilter) and would not run RamFilter/DiskFilter/CoreFilter (or Placement, for that matter). This would likely require an internal RPC API version change on the nova-scheduler interface, which is something we wouldn't want to backport to stable branches because of the upgrade implications of the RPC API version bump.
At this point it might be best to just revert the fix for bug 1664931. We can still revert it through all of the upstream branches that the fix was applied to (newton is not EOL yet). This is obviously a pain for downstream consumers that have already picked up and shipped fixes for the CVE. It would also mean publishing an erratum for CVE-2017-16239 (we probably have to do that anyway) and saying it is now no longer fixed but is a publicly known issue.
Another possible alternative is shipping a new policy rule in nova that allows operators to disable rebuilding an instance with a new image, so they could decide, based on the types of images and the scheduler configuration they have, whether rebuilding with a new image is safe. Public and private cloud providers might find that rule useful in different ways, e.g. disabling rebuild with a new image if you allow tenants to upload their own images to your cloud. |
|
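The leak described above can also be observed directly by watching the compute node's resource provider usage in Placement grow after each rebuild with a new image. A minimal sketch, assuming a valid token and the compute node's resource provider UUID (all values below are placeholders):

# Sketch only: report current VCPU/MEMORY_MB/DISK_GB usage for the compute
# node's resource provider; with the regression, these numbers keep growing
# after each rebuild of the same instance with a new image.
import requests

PLACEMENT_URL = "http://controller:8778"           # placeholder endpoint
TOKEN = "<keystone-token>"                         # placeholder token
RP_UUID = "<compute-node-resource-provider-uuid>"  # placeholder UUID

resp = requests.get("%s/resource_providers/%s/usages" % (PLACEMENT_URL, RP_UUID),
                    headers={"X-Auth-Token": TOKEN})
resp.raise_for_status()
usages = resp.json()["usages"]
print("VCPU=%s MEMORY_MB=%s DISK_GB=%s" % (
    usages.get("VCPU"), usages.get("MEMORY_MB"), usages.get("DISK_GB")))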
2017-12-05 15:02:43 |
Jeremy Stanley |
information type |
Private Security |
Public Security |
|
2017-12-05 15:08:24 |
Matt Riedemann |
nova: assignee |
|
Dan Smith (danms) |
|
2017-12-05 15:08:27 |
Matt Riedemann |
nova: status |
Triaged |
In Progress |
|
2017-12-05 15:08:29 |
Matt Riedemann |
nova/pike: status |
Confirmed |
In Progress |
|
2017-12-05 15:08:32 |
Matt Riedemann |
nova/pike: importance |
Undecided |
High |
|
2017-12-05 16:11:34 |
OpenStack Infra |
nova: assignee |
Dan Smith (danms) |
Matt Riedemann (mriedem) |
|
2017-12-05 16:14:10 |
Matt Riedemann |
nova: assignee |
Matt Riedemann (mriedem) |
Dan Smith (danms) |
|
2017-12-05 16:25:39 |
OpenStack Infra |
nova/pike: assignee |
|
Matt Riedemann (mriedem) |
|
2017-12-05 16:37:33 |
Jeremy Stanley |
summary |
Potential DoS by rebuilding the same instance with a new image multiple times (CVE-2017-17051) |
[OSSA-2017-006] Potential DoS by rebuilding the same instance with a new image multiple times (CVE-2017-17051) |
|
2017-12-05 16:37:48 |
Jeremy Stanley |
ossa: status |
Fix Committed |
Fix Released |
|
2017-12-06 19:00:30 |
OpenStack Infra |
nova: status |
In Progress |
Fix Released |
|
2017-12-09 02:06:12 |
OpenStack Infra |
nova/pike: status |
In Progress |
Fix Committed |
|