[OSSA 2015-017] Deleting instance while resize instance is running leads to unusable compute nodes (CVE-2015-3280)

Bug #1392527 reported by Tushar Patil
Affects                        Status        Importance  Assigned to          Milestone
OpenStack Compute (nova)       Fix Released  High        Tristan Cacqueray
  Juno                         Fix Released  Undecided   Tony Breeds
  Kilo                         Fix Released  Undecided   Tony Breeds
OpenStack Security Advisory    Fix Released  High        Tristan Cacqueray

Bug Description

Steps to reproduce:
1) Create a new instance and wait until its status goes to the ACTIVE state
2) Call resize API
3) Delete the instance immediately after the task_state is “resize_migrated” or vm_state is “resized”
4) Repeat 1 through 3 in a loop

I have kept the attached program running for 4 hours. All instances created were deleted (nova list returns an empty list), but I noticed that instance directories named "<instance_uuid>_resize" are not deleted from the instance path on the compute nodes (mainly on the source compute nodes where the instance was running before the resize). If I keep this program running for a couple more hours (depending on the number of compute nodes), it completely uses up the disk of the compute nodes (based on the disk_allocation_ratio parameter value). After that, the nova scheduler doesn't select these compute nodes for launching new VMs and starts reporting the error "No valid hosts found".

Note: Even the periodic tasks don't clean up these orphaned instance directories from the instance path.
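
(For reference, below is a minimal reproduction sketch of the loop described above, not the reporter's attached program. It assumes python-novaclient, credentials in the usual OS_* environment variables, and placeholder image/flavor IDs; the OS-EXT-STS attributes may require admin credentials.)

import os
import time

from novaclient import client

nova = client.Client('2',
                     os.environ['OS_USERNAME'],
                     os.environ['OS_PASSWORD'],
                     os.environ['OS_TENANT_NAME'],
                     os.environ['OS_AUTH_URL'])

IMAGE_ID = 'IMAGE_UUID_HERE'   # placeholder
FLAVOR_SMALL = '1'             # placeholder
FLAVOR_BIG = '2'               # placeholder

while True:
    # 1) create an instance and wait for ACTIVE
    server = nova.servers.create('resize-race-test', IMAGE_ID, FLAVOR_SMALL)
    while nova.servers.get(server.id).status != 'ACTIVE':
        time.sleep(2)

    # 2) start a resize
    nova.servers.resize(server.id, FLAVOR_BIG)

    # 3) delete as soon as the task_state/vm_state shows the migration in flight
    while True:
        s = nova.servers.get(server.id)
        task_state = getattr(s, 'OS-EXT-STS:task_state', None)
        vm_state = getattr(s, 'OS-EXT-STS:vm_state', None)
        if task_state == 'resize_migrated' or vm_state == 'resized':
            nova.servers.delete(server.id)
            break
        time.sleep(1)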

CVE References

CVE-2015-3280

Revision history for this message
Tushar Patil (tpatil) wrote :
Revision history for this message
Jeremy Stanley (fungi) wrote :

This sounds like a potential denial of service vector, so I've added an incomplete security advisory task and subscribed Nova's core security reviewers to comment.

Changed in ossa:
status: New → Incomplete
Revision history for this message
Thierry Carrez (ttx) wrote :

Looks pretty valid to me. I bet Nova coresec will confirm :)

Changed in ossa:
importance: Undecided → High
status: Incomplete → Confirmed
Revision history for this message
Thierry Carrez (ttx) wrote :

Independently reported by george-shuklin as bug 1387543

Revision history for this message
Andrew Laski (alaski) wrote :

This is a valid issue. But it should be limited to issuing a delete before the instance.vm_state is 'resized', since there is code to clean up a resized instance if vm_state is 'resized'.

It's going to take a little bit of thought to get the coordination between delete and resize correct for this.

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@Tushar Do you want to acknowledge a company along with your name, or is this reported independently?

Here is impact description draft #1:

Title: Nova instance delete in resize state may leak
Reporter: Tushar Patil
Products: Nova
Versions: up to 2014.1.3 and 2014.2

Description:
Tushar Patil reported a vulnerability in Nova resize state. If an authenticated user deletes an instance while it is in resize state, it will cause the original instance to not be deleted from the compute node it was running on. An attacker can use this to launch a denial of service attack. All Nova setups are affected.

Revision history for this message
George Shuklin (george-shuklin) wrote :

I was first to report it... https://bugs.launchpad.net/nova/+bug/1387543 (2014-10-30)

Revision history for this message
Tushar Patil (tpatil) wrote :

Tristan: If George has reported this issue first, let's give him the credit. I don't have permissions to view bug ID 1387543 so I cannot comment on it.

My company name: NTT DATA, Inc.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@George, oops, I messed up the timeline! Indeed you are the original reporter... So should I also credit "Webzilla LTD" along with your name?

Revision history for this message
George Shuklin (george-shuklin) wrote :

I think it is independent discovery, we both should be noted.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@George @Tushar Alright, apologies for the noise; I have put both of you as reporters in this impact description draft #2:

Title: Nova instance delete in resize state may leak
Reporter: George Shuklin (Webzilla LTD) and Tushar Patil (NTT DATA, Inc)
Products: Nova
Versions: up to 2014.1.3 and 2014.2

Description:
George Shuklin from Webzilla LTD and Tushar Patil from NTT DATA, Inc independently reported a vulnerability in Nova resize state. If an authenticated user deletes an instance while it is in resize state, it will cause the original instance to not be deleted from the compute node it was running on. An attacker can use this to launch a denial of service attack. All Nova setups are affected.

Changed in ossa:
status: Confirmed → Triaged
Changed in ossa:
assignee: nobody → Tristan Cacqueray (tristan-cacqueray)
Revision history for this message
George Shuklin (george-shuklin) wrote :

Thanks, it's nice, but where is the fix? And I need to backport it to havana... (sigh).

Revision history for this message
Jeremy Stanley (fungi) wrote :

George: our vulnerability managers and software developers work in parallel documenting and fixing security bugs. Hopefully a backportable (at least as far as stable/icehouse which is our oldest supported branch) fix will be along soon from one of the Nova developers since Andrew has triaged it at High importance.

Tristan: your latest impact description in comment #11 looks accurate--thanks!

Revision history for this message
Thierry Carrez (ttx) wrote :

Impact desc: 2014.2 up to 2014.2.1 (which was out last week) otherwise looks good.

Revision history for this message
Thierry Carrez (ttx) wrote :

@alaski: any progress on a patch for this one?

Revision history for this message
Andrew Laski (alaski) wrote :

Sorry for the long delay. Still looking for someone to tackle this, or for time to do it myself.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@nova-coresec: this was reported almost 2 months ago... Any progress on the fix?

Updating affected version line:

Title: Nova instance delete in resize state may leak
Reporter: George Shuklin (Webzilla LTD) and Tushar Patil (NTT DATA, Inc)
Products: Nova
Versions: up to 2014.1.3, and 2014.2 versions up to 2014.2.1

Description:
George Shuklin from Webzilla LTD and Tushar Patil from NTT DATA, Inc independently reported a vulnerability in Nova resize state. If an authenticated user deletes an instance while it is in resize state, it will cause the original instance to not be deleted from the compute node it was running on. An attacker can use this to launch a denial of service attack. All Nova setups are affected.

Revision history for this message
Abhishek Kekane (abhishek-kekane) wrote :

Hi,

The attached fix addresses all the issues mentioned in the attached Excel sheet. Please have a look at it.
The migration DB interface code was already submitted by me in patch [1], but since it is not merged yet I have included it here.
[1] : https://review.openstack.org/122347

As of now a lot of unit tests are failing because of the proposed changes. I will fix the broken unit tests once my patch is approved by the security team.
Please review the code.

Revision history for this message
Abhishek Kekane (abhishek-kekane) wrote :
Revision history for this message
Tony Breeds (o-tony) wrote :

I'm not sure what the problem is but I keep getting 'Processing Failed' when I try to view the patch.

Revision history for this message
Abhishek Kekane (abhishek-kekane) wrote :

Hi Tony,

I also tried the same thing: I was able to view the patch twice, but after that I keep getting the 'Processing Failed' error.
I tried to view a patch attached to another bug and I get the same error there as well.

Revision history for this message
Tony Breeds (o-tony) wrote :

@Abhishek Thanks for confirming it's not just me. I'll check again tomorrow.

Revision history for this message
Thierry Carrez (ttx) wrote :

Impact desc +1

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Tony Breeds (o-tony) wrote :

I had to remove the "patch" flag from the attachment to be able to view it.

Revision history for this message
Abhishek Kekane (abhishek-kekane) wrote :

Hi,

The patch attached in comment #18 gives "Processing Failed" when you try to view it.
Attaching the same patch again without setting the "patch" flag.

Revision history for this message
Tony Breeds (o-tony) wrote :

Firstly I want to admit upfront:
1) I'm new to reviewing code in LP. I'm sorry if there is an established protocol I'm breaking. Just tell me the right thing to do.
2) I could be off base here. I'll clearly continue to dig into this.

While there is no doubt that the get_latest_by_instance() API changes are neat, they represent (IIUC) an RPC bump, which would be great to avoid for backports to stable/*. Can't we get the same result using MigrationList.get_by_filters() and then open-code the lookup for the latest migration?

Based on the scenarios you logged it looks like catching the InstanceNotFound exception should be enough, but I have a vague nagging fear that we're missing something. No doubt what you've done goes a long way toward a solution.
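
(For illustration, a rough sketch of the alternative suggested above: open-coding the "latest migration" lookup on top of the existing MigrationList.get_by_filters(). Whether an instance_uuid filter is accepted and which ordering key defines "latest" are assumptions here, not confirmed in this thread.)

from nova import objects


def get_latest_migration(context, instance_uuid):
    # Sketch only: reuse the existing object API instead of adding a new
    # DB/RPC interface. The instance_uuid filter and the ordering key are
    # assumptions.
    migrations = objects.MigrationList.get_by_filters(
        context, {'instance_uuid': instance_uuid})
    if not migrations:
        return None
    # Treat the most recently created record as "latest".
    return max(migrations, key=lambda m: (m.created_at, m.id))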

Revision history for this message
Andrew Laski (alaski) wrote :

Thanks for putting up a patch!

Reviewing the patch has caused me to dig into this issue deeply once again so I want to record some thoughts here.

This issue is exacerbated by not having locks during the resize process. With locking, the terminate process would wait for the resize to complete, at which point the delete would be handled fine. Adding locking is not trivial because the resize executes on both the source and dest hosts.

There is no good way to interrupt a resize. If a delete comes in during a resize operation with no way to interrupt the resize or grab a lock then it's always possible that the resize disk could be created after any cleanup that's done. The traditional way to interrupt a task is to delete the instance db record which will cause an exception to be raised on the next instance.save(), but the delete process doesn't know when the resize code will hit that.

As far as cleaning up the _resize disks, the _poll_unconfirmed_resizes periodic task should eventually handle it, but it typically has a long window before it will clean up an unconfirmed resize, and the task is turned off by default.

Looking at the patch specifically, it doesn't appear to fully address the issue I raised above. Since the resize is not interrupted, it's possible that the cleanup code will run before the _resize disk is created. It will reduce the time window for the race condition but not eliminate it. It's also libvirt-specific, and this vulnerability could exist for any driver that creates a separate disk for resize. The RPC bump is also an issue for backporting the change. And finally, the use of utils.execute to ssh and rm -rf a directory is concerning.

Thinking simply, just ensuring that virt driver destroy() methods look for and clean up resize disks would do a lot to alleviate this.
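
(An illustrative sketch of the destroy-time cleanup idea above for the libvirt driver, not the accepted fix. It assumes local storage only and that removing the "<instance_path>_resize" directory with shutil is sufficient; other drivers and shared storage are not handled.)

import os
import shutil

from nova.virt.libvirt import utils as libvirt_utils


def _cleanup_resize_leftovers(instance):
    # Best-effort removal of a stale "<instance_dir>_resize" directory for an
    # instance that is being destroyed.
    resize_dir = libvirt_utils.get_instance_path(instance) + '_resize'
    if os.path.exists(resize_dir):
        shutil.rmtree(resize_dir, ignore_errors=True)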

Revision history for this message
George Shuklin (george-shuklin) wrote :

I'm not the developer, just operator.

From my point of view (and my original report) the problem is with scp/rsync, which keeps copying files after the instance is deleted. It keeps the files open even if you remove them, and eats network and CPU.

My naive approach (as if I were a developer) would be:

1. Save the pid of the scp/rsync process in the migration record.
2. If the instance is terminated during migration, mark it as 'aborting' in the migration record.
3. On the source host, for each migration in the aborting state (related to its 'own' host):
a) Kill scp/rsync (by pid)
b) Delete local files
4. On the destination:
a) Delete local files

To keep track of the files on all systems, they should all be recorded in the migration record.

Again, I'm not a developer and I don't know all aspects of the RPC around nova-api/compute, but as an operator I really don't want to see stray scp/rsync processes on compute nodes operating on deleted disks.

Revision history for this message
Andrew Laski (alaski) wrote :

George: that seems to me to be a slightly different issue than what's reported here, unless I've missed a subtlety. I was under the impression that the _resize disks were being left around indefinitely, versus your concern that they're not cleaned up immediately.

There has been talk for quite a while about getting to a point where tasks are cancelable within Nova, but at this point there's very little to work with to make that happen. The solution you've proposed would involve a DB migration and multiple RPC changes while being primarily libvirt-specific. Making that work generally across hypervisors will take a bit more abstraction and effort, and doing that in a backwards-compatible way may end up being a challenge as well.

Revision history for this message
Andrew Laski (alaski) wrote :

For the issue reported here I would propose starting with a simple solution that cleans up any resize disks during the virt driver destroy() call. The xenapi driver already does a similar thing for rescue disks.

For the issue George is discussing, I would look at adding a small store of instance->task mappings to each driver, which it could use to clean up tasks during the destroy() process.

Revision history for this message
George Shuklin (george-shuklin) wrote :

Yes, seems fine. One important note: there are a few methods of copying the disk (rsync via rsyncd, rsync via ssh, scp), and all of them should be terminated properly.

Btw: if you have problems reproducing the setup, I can give access to a real-server laboratory installation with a few compute hosts and scp as the migration method (Juno).

Revision history for this message
Thierry Carrez (ttx) wrote :

@Andrew: do you plan to propose a simpler patch?

Revision history for this message
Tony Breeds (o-tony) wrote :

@ttx Yes, we're working on one. I was blocked by a frustrating set of unrelated bugs.

Revision history for this message
Abhishek Kekane (abhishek-kekane) wrote :

Hi,

Please review the attached patch.

In this patch I have added periodic tasks to clean up instance files left behind on either the source or destination compute node when the instance is deleted during resizing.

This is a much cleaner approach than the previously attached patch, as it doesn't involve adding a new interface to the virt driver and the compute rpcapi.

The attached patch, together with patch [1] which is currently under review, will resolve all the issues mentioned in 'resize_scenarios.xlsx' (see comment #19).

[1] https://review.openstack.org/#/c/154761/
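
(Roughly, the periodic-task shape described above looks like the sketch below. The filter keys, the 600-second spacing, and the _get_deleted_instance() helper are assumptions; in the real patch this logic would live on nova.compute.manager.ComputeManager.)

from nova import objects
from oslo_service import periodic_task


class ResizeCleanupSketch(object):
    # Illustrative only; self.host and self.driver are the usual
    # ComputeManager attributes.

    @periodic_task.periodic_task(spacing=600)
    def _cleanup_incomplete_migrations(self, context):
        # Find errored migrations involving this host (filter keys assumed).
        migrations = objects.MigrationList.get_by_filters(
            context, {'host': self.host, 'status': 'error'})
        for migration in migrations:
            # Hypothetical helper: returns the instance only if it has been
            # deleted (needs a read_deleted-capable context in practice).
            instance = self._get_deleted_instance(context,
                                                  migration.instance_uuid)
            if instance is None:
                continue
            if self.host in (migration.source_compute,
                             migration.dest_compute):
                # Remove leftover files on this node, then mark the migration
                # failed so the next run skips it (the save may need an admin
                # context).
                self.driver.delete_instance_files(instance)
                migration.status = 'failed'
                migration.save()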

Revision history for this message
George Shuklin (george-shuklin) wrote :

Hello.

Thanks for the work. Does this fix cover killing the rsync/scp processes in the case of a 'cold migration'?

Revision history for this message
Tushar Patil (tpatil) wrote :

Hi George:

This patch doesn't address killing orphan rsync/scp processes left behind by the migrate_disk_and_power_off method. I think that fix is outside the scope of the problem we are trying to solve.

Can you please report this issue as a new bug, or else I can do it for you?

Revision history for this message
George Shuklin (george-shuklin) wrote :

I reported this problem (lost 'scp' processes), and it was linked as 'duplicate' here.

See https://bugs.launchpad.net/nova/+bug/1387543

If they differ, the duplication link should be removed and both should be treated as separate vulnerabilities.

@ttx, please clarify.

Revision history for this message
Thierry Carrez (ttx) wrote :

OK, unduplicated

Revision history for this message
Thierry Carrez (ttx) wrote :

@nova-coresec, please review proposed patch

Revision history for this message
Tony Breeds (o-tony) wrote :

This is not a full review, but after some review and testing the patch is looking good to me.

It doesn't eliminate the bug, but it does reduce the severity and allows operators to mitigate it.

I'll do a more in-depth review tomorrow and continue to test.

Michael Still (mikal)
Changed in nova:
assignee: nobody → Michael Still (mikalstill)
[31 comments hidden]
Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

This also needs backports to Juno and Kilo.

Revision history for this message
Dan Smith (danms) wrote :

+2 on the current patch. I didn't examine the tests closely, but if they pass, agree that we can fix them post-merge.

Revision history for this message
Tony Breeds (o-tony) wrote :

The backport to Kilo is trivial, the Juno one is a little more complex.

Tony Breeds (o-tony)
no longer affects: nova/icehouse
Revision history for this message
Tony Breeds (o-tony) wrote :

Juno backport. This isn't a simple cherry-pick, but the changes aren't complex and are due to:
1) tests moving in Kilo
2) in Kilo, nova/objects/base.py has "obj_as_admin()", which is not in Juno

Passes (nova) tox -epy27,pep8
Tempest didn't pass. It failed differently on 3 runs so I'm confused.

I'll try to build a Juno multi-compute stack early next week to perform additional checks.

Revision history for this message
Tony Breeds (o-tony) wrote :

Kilo backport. This is a simple cherry-pick from master.

Passes (nova) tox -epy27,pep8
Tempest didn't pass. It failed differently on 3 runs so I'm confused.

I'll try to build a Kilo multi-compute stack early next week to perform additional checks.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@nova-coresec, please review proposed patch.

Here is the updated impact description (removing Icehouse from affected releases):

Title: Nova instance delete in resize state may leak
Reporter: George Shuklin (Webzilla LTD) and Tushar Patil (NTT DATA, Inc)
Products: Nova
Affects: 2014.2 versions through 2014.2.3, and version 2015.1.0

Description:
George Shuklin from Webzilla LTD and Tushar Patil from NTT DATA, Inc independently reported a vulnerability in Nova resize state. If an authenticated user deletes an instance while it is in resize state, it will cause the original instance to not be deleted from the compute node it was running on. An attacker can use this to launch a denial of service attack. All Nova setups are affected.

Revision history for this message
Grant Murphy (gmurphy) wrote :

+1 impact description.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Should now be "Affects: versions through 2014.2.3, and version 2015.1.0". Other than that, the updated impact description looks good to me.

Changed in ossa:
status: Triaged → In Progress
Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

The patches proposed in comments #70 (liberty), #76 (kilo) and #75 (juno) all pass run_tests.sh here...
Can we please get another nova core review before setting a disclosure date?

Revision history for this message
Tony Breeds (o-tony) wrote :

@Tristan: run_tests.sh or run_tempest.sh ?
@mikal, @dan: Comments on my backports?

summary: Deleting instance while resize instance is running leads to unusable
- compute nodes
+ compute nodes (CVE-2015-3280)
Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote : Re: Deleting instance while resize instance is running leads to unusable compute nodes (CVE-2015-3280)

Tony, I only ran the unit tests (run_tests.sh)

Revision history for this message
Jeremy Stanley (fungi) wrote :

For the impact description in comment #77, let's try to avoid non-vulnerability-related uses of the overloaded term "leak" to reduce confusion. How about switching the title to something like "Nova may fail to delete images in resize state"?

Revision history for this message
Grant Murphy (gmurphy) wrote :

Could we get a nova core to +2 these patches so we can set the disclosure date etc?

Revision history for this message
Dan Smith (danms) wrote :

The backports look okay to me, as best I can tell without seeing test runs (which is not very well). Sounds like Tony is still seeing tempest failures on the Kilo one. Not sure what to say about that.

Revision history for this message
Tushar Patil (tpatil) wrote :

We will run tempest tests against stable/kilo after applying the Kilo cherry-picked patch attached to comment #76 and see if we can find any issues.

Revision history for this message
Rajesh Tailor (rajesh-tailor) wrote :

Hi Tony,

A) I ran tempest tests on stable/kilo without applying the security patch attached in comment #76.
In this case, the following 7 tempest tests were failing.

1) tempest.api.network.test_metering_extensions.MeteringIpV6TestJSON
2) tempest.api.network.test_metering_extensions.MeteringTestJSON
3) tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest
4) tempest.thirdparty.boto.test_s3_buckets.S3BucketsTest.test_create_and_get_delete_bucket[id-4678525d-8da0-4518-81c1-f1f67d595b00]
5) tempest.thirdparty.boto.test_s3_buckets.S3BucketsTest
6) tempest.thirdparty.boto.test_s3_ec2_images.S3ImagesTest
7) tempest.thirdparty.boto.test_s3_objects.S3BucketsTest.test_create_get_delete_object[id-4eea567a-b46a-405b-a475-6097e1faebde]

B) When I ran the tempest tests on stable/kilo after applying the security patch (attached in comment #76), the above-mentioned tests were also failing.

Could you please share which tempest tests were failing in your stable/kilo environment, so that I can analyze whether these tests are failing because of the proposed changes or not?

Revision history for this message
Abhishek Kekane (abhishek-kekane) wrote :

Hi Tony,

Any updates on the failures? Until then, could we get a nova core to +2 these patches?

Revision history for this message
Tony Breeds (o-tony) wrote :

Hi all,
I did some testing yesterday on the Juno and Kilo backports, and for both branches I see a similar (though not identical) set of failures.
I get either 7 or 11 failures; of these I'm ignoring the tempest.thirdparty.boto.* tests. In both cases I don't see a change in the failure rates with the appropriate patch applied.

However, that's all done with an all-in-one devstack setup. I think it's important to test with a multi-node setup (which I'm doing now).

Once I've completed those tests I'll post a rebased set of patches.

Tony.

Revision history for this message
Tony Breeds (o-tony) wrote :

Results from testing

master:
balder:tmp tony8129$ cat master.log-summary
tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_list_migrations_in_flavor_resize_situation
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_auto_to_manual
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_manual_to_auto
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_update_server_from_auto_to_manual
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_resize_server_confirm
tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_server_sequence_suspend_resume
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern

master+patch
balder:tmp tony8129$ cat master-patched.log-summary
tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_list_migrations_in_flavor_resize_situation
tempest.api.compute.servers.test_create_server.ServersTestJSON.test_create_server_with_scheduler_hint_group
tempest.api.compute.servers.test_create_server.ServersTestManualDisk.test_create_server_with_scheduler_hint_group
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_verify_resize_state
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_auto_to_manual
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_manual_to_auto
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_update_server_from_auto_to_manual
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server
tempest.api.compute.servers.test_servers_negative.ServersNegativeTestJSON.test_shelve_shelved_server
tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_resize_server_confirm
tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern

The delta looks like:
--- master.log-summary 2015-08-17 16:35:21.000000000 +1000
+++ master-patched.log-summary 2015-08-17 16:35:21.000000000 +1000
@@ -1,0 +2...


Revision history for this message
Tony Breeds (o-tony) wrote :

Results from testing:

Kilo
balder:tmp tony8129$ cat master-patched.log-summary
tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_list_migrations_in_flavor_resize_situation
tempest.api.compute.servers.test_create_server.ServersTestJSON.test_create_server_with_scheduler_hint_group
tempest.api.compute.servers.test_create_server.ServersTestManualDisk.test_create_server_with_scheduler_hint_group
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_verify_resize_state
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_auto_to_manual
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_manual_to_auto
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_update_server_from_auto_to_manual
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server
tempest.api.compute.servers.test_servers_negative.ServersNegativeTestJSON.test_shelve_shelved_server
tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_resize_server_confirm
tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern

Kilo+patch
balder:tmp tony8129$ cat kilo-patched.log-summary
tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_list_migrations_in_flavor_resize_situation
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern

Delta:
--- kilo.log-summary 2015-08-17 16:35:21.000000000 +1000
+++ kilo-patched.log-summary 2015-08-17 16:35:21.000000000 +1000
@@ -1,5 +1 @@
-tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_verify_resize_state
-tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_manual_to_auto
-tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_update_server_from_auto_to_manual
-tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm
-tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
+tempest.api.compute.admin.test_migrations.MigrationsAdminTest.test_list_migrations_in_flavor_resize_situation
@@ -6,0 +3 @@
+tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance...


Revision history for this message
Tony Breeds (o-tony) wrote :

Juno
balder:tmp tony8129$ cat juno.log-summary
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_manual_to_auto
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_update_server_from_auto_to_manual
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_snapshot_create_with_volume_in_use
tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern

Juno+patch
balder:tmp tony8129$ cat juno-patched.log-summary
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_resize_server_from_manual_to_auto
tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_update_server_from_auto_to_manual
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario
tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_resize_server_confirm
tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern

Delta:
--- juno.log-summary 2015-08-17 16:35:21.000000000 +1000
+++ juno-patched.log-summary 2015-08-17 16:35:21.000000000 +1000
@@ -3,3 +3 @@
-tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm
-tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
-tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_snapshot_create_with_volume_in_use
+tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
@@ -6,0 +5,2 @@
+tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_resize_server_confirm
+tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
@@ -8 +8 @@
-tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern
+tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern

Revision history for this message
Tony Breeds (o-tony) wrote :

I haven't looked at the logs in detail yet as collecting the data took longer than expected.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@nova-coresec, please review last proposed patch

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

All three patches pass ./run_tests.sh here.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

@nova-coresec, we need someone to approve the proposed patch before moving forward on this bug; please review the patches in comments #90, #91 and #92.

Revision history for this message
Michael Still (mikal) wrote :

All three patches look reasonable to me.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Here is the updated impact description (fixing affected version and including comment #83 suggestions):

Title: Nova may fail to delete images in resize state
Reporter: George Shuklin (Webzilla LTD) and Tushar Patil (NTT DATA, Inc)
Products: Nova
Affects: 2014.2 versions through 2014.2.3, and 2015.1 versions through 2015.1.1

Description:
George Shuklin from Webzilla LTD and Tushar Patil from NTT DATA, Inc independently reported a vulnerability in Nova resize state. If an authenticated user deletes an instance while it is in resize state, it will cause the original instance to not be deleted from the compute node it was running on. An attacker can use this to launch a denial of service attack. All Nova setups are affected.

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Proposed disclosure date:
2015-09-01, 1500UTC

Changed in ossa:
status: In Progress → Fix Committed
information type: Private Security → Public Security
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/219299

Changed in nova:
assignee: Michael Still (mikalstill) → Tristan Cacqueray (tristan-cacqueray)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/219300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/219301

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/219299
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=18d6b5cc79973fc553daf7a92f22cce4dc0ca013
Submitter: Jenkins
Branch: master

commit 18d6b5cc79973fc553daf7a92f22cce4dc0ca013
Author: Rajesh Tailor <email address hidden>
Date: Wed Mar 4 05:05:19 2015 -0800

    Delete orphaned instance files from compute nodes

    While resizing/revert-resizing instance, if instance gets deleted
    in between, then instance files remains either on the source or
    destination compute node.

    To address this issue, added a new periodic task
    '_cleanup_incomplete_migrations' which takes care of deleting
    instance files from source/destination compute nodes and then
    mark migration record as failed so that it doesn't appear again
    in the next periodic task run.

    SecurityImpact

    Closes-Bug: 1392527
    Change-Id: I9866d8e32e99b9f907921f4b226edf7b62bd83a7

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-3
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/219300
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5642b17cf4a18543f02a29ddaf4ab1b8d9819b76
Submitter: Jenkins
Branch: stable/kilo

commit 5642b17cf4a18543f02a29ddaf4ab1b8d9819b76
Author: Rajesh Tailor <email address hidden>
Date: Wed Mar 4 05:05:19 2015 -0800

    Delete orphaned instance files from compute nodes

    While resizing/revert-resizing instance, if instance gets deleted
    in between, then instance files remains either on the source or
    destination compute node.

    To address this issue, added a new periodic task
    '_cleanup_incomplete_migrations' which takes care of deleting
    instance files from source/destination compute nodes and then
    mark migration record as failed so that it doesn't appear again
    in the next periodic task run.

    SecurityImpact

    Closes-Bug: 1392527
    Change-Id: I9866d8e32e99b9f907921f4b226edf7b62bd83a7
    (cherry picked from commit 18d6b5cc79973fc553daf7a92f22cce4dc0ca013)

summary: - Deleting instance while resize instance is running leads to unusable
- compute nodes (CVE-2015-3280)
+ [OSSA 2015-017] Deleting instance while resize instance is running leads
+ to unusable compute nodes (CVE-2015-3280)
Matthew Booth (mbooth-9)
Changed in nova:
status: Fix Released → New
Revision history for this message
Matthew Booth (mbooth-9) wrote :

A fix for this issue has been released, but I don't believe it addresses the CVE. The essence of the CVE is that by manipulating a race condition a malicious authenticated user can cause a compute host to run out of disk space. The fix is to periodically clean up the disk space. This fix means that a malicious authenticated user can still waste disk space during the interval of the periodic task. With automation, this is still likely to be a lot.

Contrast this with change Ie03acc00a7c904aec13c90ae6a53938d08e5e0c9 . In this instance, we decided that the ability for a malicious authenticated user to waste disk space for any period of time was a security bug, which we closed. The fix for this CVE is not consistent.

I note that although this race condition is somewhat complicated, it is not described anywhere in either the commit message or code comments of the accepted fix. I will describe it here:

RESIZE:

On src compute:
1. Copy instance files from src to dest
2. Set instance.host == dest
3. Cast finish_resize on dest

On dest compute:
4. Set instance.vm_state = RESIZED

DELETE, on api:
1. If instance.vm_state == RESIZED: confirm_resized()
2. Call terminate_instance on instance.host

If delete is called between resize steps 1 and 2, delete will be called on src because instance.host has not yet been updated. As vm_state != RESIZED, confirm_resized will not be called, and files will be left on dest.

If delete is called between resize steps 2 and 4, delete will be called on dest, because instance.host has been updated. As vm_state != RESIZED, confirm_resized will not be called, and files will be left on src.

It should be possible to fix this properly by closing the race condition. The periodic task might remain, but would be redundant in normal operation. The outline of the fix is (see the sketch after this list):

* Remove the confirm_resized call from the nova api delete path.
* Add a synchronized(instance.uuid) block around RESIZE steps 2 and 3 above (yes, including the cast).
* Add a synchronized(instance.uuid) block around finish_resize.
* Add a check inside the synchronized block in finish_resize for a deleted instance; fail if deleted.
* Add a check inside do_terminate_instance for a current migration. If there is one, also clean up the other node (as the current periodic task does).
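
(A minimal sketch of the per-instance locking outlined above, assuming nova.utils.synchronized() as the lock helper. The "fail if deleted" check and the exact lock placement are assumptions taken from the outline, not an implemented fix.)

from nova import exception
from nova import utils


def finish_resize_serialized(self, context, instance, migration):
    @utils.synchronized(instance.uuid)
    def do_finish_resize():
        # Re-read the instance under the per-instance lock. If the API
        # deleted the instance while we waited, refresh() raises
        # InstanceNotFound (or the row is flagged deleted), which aborts
        # the resize instead of laying down the resized disk.
        instance.refresh()
        if instance.deleted:
            raise exception.InstanceNotFound(instance_id=instance.uuid)
        # ... the existing finish_resize work would run here, under the
        # same lock the delete path would also take ...

    do_finish_resize()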

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Thanks Matthew for digging that out!

Your scenario seems correct in theory, but is it a race someone can win in practice?

Is there a reason why the Nova bug report is still New?

Revision history for this message
Anna Sortland (annasort) wrote :

Are we expecting more code changes in nova to address this CVE?

Revision history for this message
Jeremy Stanley (fungi) wrote :

It looks like Matthew has changed the status of the bug from Fix Released to New in an attempt to solicit further input from nova developers, but ultimately I think this should be opened as a separate bug report. If the described scenario can be confirmed exploitable, then the VMT would end up requesting a new CVE for an incomplete/partial fix instead.

Revision history for this message
Tony Breeds (o-tony) wrote :

I think this can (and should) be moved back to fix released:

the related issues are being tracked in https://bugs.launchpad.net/nova/+bug/1501808

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/219301
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fa72fb8b51d59e04913c871539cee98a3da79058
Submitter: Jenkins
Branch: stable/juno

commit fa72fb8b51d59e04913c871539cee98a3da79058
Author: Rajesh Tailor <email address hidden>
Date: Wed Mar 4 05:05:19 2015 -0800

    Delete orphaned instance files from compute nodes

    While resizing/revert-resizing instance, if instance gets deleted
    in between, then instance files remains either on the source or
    destination compute node.

    To address this issue, added a new periodic task
    '_cleanup_incomplete_migrations' which takes care of deleting
    instance files from source/destination compute nodes and then
    mark migration record as failed so that it doesn't appear again
    in the next periodic task run.

    Conflicts:
     nova/compute/manager.py
     nova/tests/unit/compute/test_compute_mgr.py

    SecurityImpact
    Closes-Bug: 1392527
    Change-Id: I9866d8e32e99b9f907921f4b226edf7b62bd83a7
    (cherry picked from commit 18d6b5cc79973fc553daf7a92f22cce4dc0ca013)

Changed in ossa:
status: Fix Committed → Fix Released
Revision history for this message
Matthew Booth (mbooth-9) wrote :

Hi, I changed the status because I wanted it documented somewhere. I felt that it made more sense to re-open the existing bug than to open a new one, because the original bug still exists. It was quite a lot of work to first determine what problem the original patch was trying to fix [1], and then convince myself that it was not fixed, and I wanted to ensure that was recorded somewhere. I am not, however, about to work on this myself immediately.

[1] Better commit messages, please!

Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-3 → 12.0.0
Thierry Carrez (ttx)
Changed in nova:
status: New → Fix Released
[Displaying the first 40 and last 40 of 111 comments.]