nova orphan instances
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Compute (nova) | In Progress | Wishlist | Yongli He | |
Bug Description
Description
===========
Under some corner conditions, instances might become orphans: Nova is no longer aware that the instance is running on the host.
Steps to reproduce
==================
1) Suppose nova-compute goes down for some reason and, during this downtime, the user deletes the server through the API, so its records are deleted from the DB. nova-compute then comes back up. The guest VM is still running on this compute node and consuming resources.
2) Once a live migration begins, it runs to completion under libvirt's control. If something goes wrong in the underlying infrastructure, e.g. RabbitMQ is dead or the network is badly congested, Nova may fail to delete the instance on the source compute node, or it may attempt a rollback that fails. The same instance then exists on both the source and destination compute nodes. The copy on the source host is a duplicate: it is an orphan instance for the source compute node.
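In both scenarios the result is a guest running on the hypervisor with no matching Nova record for that host. A minimal sketch of that check, assuming two hypothetical inputs (the UUIDs the hypervisor reports as running, and the UUIDs Nova's database assigns to this host); the function name is illustrative and is not Nova's actual code:

```python
def find_orphans(hypervisor_uuids, db_uuids_for_host):
    """Return guest UUIDs present on the hypervisor but unknown to Nova for this host.

    Both arguments are hypothetical inputs: any iterable of UUID strings.
    """
    # An orphan is anything the hypervisor runs that the database does not list.
    return sorted(set(hypervisor_uuids) - set(db_uuids_for_host))


if __name__ == "__main__":
    running = {"uuid-a", "uuid-b", "uuid-c"}  # reported by the hypervisor
    known = {"uuid-a"}                        # recorded in the DB for this host
    print(find_orphans(running, known))
```

Note that this check alone cannot distinguish a true orphan from a guest created deliberately outside Nova.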
Expected result
===============
There should be no orphan instances.
Actual result
=============
Some instances are outside Nova's management.
Environment
===========
Reproducing these conditions is not easy. Refer to the discussion at the Stein meetup:
https:/
Fix
=====
The proposal is to add a periodic task that defines what action is taken when an orphan instance is found. The suggested actions are:
* reap the instance.
* stop the instance.
* log the messages only. [default]
The interval of the periodic task should be configurable.
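The three proposed actions could be dispatched roughly as follows. This is a sketch under stated assumptions, not the proposed patch: `OrphanAction`, `handle_orphans`, and the `driver` object with `power_off` and `destroy` methods are all hypothetical names introduced for illustration, and the scheduling of the task at a configurable interval is left out.

```python
import logging
from enum import Enum

LOG = logging.getLogger(__name__)


class OrphanAction(Enum):
    """Hypothetical option values mirroring the three proposed actions."""
    LOG = "log"    # log the messages only (proposed default)
    STOP = "stop"  # stop the instance
    REAP = "reap"  # reap (destroy) the instance


def handle_orphans(orphan_uuids, action=OrphanAction.LOG, driver=None):
    """Apply the configured action to each orphan and return what was done.

    `driver` is a hypothetical object exposing power_off(uuid) and
    destroy(uuid); it is never touched when the action is LOG.
    """
    handled = []
    for uuid in orphan_uuids:
        # Every mode logs, so the operator always sees what was detected.
        LOG.warning("Orphan instance %s detected; action=%s", uuid, action.value)
        if action is OrphanAction.STOP:
            driver.power_off(uuid)
        elif action is OrphanAction.REAP:
            driver.destroy(uuid)
        handled.append((uuid, action.value))
    return handled
```

Keeping log-only as the default means destructive behaviour requires an explicit change by the operator.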
This was previously proposed as a blueprint but is better classified as a bug. Refer to:
https:/
Changed in nova: | |
assignee: | nobody → Yongli He (yongli-he) |
status: | New → In Progress |
tags: | added: compute starlingx |
Changed in nova: | |
assignee: | Yongli He (yongli-he) → Eric Fried (efried) |
Changed in nova: | |
assignee: | Eric Fried (efried) → Yongli He (yongli-he) |
Changed in nova: | |
assignee: | Yongli He (yongli-he) → Eric Fried (efried) |
Changed in nova: | |
assignee: | Eric Fried (efried) → Yongli He (yongli-he) |
Note that having nova be able to clean up instances it doesn't know about (nova-manage db archive_deleted_rows on deleted!=0 instances) is a fundamental shift from how things have historically been done. Today, nova will not touch instances that do not have records in the database. Scenario: ops engineer creates a libvirt domain on a compute host out-of-band from nova in order to do some testing -- nova-compute will not touch it today. If we make a change to be able to reap them, nova-compute could destroy that testing libvirt domain.
I'm not saying it's necessarily bad, but it's different than what we've been doing. For this reason (and the fact that it will introduce config options), I think it should have a release note.
Also noting that the scenario I describe above seems alleviated by the fact that the default config in the proposed change would be to only log messages if orphans are detected. This way, an operator has to opt-in to having nova-compute destroy instances it doesn't know about and thus engineers in their org should know they can't do out-of-band tests like the one described in the scenario.