[RC] libvirt instance definitions not removed
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | justinsb |
Bug Description
In my recent patch to make sure that libvirt instances didn't disappear on reboot, I changed it so that definitions were persistent. However, I didn't consider the consequences of leaving definitions around.
Koji reported the following issues on an MP; I'm pasting them here so that we can track them as a bug and I can work on them:
(1) euca-reboot-
you need to apply Brian's patch before reproducing this issue.
reboot() simply calls the following code:
self.destroy(
_create_new_domain causes the following exception because the domain is already defined.
libvir: Domain Config error : operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-
2011-04-09 10:29:49,276 ERROR nova.exception [-] Uncaught exception
(nova.exception): TRACE: Traceback (most recent call last):
(nova.exception): TRACE: File "/home/
(nova.exception): TRACE: return f(*args, **kw)
(nova.exception): TRACE: File "/home/
(nova.exception): TRACE: self._create_
(nova.exception): TRACE: File "/home/
(nova.exception): TRACE: domain = self._conn.
(nova.exception): TRACE: File "/usr/lib/
(nova.exception): TRACE: if ret is None:raise libvirtError(
(nova.exception): TRACE: libvirtError: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-
(nova.exception): TRACE:
2011-04-09 10:29:49,286 ERROR nova [-] Exception during message handling
(nova): TRACE: Traceback (most recent call last):
(nova): TRACE: File "/home/
(nova): TRACE: rval = node_func(
(nova): TRACE: File "/home/
(nova): TRACE: return f(*args, **kw)
(nova): TRACE: File "/home/
(nova): TRACE: function(self, context, instance_id, *args, **kwargs)
(nova): TRACE: File "/home/
(nova): TRACE: self.driver.
(nova): TRACE: File "/home/
(nova): TRACE: raise Error(str(e))
(nova): TRACE: Error: operation failed: domain 'instance-00000002' already exists with uuid a3a56e76-
(nova): TRACE:
(2) It seems that no code calls 'undefine' on the domain XML, so the domain XML is never removed.
For example:
root@ubuntu:
Id Name State
-------
5 instance-00000001 running
root@ubuntu:
root@ubuntu:
Id Name State
-------
- instance-00000001 shut off
root@ubuntu:
I think we could undefine the XML definition when we terminate instance-00000001.
Lastly, I have not checked whether rescue mode is working. Does anyone know whether rescue mode works properly now?
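The two reported issues can be illustrated with a small in-memory model. This is only a sketch: `ToyHypervisor`, `DomainExistsError`, `naive_reboot` and `reboot_with_undefine` are hypothetical names invented for illustration, not nova or libvirt code. It assumes that destroying a persistent domain stops it but leaves its definition behind, which matches the "shut off" listing above.

```python
class DomainExistsError(Exception):
    """Stand-in for the libvirtError seen in the trace (hypothetical)."""


class ToyHypervisor:
    """Hypothetical in-memory model of persistent domain definitions."""

    def __init__(self):
        self.defined = set()  # names that have a persistent definition
        self.running = set()  # names that are currently running

    def define_and_start(self, name):
        # Mirrors issue (1): defining an already-defined name is an error.
        if name in self.defined:
            raise DomainExistsError("domain '%s' already exists" % name)
        self.defined.add(name)
        self.running.add(name)

    def destroy(self, name):
        # Stops the domain but keeps its definition (it becomes "shut off").
        self.running.discard(name)

    def undefine(self, name):
        # Removes the persistent definition.
        self.defined.discard(name)

    def list_all(self):
        # Rough analogue of a listing that includes inactive domains.
        return {
            name: ("running" if name in self.running else "shut off")
            for name in sorted(self.defined)
        }


def naive_reboot(hv, name):
    # Destroy, then recreate: fails because the definition is still there.
    hv.destroy(name)
    hv.define_and_start(name)


def reboot_with_undefine(hv, name):
    # Same flow with the extra undefine step proposed in (2).
    hv.destroy(name)
    hv.undefine(name)
    hv.define_and_start(name)
```

In this model, `naive_reboot` raises `DomainExistsError` and leaves the instance listed as shut off, while `reboot_with_undefine` brings it back to running.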
Related branches
- Sandy Walsh (community): Approve
- Vish Ishaya (community): Approve
Diff: 66 lines (+50/-5), 1 file modified: nova/virt/libvirt_conn.py (+50/-5)
Changed in nova: | |
assignee: | nobody → justinsb (justin-fathomdb) |
summary: | libvirt instance definitions not removed → [RC] libvirt instance definitions not removed |
Changed in nova: | |
milestone: | none → cactus-rc |
importance: | Undecided → High |
status: | New → In Progress |
Changed in nova: | |
status: | In Progress → Fix Committed |
Changed in nova: | |
milestone: | cactus-rc → 2011.2 |
status: | Fix Committed → Fix Released |
Requesting Gamma Freeze exemption...
Benefit: Without this, instance reboot on libvirt-backed instances will not work (because it deletes the domain and recreates it; it probably shouldn't do that anyway, but we can't fix that in Cactus). Any function that involves deleting a domain is likely to be broken without it (e.g. recovery), and in addition deleted domains accumulate in libvirt (visible in virsh list --all).
Risk of regression: Moderate. This is not a trivial fix, but it's not super complicated either: it just adds one extra call to "undefine". That one call expands into many lines of code because it has to cope with a domain that is shut off but not deleted, so we can't keep the naive error handling. Mitigating factors:
1) Testing against my own install using KVM, including with instances in the 'stuck' state (shut down but still defined)
2) Very careful error handling code (which we probably should have throughout the libvirt code anyway)
3) Making the new behaviour as close as possible to the old behaviour (e.g. I would like to see restart reuse the domain definition, because then I think e.g. volume attachments would persist; however that would put a much higher workload on QA)
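As a rough illustration of why one extra "undefine" call needs extra error handling, here is a minimal sketch. It is not the code from the merged branch. The method names `lookupByName`, `destroy` and `undefine` follow the libvirt Python bindings, but `remove_domain`, `FakeLibvirtError`, `FakeDomain` and `FakeConn` are hypothetical stand-ins so the example runs without libvirt installed. With a real connection one would presumably pass `libvirt.libvirtError` as `error_cls`.

```python
class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError (assumption, for illustration)."""


def remove_domain(conn, name, error_cls=FakeLibvirtError):
    """Stop and undefine a domain, tolerating one that is already shut off.

    Returns True if a definition was removed, False if none existed.
    """
    try:
        dom = conn.lookupByName(name)
    except error_cls:
        return False  # no such domain: nothing to clean up
    try:
        dom.destroy()  # stop the domain if it is running
    except error_cls:
        # Assumed here to mean "already shut off". Real code should
        # inspect the error instead of swallowing every failure.
        pass
    dom.undefine()  # remove the persistent definition
    return True


class FakeDomain:
    """Hypothetical domain object used only to exercise the sketch."""

    def __init__(self, conn, name, running):
        self._conn = conn
        self._name = name
        self.running = running

    def destroy(self):
        if not self.running:
            raise FakeLibvirtError("domain is not running")
        self.running = False

    def undefine(self):
        del self._conn.domains[self._name]


class FakeConn:
    """Hypothetical connection holding defined domains by name."""

    def __init__(self):
        self.domains = {}

    def add(self, name, running):
        self.domains[name] = FakeDomain(self, name, running)

    def lookupByName(self, name):
        try:
            return self.domains[name]
        except KeyError:
            raise FakeLibvirtError("domain not found: %s" % name)
```

The 'stuck' case from mitigating factor 1 corresponds to `conn.add(name, running=False)`: `destroy()` fails, yet `remove_domain` still undefines the domain.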