Nova instance does not boot after host restart but is still shown as Running

Bug #1408176 reported by Zhenzan Zhou
Affects                    Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)   Expired       Undecided   Unassigned
Juno                       Fix Released  Undecided   Unassigned

Bug Description

The nova host lost power. After it restarted, the previously running instance is still shown in
the "Running" state but has not actually been started:

root@allinone-controller0-esenfmnxzcvk:~# nova list
+--------------------------------------+------------------------------------+--------+------------+-------------+---------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------------------------------------+--------+------------+-------------+---------------------------------------+
| 13d9eead-191e-434e-8813-2d3bf8d3aae4 | alexcloud-controller0-rr5kdtqmv7qz | ACTIVE | - | Running | default-net=172.16.0.15, 30.168.98.61 |
+--------------------------------------+------------------------------------+--------+------------+-------------+---------------------------------------+
root@allinone-controller0-esenfmnxzcvk:~# ps -ef |grep -i qemu
root 95513 90291 0 14:46 pts/0 00:00:00 grep --color=auto -i qemu

Please note the resume_guests_state_on_host_boot flag is False. Log file is attached.

Revision history for this message
Zhenzan Zhou (zhenzan-zhou) wrote :
Alex Xu (xuhj)
Changed in nova:
assignee: nobody → Alex Xu (xuhj)
Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

I have a customer reporting a slightly different issue: resume_guests_state_on_host_boot=True, but it doesn't matter because the power_state from the driver (libvirt) is the same as the power_state in the database, so the resume is skipped in _init_instance() in the compute manager.

The recreate is:

- create an instance, wait for it to be ACTIVE
- reboot the host/hypervisor
  - libvirt will shut down the guest, which emits the STOPPED lifecycle event from the libvirt
    driver to the compute manager
  - the handle_lifecycle_event code in the compute manager will call _sync_instance_power_state,
    which sees that the vm_state is ACTIVE but the vm_power_state (from the driver) is 4 (shutdown),
    so it stops the instance via the stop API, which sets the vm_state to STOPPED
- once the host/hypervisor is back up, libvirt will automatically restart the guest VM, which
  triggers a lifecycle event from the libvirt driver to the compute manager. Since the vm_state is
  STOPPED from before, nova assumes the user wants the instance to be stopped and calls the stop
  API to shut down the instance again.

Even with resume_guests_state_on_host_boot=True, the libvirt driver's resume_state_on_host_boot method will ignore the operation if the guest VM is already running.
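
The sequence above can be summarised with a minimal Python sketch (simplified names and power-state constants; not the actual ComputeManager code):

    # Minimal sketch of the power-state sync behaviour described above.
    # Names, constants and the stop_api callable are simplified assumptions,
    # not the actual nova ComputeManager implementation.

    RUNNING, SHUTDOWN = 1, 4                 # power states as reported by the driver
    ACTIVE, STOPPED = 'active', 'stopped'    # vm_states stored in the database

    def sync_instance_power_state(instance, driver_power_state, stop_api):
        """Reconcile the DB vm_state with what the hypervisor reports."""
        if instance.vm_state == ACTIVE and driver_power_state == SHUTDOWN:
            # Host going down: libvirt stopped the guest, so nova calls the
            # stop API, which records vm_state=STOPPED in the database.
            stop_api(instance)
        elif instance.vm_state == STOPPED and driver_power_state == RUNNING:
            # Host back up: libvirt auto-restarted the guest, but the DB says
            # STOPPED, so nova assumes the user wants it off and stops it again.
            stop_api(instance)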

Revision history for this message
Tan Lin (tan-lin-good) wrote :

I tried to reproduce this bug, but it works fine in my environment.
I created an all-in-one node on my PC running Ubuntu 14.04.

1. resume_guests_state_on_host_boot = False
 - create an instance and it's ACTIVE
 - reboot the host
 - once the host is up, the instance is ACTIVE at first, but then becomes SHUTOFF in the end
 - the instance can then be started

2. resume_guests_state_on_host_boot = True
 - create an instance and it's ACTIVE
 - reboot the host
 - once the host is up, the instance becomes ACTIVE

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Tan, it's racy; see some other details in bug 1293480.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Reported bug 1444630 for comment 2.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/174069

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/174069
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0
Submitter: Jenkins
Branch: master

commit d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 15 11:51:26 2015 -0700

    compute: stop handling virt lifecycle events in cleanup_host()

    When rebooting a compute host, guest VMs can be shut down automatically
    by the hypervisor, and the virt driver sends events to the compute
    manager to handle them. If the compute service is still up
    while this happens it will try to call the stop API to power off the
    instance and update the database to show the instance as stopped.

    When the compute service comes back up and events come in from the virt
    driver that the guest VMs are running, nova will see that the vm_state
    on the instance in the nova database is STOPPED and shut down the
    instance by calling the stop API (basically ignoring what the virt
    driver / hypervisor tells nova is the state of the guest VM).

    Alternatively, if the compute service shuts down after changing the
    instance task_state to 'powering-off' but before the stop API cast is
    complete, the instance can be in a strange vm_state/task_state
    combination that requires the admin to manually reset the task_state to
    recover the instance.

    Let's just try to avoid some of this mess by disconnecting the event
    handling when the compute service is shutting down like we do for
    neutron VIF plugging events. There could still be races here if the
    compute service is shutting down after the hypervisor (e.g. libvirtd),
    but this is at least a best-effort attempt to mitigate the potential
    damage.

    Closes-Bug: #1444630
    Related-Bug: #1293480
    Related-Bug: #1408176

    Change-Id: I1a321371dff7933cdd11d31d9f9c2a2f850fd8d9
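
The mitigation described in the commit message can be pictured with a rough Python sketch (class, method and attribute names are illustrative assumptions, not the actual nova/compute/manager.py code):

    # Rough sketch of the mitigation described above: stop listening to virt
    # lifecycle events while the compute service shuts down. The class and
    # method names here are illustrative assumptions, not the real manager.

    class ComputeManagerSketch(object):
        def __init__(self, driver):
            self.driver = driver

        def init_host(self):
            # Normal operation: the virt driver forwards guest lifecycle
            # events (started/stopped) so power state can be kept in sync.
            self.driver.register_event_listener(self.handle_lifecycle_event)

        def cleanup_host(self):
            # Service shutdown: detach the listener first, so events emitted
            # while the hypervisor stops guests no longer trigger stop API calls.
            self.driver.register_event_listener(None)

        def handle_lifecycle_event(self, event):
            # Would call the power-state sync logic sketched earlier.
            pass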

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/174477

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/174477
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b19764d2c6a8160102a806c1d6811c4182a8bac8
Submitter: Jenkins
Branch: stable/kilo

commit b19764d2c6a8160102a806c1d6811c4182a8bac8
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 15 11:51:26 2015 -0700

    compute: stop handling virt lifecycle events in cleanup_host()

    When rebooting a compute host, guest VMs can be shut down automatically
    by the hypervisor, and the virt driver sends events to the compute
    manager to handle them. If the compute service is still up
    while this happens it will try to call the stop API to power off the
    instance and update the database to show the instance as stopped.

    When the compute service comes back up and events come in from the virt
    driver that the guest VMs are running, nova will see that the vm_state
    on the instance in the nova database is STOPPED and shut down the
    instance by calling the stop API (basically ignoring what the virt
    driver / hypervisor tells nova is the state of the guest VM).

    Alternatively, if the compute service shuts down after changing the
    instance task_state to 'powering-off' but before the stop API cast is
    complete, the instance can be in a strange vm_state/task_state
    combination that requires the admin to manually reset the task_state to
    recover the instance.

    Let's just try to avoid some of this mess by disconnecting the event
    handling when the compute service is shutting down like we do for
    neutron VIF plugging events. There could still be races here if the
    compute service is shutting down after the hypervisor (e.g. libvirtd),
    but this is at least a best-effort attempt to mitigate the potential
    damage.

    Closes-Bug: #1444630
    Related-Bug: #1293480
    Related-Bug: #1408176

    Change-Id: I1a321371dff7933cdd11d31d9f9c2a2f850fd8d9
    (cherry picked from commit d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0)

tags: added: in-stable-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/179284

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/179284
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5228d4e418734164ffa5ccd91d2865d9cc659c00
Submitter: Jenkins
Branch: master

commit 906ab9d6522b3559b4ad36d40dec3af20397f223
Author: He Jie Xu <email address hidden>
Date: Thu Apr 16 07:09:34 2015 +0800

    Update rpc version aliases for kilo

    Update all of the rpc client API classes to include a version alias
    for the latest version implemented in Kilo. This alias is needed when
    doing rolling upgrades from Kilo to Liberty. With this in place, you can
    ensure all services only send messages that both Kilo and Liberty will
    understand.

    Closes-Bug: #1444745

    Conflicts:
     nova/conductor/rpcapi.py

    NOTE(alex_xu): The conflict is due to there are some logs already added
    into the master.

    Change-Id: I2952aec9aae747639aa519af55fb5fa25b8f3ab4
    (cherry picked from commit 78a8b5802ca148dcf37c5651f75f2126d261266e)
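
The alias mechanism described in that commit roughly follows this pattern (a hypothetical sketch; the dictionary contents and helper are assumptions, not the actual nova rpcapi code):

    # Hypothetical sketch of the rpc version-alias pattern described above.
    # The alias values and helper are illustrative assumptions, not the
    # actual nova rpcapi classes.

    VERSION_ALIASES = {
        'kilo': '4.0',   # example: newest rpc version implemented in Kilo
    }

    def pick_rpc_version(upgrade_level, latest='4.1'):
        """Cap outgoing messages at a version an older release understands."""
        if upgrade_level is None:
            return latest
        # An operator may pin either a release name ('kilo') or a raw version.
        return VERSION_ALIASES.get(upgrade_level, upgrade_level)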

commit f191a2147a21c7e50926b288768a96900cf4c629
Author: Hans Lindgren <email address hidden>
Date: Fri Apr 24 13:10:39 2015 +0200

    Add security group calls missing from latest compute rpc api version bump

    The recent compute rpc api version bump missed out on the security group
    related calls that are part of the api.

    One possible reason is that both the compute and security group client-side
    rpc APIs share a single target, which is of little value and only causes
    mistakes like this.

    This change eliminates future problems like this by combining them into
    one to get a 1:1 relationship between the client and server APIs.

    Change-Id: I9207592a87fab862c04d210450cbac47af6a3fd7
    Closes-Bug: #1448075
    (cherry picked from commit bebd00b117c68097203adc2e56e972d74254fc59)

commit a2872a9262985bd0ee2c6df4f7593947e0516406
Author: Dan Smith <email address hidden>
Date: Wed Apr 22 09:02:03 2015 -0700

    Fix migrate_flavor_data() to catch instances with no instance_extra rows

    The way the query was being performed previously, we would not see any
    instances that didn't have a row in instance_extra. This could happen if
    an instance hasn't been touched for several releases, or if the data
    set is old.

    The fix is a simple change to use outerjoin instead of join. This patch
    includes a test that ensures that instances with no instance_extra rows
    are included in the migration. If we query an instance without such a
    row, we create it before doing a save on the instance.

    Closes-Bug: #1447132
    Change-Id: I2620a8a4338f5c493350f26cdba3e41f3cb28de7
    (cherry picked from commit 92714accc49e85579f406de10ef8b3b510277037)
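
The join vs. outerjoin difference that commit describes can be sketched in SQLAlchemy terms (the Instance/InstanceExtra models and session are assumptions, not the actual nova DB API code):

    # Minimal sketch of the difference described above: an inner join drops
    # instances with no instance_extra row, while an outer join keeps them
    # (with the extra columns as NULL) so the migration can create the row.
    # The Instance/InstanceExtra models and the session are assumptions.

    def instances_missing_extra(session, Instance, InstanceExtra):
        return (session.query(Instance)
                       .outerjoin(InstanceExtra,
                                  InstanceExtra.instance_uuid == Instance.uuid)
                       .filter(InstanceExtra.id.is_(None))   # no matching row
                       .all())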

commit e3a7b83834d1ae2064094e9613df75e3b07d77cd
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Apr 23 02:18:41 2015 +0000

    Updated from global requirements

    Change-Id: I5d4acd36329fe2dccb5772fed3ec55b442597150

commit 8c9b5e620eef3233677b64cd234ed2551e6aa182
Author: Divya <email address hidden>
Date: Tue Apr 21 08:26:29 2015 +0200

    Control create/delete flavor api permissions using policy.json

    The permissions of ...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/192244

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/192244
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7bc4be781564c6b9e7a519aecea84ddbee6bd935
Submitter: Jenkins
Branch: stable/juno

commit 7bc4be781564c6b9e7a519aecea84ddbee6bd935
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 15 11:51:26 2015 -0700

    compute: stop handling virt lifecycle events in cleanup_host()

    When rebooting a compute host, guest VMs can be shut down automatically
    by the hypervisor, and the virt driver sends events to the compute
    manager to handle them. If the compute service is still up
    while this happens it will try to call the stop API to power off the
    instance and update the database to show the instance as stopped.

    When the compute service comes back up and events come in from the virt
    driver that the guest VMs are running, nova will see that the vm_state
    on the instance in the nova database is STOPPED and shut down the
    instance by calling the stop API (basically ignoring what the virt
    driver / hypervisor tells nova is the state of the guest VM).

    Alternatively, if the compute service shuts down after changing the
    instance task_state to 'powering-off' but before the stop API cast is
    complete, the instance can be in a strange vm_state/task_state
    combination that requires the admin to manually reset the task_state to
    recover the instance.

    Let's just try to avoid some of this mess by disconnecting the event
    handling when the compute service is shutting down like we do for
    neutron VIF plugging events. There could still be races here if the
    compute service is shutting down after the hypervisor (e.g. libvirtd),
    but this is at least a best-effort attempt to mitigate the potential
    damage.

    Closes-Bug: #1444630
    Related-Bug: #1293480
    Related-Bug: #1408176

    Conflicts:
     nova/compute/manager.py
     nova/tests/unit/compute/test_compute_mgr.py

    Change-Id: I1a321371dff7933cdd11d31d9f9c2a2f850fd8d9
    (cherry picked from commit d1fb8d0fbdd6cb95c43b02f754409f1c728e8cd0)

tags: added: in-stable-juno
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
assignee: Alex Xu (xuhj) → nobody
status: Confirmed → Expired