Mirantis OpenStack

Instance reach ERROR status after resetting compute

Bug #1608446 reported by Georgy Dyuldin on 2016-08-01

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Status tracked in 10.0.x
10.0.x	Fix Committed	Medium	Georgy Dyuldin	Mirantis OpenStack 10.0
9.x	Fix Released	Medium	Georgy Dyuldin	Mirantis OpenStack 9.1

Bug Description

Detailed bug description:
After hard resetting compute with booted instance it reach ERROR status

Steps to reproduce:
1. Deploy MOS 9.1 with 3 controllers, 2 computes, 1 cinder nodes
2. Create 2 networks, 2 subnets, router between them
3. Create 2 Cirros instances on computes (1 instance on 1 compute), wait until it will be booted
4. Reset computes (with virsh reset)
5. Wait until computes will be booted
6. Wait some time (about 5 minutes)
7. Check that instances in ACTIVE status and operable

Expected results:
All steps pass without errors, both instances in ACTIVE status at step 7

Actual result:
One of instances in ERROR status on step 7

Reproducibility:
About 30% of cases

Description of the environment:
- Operation system: Ubuntu
- Versions of components: MOS 9.1 snapshot #76
- Reference architecture: 3 controllers, 2 computes, 1 cinder nodes
- Network model: VLAN + DVR

Additional information:
Logs: http://paste.openstack.org/show/545085/

Tags:

Revision history for this message

Georgy Dyuldin (g-dyuldin) wrote on 2016-08-01:

fuel-snapshot-2016-08-01_09-32-14.tar.gz Edit (53.1 MiB, application/x-tar)

Vitaly Sedelnik (vsedelnik) on 2016-08-01

Changed in mos:
status:	New → Confirmed
importance:	Undecided → High
assignee:	nobody → MOS Nova (mos-nova)

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-08-01:

Based on the similar bug reports:

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1384532
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1513367

the error can actually be misleading and does not necessarily mean /usr/bin/kvm was not found, but rather one of the files used by qemu on start.

We should give it a try on a live environment and check dmesg / perf_events output.

Revision history for this message

Georgy Dyuldin (g-dyuldin) wrote on 2016-08-02:

Download full text (3.6 KiB)

Founded reason and WA.

root@node-1:~# cat /var/log/libvirt/qemu/instance-00000016.log
2016-08-02 08:07:36.000+0000: starting up
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name instance-00000016 -S -machine pc-i440fx-vivid,accel=kvm,usb=off -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 82c4bd46-8658-4670-a804-9d95b3d3963d -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=13.0.0,serial=2b6c635d-89f4-4bbd-90fb-cefa85225b75,uuid=82c4bd46-8658-4670-a804-9d95b3d3963d,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000016.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/82c4bd46-8658-4670-a804-9d95b3d3963d/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=directsync -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/nova/instances/82c4bd46-8658-4670-a804-9d95b3d3963d/disk.config,if=none,id=drive-virtio-disk25,format=raw,cache=directsync -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk25,id=virtio-disk25 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:80:2b:7c,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/82c4bd46-8658-4670-a804-9d95b3d3963d/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
Domain id=2 is tainted: high-privileges
libvirt: error : unable to set AppArmor profile 'libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d' for '/usr/bin/kvm': No such file or directory
2016-08-02 08:07:36.004+0000: shutting down

Look at /etc/apparmor.d/libvirt/

root@node-1:~# ls -l /etc/apparmor.d/libvirt/
total 36
-rw-r--r-- 1 root root 0 Aug 2 08:06 libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d
-rw-r--r-- 1 root root 931 Aug 2 08:07 libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d.files
-rw-r--r-- 1 root root 266 Aug 2 08:47 libvirt-83159cad-2bc3-4130-980b-f782d1afd92b
-rw-r--r-- 1 root root 952 Aug 2 08:47 libvirt-83159cad-2bc3-4130-980b-f782d1afd92b.files
-rw-r--r-- 1 root root 265 Aug 1 23:20 libvirt-c3c5c167-215c-45c6-be6b-10389c3fb06b
-rw-r--r-- 1 root root 931 Aug 1 23:21 libvirt-c3c5c167-215c-45c6-be6b-10389c3fb06b.files
-rw-r--r-- 1 root root 265 Aug 1 23:24 libvirt-f907d277-b401-4f22-9a7d-0b211c66bdde
-rw-r--r-- 1 root root 952 Aug 1 23:24 libvirt-f907d277-b401-4f22-9a7d-0b211c66bdde.files
-rw-r--r-- 1 root root 314 Jun 3 18:02 TEMPLATE.lxc
-rw-r--r-- 1 root root 164 Jun 3 18:02 TEMPLATE.qemu

File /etc/apparmor.d/libvirt/libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d has zero size.

An...

Founded reason and WA.

root@node-1:~# cat /var/log/libvirt/qemu/instance-00000016.log 
2016-08-02 08:07:36.000+0000: starting up
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name instance-00000016 -S -machine pc-i440fx-vivid,accel=kvm,usb=off -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 82c4bd46-8658-4670-a804-9d95b3d3963d -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=13.0.0,serial=2b6c635d-89f4-4bbd-90fb-cefa85225b75,uuid=82c4bd46-8658-4670-a804-9d95b3d3963d,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000016.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/82c4bd46-8658-4670-a804-9d95b3d3963d/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=directsync -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/nova/instances/82c4bd46-8658-4670-a804-9d95b3d3963d/disk.config,if=none,id=drive-virtio-disk25,format=raw,cache=directsync -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk25,id=virtio-disk25 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:80:2b:7c,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/82c4bd46-8658-4670-a804-9d95b3d3963d/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
Domain id=2 is tainted: high-privileges
libvirt:  error : unable to set AppArmor profile 'libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d' for '/usr/bin/kvm': No such file or directory
2016-08-02 08:07:36.004+0000: shutting down

Look at /etc/apparmor.d/libvirt/

root@node-1:~# ls -l /etc/apparmor.d/libvirt/
total 36
-rw-r--r-- 1 root root   0 Aug  2 08:06 libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d
-rw-r--r-- 1 root root 931 Aug  2 08:07 libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d.files
-rw-r--r-- 1 root root 266 Aug  2 08:47 libvirt-83159cad-2bc3-4130-980b-f782d1afd92b
-rw-r--r-- 1 root root 952 Aug  2 08:47 libvirt-83159cad-2bc3-4130-980b-f782d1afd92b.files
-rw-r--r-- 1 root root 265 Aug  1 23:20 libvirt-c3c5c167-215c-45c6-be6b-10389c3fb06b
-rw-r--r-- 1 root root 931 Aug  1 23:21 libvirt-c3c5c167-215c-45c6-be6b-10389c3fb06b.files
-rw-r--r-- 1 root root 265 Aug  1 23:24 libvirt-f907d277-b401-4f22-9a7d-0b211c66bdde
-rw-r--r-- 1 root root 952 Aug  1 23:24 libvirt-f907d277-b401-4f22-9a7d-0b211c66bdde.files
-rw-r--r-- 1 root root 314 Jun  3 18:02 TEMPLATE.lxc
-rw-r--r-- 1 root root 164 Jun  3 18:02 TEMPLATE.qemu

File /etc/apparmor.d/libvirt/libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d has zero size.

Another file has next content:

root@node-1:~# cat /etc/apparmor.d/libvirt/libvirt-83159cad-2bc3-4130-980b-f782d1afd92b
#
# This profile is for the domain whose UUID matches this file.
#

#include <tunables/global>

profile libvirt-83159cad-2bc3-4130-980b-f782d1afd92b {
  #include <abstractions/libvirt-qemu>
  #include <libvirt/libvirt-83159cad-2bc3-4130-980b-f782d1afd92b.files>

}

After coping this content to /etc/apparmor.d/libvirt/libvirt-82c4bd46-8658-4670-a804-9d95b3d3963d with replacing uuid, virsh can start this instance.

Sergey Shevorakov (sshevorakov) on 2016-08-02

tags:

added: regression

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-08-08:

Georgy, could you please enable logging in /etc/libvirt/libvirtd.conf (note the *d* suffix - this is the config for the daemon part of libvirt):

log_level = 1
log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt.log"

restart libvirtd:

sudo service libvirtd restart

and give it another try.

I can't reproduce it on my environment: all 10 booted instance on 2 compute nodes are always resumed properly after hard reset via virt-manager. This seems to be a libvirt issue at this point (nova only tells libvirtd to start a domain, libvirtd is the one that writes the apparmor profile to the file system).

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-08-08:

Is it an automated test case? Are you sure you do not restart libvirtd somewhere in the middle?

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-08-10:

So Anna Babich managed to reproduce this on another environment and this looks interesting.

As Georgy pointed out, the corresponding apparmor profile file is empty after reboot:

http://paste.openstack.org/show/553601/

Pay attention to the file modification time, which says "12.47". The error on start of the VM is the same:

http://paste.openstack.org/raw/553602/

However, if you look at the kern.log, you'll see that the compute node was not up until 12.48:

http://paste.openstack.org/show/553609/

which effectively means the apparmor profile file was created *before* the node was rebooted.

Note there is a number of "^@" symbols in the kern.log - this is how less util displays 0 bytes.

What happens here is that a node is stopped abruptly and the FS buffers are not properly flushed to disks, which basically means some data may get corrupted or be missing after the node gets back online.

This is also confirmed by the nova-compute.log:

http://paste.openstack.org/show/553604/
http://paste.openstack.org/show/553610/ (note ^@)

^ there are no log entries that instance 88d7b107-a956-4949-9b2f-5166684f3fe5 was ever *created* on this node, only that nova tried to resume it after restart of the node.

I think what happens here is that you try to boot a VM in the automated test case, then you stop the node abruptly before libvirt flushes FS cache buffers to disks and that's how the apparmor profile file happens to be empty on the node start, which ultimately fails the VM start.

While we could probably "harden" libvirt to overwrite the profile on domain start, I'd argue that you should modify the test case to do a graceful restart instead (shutdown -r), so that all the processes receive TERM signals and get a chance to flush the FS buffers and do other kind of cleanups before termination.

Changed in mos:
assignee:	MOS Nova (mos-nova) → Georgy Dyuldin (g-dyuldin)

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-08-22:

Hi Roman, the idea of the test case is to simulate random hard reboot of hardware servers and we can't just replace this step with graceful restart.

What is the expected behaviour for the scenario with hard restart oh compute nodes? Should VMs be active when compute will be online again?

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-08-22:

Timur,

They should and they *do* become active after hard reset, *unless* the corresponding apparmor profiles are empty after reboot (if FS cache buffers *are not* flushed to disks in time).

I've prepared a patch for libvirt here:

https://review.fuel-infra.org/#/c/24954/

you can give the built packages a try:

http://perestroika-repo-tst.infra.mirantis.net/review/CR-24954/mos-repos/xenial/master/

^ this should help make sure apparmor profiles content is synced to disks early or simply recreated on node boot if they still happen to be empty.

On the other hand, you should understand, that the very same thing happens to VMs disks: if the host node is accidentally reset, then some content may be missing on VMs start. You should account for that.

Timur Nurlygayanov (tnurlygayanov) on 2016-08-22

Changed in mos:
assignee:	Georgy Dyuldin (g-dyuldin) → Roman Podoliaka (rpodolyaka)
status:	Confirmed → In Progress

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-08-22:

Georgy, could you please test the packages above ^ on your environment? (you'll need to update libvirt on all compute nodes and restart libvirtd service)

Again, please be aware, that this only makes libvirt more robust in terms of automatically starting instances in case of such a failure. Should the very same thing happen to VMs disks, the data not flushed to disks before hard reset is lost at this point.

Revision history for this message

Georgy Dyuldin (g-dyuldin) wrote on 2016-08-24:

#10

Roman, i've update all installed libvirt packages with above and next restarts all services:

http://paste.openstack.org/show/562914/

Next, i try to repeat all steps from bug description, and at third try i've catch same error: one of instances in ERROR status.

http://paste.openstack.org/show/562915/

Revision history for this message

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-08-31:

#11

The additional information was provided, assigned back to MOS Nova team.

MOS Nova team, could you please take a look?

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-09-01:

#12

Georgy,

I've updated the patch:

https://review.fuel-infra.org/#/c/24954

and tested it on a local environment by manually truncating the apparmor profile of a domain first and then starting it via virsh.

I believe, the updated packages here will help:

http://perestroika-repo-tst.infra.mirantis.net/review/CR-24954/mos-repos/xenial/master/pool/main/libv/libvirt/

^ could you please give it a try?

But I want to stress my point again: the test you perform is *expected* to be disruptive: we can make libvirtd a bit more robust in terms of overwriting empty apparmor profiles, but should the very same thing happen to instance disks, the data not flushed to HDD will be lost.

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-09-01:

#13

Given the fact this can only be reproduced under very specific circumstances (power loss happens *right after* a domain is created in libvirtd) triggered by an automatic test and can easily be worked around by removing empty profile files, I don't think this deserves High importance.

Let's see what we can do in master branch and then back port the change for libvirt to 9.2.

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-09-01:

#14

Also, I'm not sure why this has a "regression" tag: nothing has changed in libvirtd - you must have added a new test case, that triggers this behaviour.

tags:

added: area-linux
removed: regression

Timur Nurlygayanov (tnurlygayanov) on 2016-09-05

tags:

added: area-nova
removed: area-linux

Revision history for this message

Georgy Dyuldin (g-dyuldin) wrote on 2016-09-06:

#15

Hi Roman.

I can't boot instance after updating libvirt. I've checked it twice with reverted environment, and before updating instances booted, after - not.

Additional info: http://paste.openstack.org/show/567094/

Tail of logs from compute: http://termbin.com/gqhn