There is a race condition when launching multiple instances simultaneously over NFS that can cause the os.utime call to fail while updating the mtime of the base image:
2018-12-15 14:22:38.740 7 INFO nova.virt.libvirt.driver [req-d33edf35-733b-4591-831c-666cd159cee1 8965b22a11c44875a90fe88f50769a5a b9644067db0d44789e19d9d032287ada - default default] [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] Creating image
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [req-d33edf35-733b-4591-831c-666cd159cee1 8965b22a11c44875a90fe88f50769a5a b9644067db0d44789e19d9d032287ada - default default] [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] Instance failed to spawn: OSError: [Errno 13] Permission denied
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] Traceback (most recent call last):
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2252, in _build_resources
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] yield resources
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2032, in _build_and_run_instance
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] block_device_info=block_device_info)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3091, in spawn
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] block_device_info=block_device_info)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3469, in _create_image
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] fallback_from_host)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3560, in _create_and_inject_local_root
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] instance, size, fallback_from_host)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7634, in _try_fetch_image_cache
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] size=size)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 243, in cache
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] *args, **kwargs)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 601, in create_image
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] nova.privsep.path.utime(base)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 207, in _wrap
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] return self.channel.remote_call(name, args, kwargs)
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 202, in remote_call
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] raise exc_type(*result[2])
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552] OSError: [Errno 13] Permission denied
2018-12-15 14:22:38.747 7 ERROR nova.compute.manager [instance: 6fec5d88-09ab-4ecc-815d-c08c298fe552]
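For reference, the call that blows up is the mtime refresh that the image backend performs on the cached base image so that the periodic image cache manager does not treat it as stale. Stripped of the privsep plumbing, create_image ends up doing roughly the following (a paraphrase of the call path, not the exact Nova source; the path is the base image from the listing further down):

    import os

    # Cached base image in the shared instances directory (see listing below).
    base = '/var/lib/nova/instances/_base/78c6ae575fd026484b1f46588b5c337b3fd53558'

    # nova.privsep.path.utime(base) hands this off to the privsep daemon,
    # which runs with elevated privileges and effectively does:
    os.utime(base, None)  # bump atime/mtime to "now"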
The NFS mount and the permissions look correct:
[root@overcloud-novacompute-0 ~]# mount | grep nova
overcloud-controller-0.opnfvlf.org:/nova on /var/lib/nova/instances type nfs4 (rw,relatime,sync,context=system_u:object_r:nfs_t:s0,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,nosharecache,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.0.2.44,local_lock=none,addr=192.0.2.40,_netdev)
[root@overcloud-novacompute-0 ~]# getenforce
Permissive
[root@overcloud-novacompute-0 ~]# ls -lrt /var/lib/nova/
total 0
drwxrwxrwx. 4 42436 42436 53 Dec 14 19:52 instances
[root@overcloud-novacompute-0 ~]# ls -lrt /var/lib/nova/instances/
total 4
drwxr-xr-x. 2 42436 42436 54 Dec 14 19:16 _base
drwxr-xr-x. 2 42436 42436 93 Dec 14 19:52 locks
-rw-r--r--. 1 42436 42436 58 Dec 14 20:33 compute_nodes
[root@overcloud-novacompute-0 ~]# ls -lrt /var/lib/nova/instances/_base/
total 13384
-rw-r--r--. 1 42427 42427 46137344 Dec 14 19:18 78c6ae575fd026484b1f46588b5c337b3fd53558
[root@overcloud-novacompute-0 ~]# docker exec -it bf4658c561e1 /bin/bash
()[nova@overcloud-novacompute-0 /]$ id -u nova
42436
()[nova@overcloud-novacompute-0 /]$ id -u qemu
42427
()[nova@overcloud-novacompute-0 /]$ ls -lrt /var/lib/nova/instances/
total 4
drwxr-xr-x. 2 nova nova 54 Dec 14 19:16 _base
drwxr-xr-x. 2 nova nova 93 Dec 14 19:52 locks
-rw-r--r--. 1 nova nova 58 Dec 14 20:33 compute_nodes
()[nova@overcloud-novacompute-0 /]$ ls -lrt /var/lib/nova/instances/_base/
total 13384
-rw-r--r--. 1 qemu qemu 46137344 Dec 14 19:18 78c6ae575fd026484b1f46588b5c337b3fd53558
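Given that listing, here is what I think is going on: the base image is owned by qemu (42427) with mode 0644, while the utime is issued by the privsep daemon running as root in the nova_compute container. os.utime with times=None requires the caller to own the file, have write access to it, or hold the appropriate privilege. Locally root sails through, but over NFS the permission check is made by the server, and with the usual root_squash export behaviour (an assumption on my part, I have not checked the export options) client root is mapped to an unprivileged UID that neither owns the file nor can write to it, hence errno 13. A minimal check that can be run as root on the compute host (an illustrative sketch, with the path taken from the listing above):

    import os

    base = '/var/lib/nova/instances/_base/78c6ae575fd026484b1f46588b5c337b3fd53558'

    try:
        # The same operation the privsep daemon performs on nova's behalf.
        os.utime(base, None)
        print('utime succeeded')
    except OSError as exc:
        # On NFS the access check happens server-side; if root is squashed
        # to an anonymous UID, that UID neither owns the qemu-owned file nor
        # has write permission on it, so this fails with errno 13, matching
        # the traceback above.
        print('utime failed: %s' % exc)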
I have been regularly hitting this too, in a test suite (for the Calico networking driver) using Queens.
The typical pattern for me is:
- launch an instance on a compute node (which is an NFS client); this succeeds
- after 20-30s, launch another instance on the same compute node; this fails with the same call stack as above.
That is not strictly simultaneous, but I guess it's close enough for us to be talking about the same root cause here.
The biggest thing that I don't understand about this is: why aren't there loads of OpenStack users hitting this all the time? Do most OpenStack deployments using Queens or later not use NFS for instance ephemeral storage? If they don't, is that because they don't care about live migration, or because they're getting it some other way?
(I should add that I'm pretty sure this problem was introduced going from Pike to Queens: I run the same test suite regularly on Pike and Queens, the Pike runs never see a problem like this, and the nova.privsep.path.utime code was introduced between Pike and Queens.)
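One thing that would at least stop this from killing the spawn would be to treat a failed mtime bump as non-fatal; the worst case is that the image cache manager's age check for that base file becomes less accurate. The following is only a sketch of that idea (refresh_base_mtime is a hypothetical helper, and this is not a claim about what upstream Nova does or should do):

    import logging

    import nova.privsep.path

    LOG = logging.getLogger(__name__)


    def refresh_base_mtime(base):
        # Hypothetical helper: try to bump the mtime of the cached base
        # image, but do not fail the instance build if we cannot (e.g. NFS
        # refusing the utime for a file the caller does not own).
        try:
            nova.privsep.path.utime(base)
        except OSError:
            LOG.warning('Could not update mtime of %s; the image cache '
                        'manager may see a stale timestamp for it.', base)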