Seeing this here:
http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/new/screen-n-cpu.txt.gz#_2017-07-24_16_41_40_410
2017-07-24 16:41:40.410 31027 ERROR nova.compute.manager [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Unexpected error during post live migration at destination host.: InstanceNotFound: Instance 39285c11-48d7-4ae6-be42-5b49826ea380 could not be found.
2017-07-24 16:41:40.411 31027 DEBUG nova.compute.manager [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Checking state _get_power_state /opt/stack/new/nova/nova/compute/manager.py:1142
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] Exception during message handling: InstanceNotFound: Instance 39285c11-48d7-4ae6-be42-5b49826ea380 could not be found.
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 76, in wrapped
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server self.force_reraise()
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 67, in wrapped
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/utils.py", line 864, in decorated_function
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 199, in decorated_function
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 5713, in post_live_migration_at_destination
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server 'destination host.', instance=instance)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server self.force_reraise()
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 5708, in post_live_migration_at_destination
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server block_device_info)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 7117, in post_live_migration_at_destination
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server guest = self._host.get_guest(instance)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 534, in get_guest
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server return libvirt_guest.Guest(self.get_domain(instance))
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 555, in get_domain
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server raise exception.InstanceNotFound(instance_id=instance.uuid)
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server InstanceNotFound: Instance 39285c11-48d7-4ae6-be42-5b49826ea380 could not be found.
2017-07-24 16:41:40.560 31027 ERROR oslo_messaging.rpc.server
This is in a Pike CI job.
It fails here:
https://github.com/openstack/nova/blob/87a0143267743e884c60f3c93f80d8fdea441322/nova/virt/libvirt/driver.py#L7117
The TODO after that line seems to suggest we should expect the guest domain to already be persisted on the destination host by libvirtd at this point.
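Just to spell out what that InstanceNotFound means here: libvirtd on the destination has no domain at all for that UUID. A standalone approximation of what the get_guest()/get_domain() lookup in the traceback boils down to (plain libvirt-python, not Nova's actual code):

import libvirt

# Rough approximation (not Nova code) of the lookup that fails in the
# traceback above: does libvirtd on this host know about the guest?
def domain_exists(uri, instance_uuid):
    conn = libvirt.openReadOnly(uri)
    try:
        conn.lookupByUUIDString(instance_uuid)
        return True
    except libvirt.libvirtError as ex:
        if ex.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            # This is the condition Nova reports as InstanceNotFound.
            return False
        raise
    finally:
        conn.close()

# e.g. domain_exists('qemu:///system', '39285c11-48d7-4ae6-be42-5b49826ea380')
# returns False on the destination here because the guest never landed there.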
I'm pretty sure this is a duplicate, but in this case something actually fails on the source side after we've already considered the live migration complete, so the guest never actually ends up on the destination host.
This is the resulting error back on the source host when post_live_migration_at_destination fails on the destination:
http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/old/screen-n-cpu.txt.gz#_2017-07-24_16_41_40_557
And right before the source calls post_live_migration_at_destination, I see this error in the source host logs:
http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/old/screen-n-cpu.txt.gz#_2017-07-24_16_41_39_717
2017-07-24 16:41:39.717 19889 ERROR nova.virt.libvirt.driver [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Live Migration failure: Unable to read from monitor: Connection reset by peer
2017-07-24 16:41:39.718 19889 DEBUG nova.virt.libvirt.driver [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Migration operation thread notification thread_finished /opt/stack/old/nova/nova/virt/libvirt/driver.py:6527
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
timer()
File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
cb(*args, **kw)
File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 168, in _do_send
waiter.switch(result)
File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
result = function(*args, **kwargs)
File "/opt/stack/old/nova/nova/utils.py", line 1087, in context_wrapper
return func(*args, **kwargs)
File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 6179, in _live_migration_operation
instance=instance)
File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
self.force_reraise()
File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
six.reraise(self.type_, self.value, self.tb)
File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 6172, in _live_migration_operation
bandwidth=CONF.libvirt.live_migration_bandwidth)
File "/opt/stack/old/nova/nova/virt/libvirt/guest.py", line 623, in migrate
destination, params=params, flags=flags)
File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
result = proxy_call(self._autowrap, f, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
rv = execute(f, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
six.reraise(c, e, tb)
File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
rv = meth(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1674, in migrateToURI3
if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer
And right before that, it looks like the driver considered the live migration complete:
http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/old/screen-n-cpu.txt.gz#_2017-07-24_16_41_39_702
2017-07-24 16:41:39.702 19889 INFO nova.virt.libvirt.driver [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] Migration operation has completed
2017-07-24 16:41:39.702 19889 INFO nova.compute.manager [req-815b2093-b72d-4ef5-a8e5-b00113e0a688 tempest-LiveMigrationTest-911619613 tempest-LiveMigrationTest-911619613] [instance: 39285c11-48d7-4ae6-be42-5b49826ea380] _post_live_migration() is started..
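I think that "Migration operation has completed" message is misleading in this case: as far as I can tell the monitoring side infers success from the job/domain state on the source (no active migration job and the domain gone or shut off), not from the result of the thread actually running the migration, and a qemu crash on the source looks exactly like a successful migration from that point of view. A simplified, hypothetical illustration of that heuristic (plain libvirt-python again, not the actual monitor code):

import libvirt

def guess_outcome_on_source(conn, instance_uuid):
    # Simplified guess for "there is no active migration job any more, so
    # what happened?". A domain that is gone or no longer running on the
    # source is indistinguishable from one that was successfully migrated.
    try:
        dom = conn.lookupByUUIDString(instance_uuid)
    except libvirt.libvirtError as ex:
        if ex.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            return 'completed'   # domain gone: success... or qemu crashed
        return 'failed'
    return 'failed' if dom.isActive() else 'completed'

# In this trace qemu crashed and the domain went away, so the source decides
# 'completed' and starts _post_live_migration() before the libvirtError from
# migrateToURI3() surfaces about 15ms later.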
And the underlying failure is in the qemu domain log on the source host:
http://logs.openstack.org/66/483566/10/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/0437fbe/logs/subnode-2/libvirt/qemu/instance-00000011.txt.gz
2017-07-24 16:41:39.286+0000: initiating migration
qemu-system-x86_64: /build/qemu-orucB6/qemu-2.8+dfsg/block/io.c:1514: bdrv_co_pwritev: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
2017-07-24 16:41:39.699+0000: shutting down, reason=crashed
That qemu assertion failure was actually reported back in April in bug 1685340. From Kashyap's comments it sounded like something was trying to write to the image on the source host while it was being migrated (as I understand it, BDRV_O_INACTIVE means the source's block devices have already been marked inactive for handover to the destination, so any further write trips that assert and crashes qemu).
I'm not sure whether there is something we can do on the source host to detect this or work around it.
Logstash query for this failure signature:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Live%20Migration%20failure%3A%20Unable%20to%20read%20from%20monitor%3A%20Connection%20reset%20by%20peer%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d
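On the question of detecting this on the source host: one hypothetical option would be to not declare the migration complete until the greenthread that ran migrateToURI3() has had a chance to report its outcome, since in this trace the libvirtError lands only ~15ms after the "completed" message. A rough sketch of that idea (made-up helper name, not a proposed patch):

import eventlet

def operation_really_succeeded(operation_thread, grace=5):
    # Hypothetical check: once the monitor thinks the migration is done, give
    # the greenthread running the migration operation a short window to report
    # its real outcome before we kick off post-migration work.
    timeout = eventlet.Timeout(grace)
    try:
        operation_thread.wait()   # re-raises e.g. the libvirtError above
        return True
    except eventlet.Timeout as t:
        if t is not timeout:
            raise
        return True               # still running; keep today's behaviour
    except Exception:
        return False              # the migration actually failed on the source
    finally:
        timeout.cancel()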