Comment 6 for bug 1535232

Daniel Berrange (berrange) wrote :

Looking at logs from this job:

http://logs.openstack.org/83/252283/96/experimental/gate-tempest-dsvm-multinode-live-migration/469db74/

In the QEMU logfile on the target host we see:

> 2016-01-26 10:30:51.895+0000: shutting down
> load of migration failed

The fact that we see the "shutting down" message before the "load of migration failed" message strongly suggests that libvirt is proactively shutting down QEMU, i.e. the failure to load migration data is *not* the cause of the problem - it is just a side effect of libvirt shutting down QEMU.

There is no sign of errors in the dest host libvirt logs - it is simply told by the source host that migration has failed and to shut down the new QEMU.

Going back to the source host, we see migration started successfully and transferred a lot of data, e.g. the last status update before failure shows:

2016-01-26 10:30:51.197+0000: 8591: debug : qemuMonitorJSONCommandWithFd:264 : Send command '{"execute":"query-migrate","id":"libvirt-144"}' for write with FD -1
2016-01-26 10:30:51.197+0000: 8591: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f1e0c00db30 msg={"execute":"query-migrate","id":"libvirt-144"} fd=-1
2016-01-26 10:30:51.199+0000: 8590: debug : qemuMonitorIOWrite:504 : QEMU_MONITOR_IO_WRITE: mon=0x7f1e0c00db30 buf={"execute":"query-migrate","id":"libvirt-144"} len=48 ret=48 errno=11
2016-01-26 10:30:51.199+0000: 8590: debug : qemuMonitorIOProcess:396 : QEMU_MONITOR_IO_PROCESS: mon=0x7f1e0c00db30 buf={"return": {"expected-downtime": 106, "status": "active", "setup-time": 2, "total-time": 6613, "ram": {"total": 67776512, "remaining": 323584, "mbps": 95.288421, "transferred": 82742303, "duplicate": 4631, "dirty-pages-rate": 310, "skipped": 0, "normal-bytes": 83189760, "normal": 20310}}, "id": "libvirt-144"} len=312
2016-01-26 10:30:51.199+0000: 8590: debug : qemuMonitorJSONIOProcessLine:157 : Line [{"return": {"expected-downtime": 106, "status": "active", "setup-time": 2, "total-time": 6613, "ram": {"total": 67776512, "remaining": 323584, "mbps": 95.288421, "transferred": 82742303, "duplicate": 4631, "dirty-pages-rate": 310, "skipped": 0, "normal-bytes": 83189760, "normal": 20310}}, "id": "libvirt-144"}]
2016-01-26 10:30:51.199+0000: 8590: debug : qemuMonitorJSONIOProcessLine:177 : QEMU_MONITOR_RECV_REPLY: mon=0x7f1e0c00db30 reply={"return": {"expected-downtime": 106, "status": "active", "setup-time": 2, "total-time": 6613, "ram": {"total": 67776512, "remaining": 323584, "mbps": 95.288421, "transferred": 82742303, "duplicate": 4631, "dirty-pages-rate": 310, "skipped": 0, "normal-bytes": 83189760, "normal": 20310}}, "id": "libvirt-144"}
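
For reference, the same progress data that libvirt is polling here with 'query-migrate' can also be read from the management layer via the libvirt python bindings. A rough sketch only - the connection URI and guest name are placeholders, and the exact set of dict keys returned depends on the libvirt version:

# Sketch: poll live migration progress on the source host.
# "instance-00000001" is a hypothetical guest name.
import time
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-00000001")

while True:
    try:
        stats = dom.jobStats()      # wraps virDomainGetJobStats()
    except libvirt.libvirtError:
        break                       # no active job any more
    if not stats.get("memory_total"):
        break                       # migration finished or failed
    print("remaining=%s transferred=%s" % (
        stats.get("memory_remaining"), stats.get("memory_processed")))
    time.sleep(1)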

What is interesting is that we see some I/O error events issued - the first is:

2016-01-26 10:30:46.888+0000: 8590: debug : qemuMonitorJSONIOProcessLine:172 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1e0c00db30 event={"timestamp": {"seconds": 1453804246, "microseconds": 883853}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "operation": "write", "action": "report"}}
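
These BLOCK_IO_ERROR QMP events get surfaced by libvirt as domain I/O error events, so a management application can watch for them too. A rough sketch using the python bindings - the guest name is a placeholder and this assumes the default libvirt event loop:

# Sketch: subscribe to the I/O error events libvirt raises when QEMU
# emits BLOCK_IO_ERROR. "instance-00000001" is a hypothetical guest name.
import threading
import libvirt

def io_error_cb(conn, dom, src_path, dev_alias, action, opaque):
    # action is one of VIR_DOMAIN_EVENT_IO_ERROR_{NONE,PAUSE,REPORT}
    print("I/O error on %s (%s), action=%d" % (dev_alias, src_path, action))

libvirt.virEventRegisterDefaultImpl()

def run_loop():
    while True:
        libvirt.virEventRunDefaultImpl()

threading.Thread(target=run_loop, daemon=True).start()

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-00000001")
conn.domainEventRegisterAny(dom, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR,
                            io_error_cb, None)

threading.Event().wait()    # keep the script alive so callbacks fire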

Initially, migration continues despite these I/O error reports.

After we have received quite a lot of them, though, we eventually see migration fail:

2016-01-26 10:30:51.251+0000: 8591: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7f1e0c00db30 msg={"execute":"query-migrate","id":"libvirt-145"}
2016-01-26 10:30:51.891+0000: 8590: debug : qemuMonitorIOProcess:396 : QEMU_MONITOR_IO_PROCESS: mon=0x7f1e0c00db30 buf={"return": {"status": "failed"}, "id": "libvirt-145"}

That last 'query-migrate' command took a surprisingly long time to get a reply - 600ms - and in between the command and the reply we get a tonne of block I/O errors reported.

What I'm not sure of is why the block I/O errors would cause migration to abort - this shouldn't happen unless the caller explicitly asks libvirt for that behaviour, and nova doesn't ask for it.
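
For the record, the way a caller opts into that behaviour is the VIR_MIGRATE_ABORT_ON_ERROR flag on the migration API. A rough sketch of what such a request would look like through the python bindings, purely to illustrate the flag - the destination URI and guest name are placeholders, and this is not what nova does:

# Sketch: a caller explicitly asking libvirt to abort live migration
# when the guest hits disk I/O errors. Nova does not set this flag.
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-00000001")   # hypothetical guest name

flags = (libvirt.VIR_MIGRATE_LIVE |
         libvirt.VIR_MIGRATE_PEER2PEER |
         libvirt.VIR_MIGRATE_ABORT_ON_ERROR)   # abort on guest I/O errors

# migrateToURI3() takes the destination connection URI, a dict of
# migration parameters, and the flag bitmask.
dom.migrateToURI3("qemu+tcp://dest-host/system", {}, flags)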