OK, the bug happened again with strace attached to nova-compute. Once again, there's little to no IO/network while it happens. memory is stable. CPU is at least 50% idle (and the rest of it largely user mode). Nothing in dmesg.
nova-compute logs are as follow :
2018-04-25 14:48:04.587 54255 INFO nova.virt.libvirt.driver [req-85551d96-713d-499d-b7ff-9f911fb0842d bc0ab055427645aca4ed09266e85b1db 1cb457a8302543fea067e5f14b5241e7 - - -] [instance: bd17aeef-240b-489c-8bb6-b37167155174] Deleting instance files /srv/nova/instances/bd17aeef-240b-489c-8bb6-b37167155174_del
2018-04-25 14:52:06.350 54255 INFO nova.virt.libvirt.driver [req-85551d96-713d-499d-b7ff-9f911fb0842d bc0ab055427645aca4ed09266e85b1db 1cb457a8302543fea067e5f14b5241e7 - - -] [instance: bd17aeef-240b-489c-8bb6-b37167155174] Deletion of /srv/nova/instances/bd17aeef-240b-489c-8bb6-b37167155174_del complete
So it took 4 minutes to remove the files. Looking at strace, it's mostly a single thread doing (unrelated) stuff in the meantime : https://pastebin.canonical.com/p/w63Z62r4zN/ (sorry, Canonical-only link).
The first line is the unlink, and you can see the syscall finishing at the bottom. Looking at this, I'm fairly convinced that it's the unlink() itself that took 4 minutes. So perhaps a kernel bug ?
I'm now running "perf record sleep 60" in a loop to try and diagnose what the hell the system is doing during that time.
OK, the bug happened again with strace attached to nova-compute. Once again, there's little to no IO/network while it happens. memory is stable. CPU is at least 50% idle (and the rest of it largely user mode). Nothing in dmesg.
nova-compute logs are as follow : libvirt. driver [req-85551d96- 713d-499d- b7ff-9f911fb084 2d bc0ab055427645a ca4ed09266e85b1 db 1cb457a8302543f ea067e5f14b5241 e7 - - -] [instance: bd17aeef- 240b-489c- 8bb6-b371671551 74] Deleting instance files /srv/nova/ instances/ bd17aeef- 240b-489c- 8bb6-b371671551 74_del
2018-04-25 14:48:04.587 54255 INFO nova.virt.
2018-04-25 14:52:06.350 54255 INFO nova.virt. libvirt. driver [req-85551d96- 713d-499d- b7ff-9f911fb084 2d bc0ab055427645a ca4ed09266e85b1 db 1cb457a8302543f ea067e5f14b5241 e7 - - -] [instance: bd17aeef- 240b-489c- 8bb6-b371671551 74] Deletion of /srv/nova/ instances/ bd17aeef- 240b-489c- 8bb6-b371671551 74_del complete
So it took 4 minutes to remove the files. Looking at strace, it's mostly a single thread doing (unrelated) stuff in the meantime : https:/ /pastebin. canonical. com/p/w63Z62r4z N/ (sorry, Canonical-only link).
The first line is the unlink, and you can see the syscall finishing at the bottom. Looking at this, I'm fairly convinced that it's the unlink() itself that took 4 minutes. So perhaps a kernel bug ?
I'm now running "perf record sleep 60" in a loop to try and diagnose what the hell the system is doing during that time.