Comment 4 for bug 1766543

Revision history for this message
Junien Fridrick (axino) wrote :

OK, the bug happened again with strace attached to nova-compute. Once again, there's little to no IO/network while it happens. memory is stable. CPU is at least 50% idle (and the rest of it largely user mode). Nothing in dmesg.

nova-compute logs are as follow :
2018-04-25 14:48:04.587 54255 INFO nova.virt.libvirt.driver [req-85551d96-713d-499d-b7ff-9f911fb0842d bc0ab055427645aca4ed09266e85b1db 1cb457a8302543fea067e5f14b5241e7 - - -] [instance: bd17aeef-240b-489c-8bb6-b37167155174] Deleting instance files /srv/nova/instances/bd17aeef-240b-489c-8bb6-b37167155174_del

2018-04-25 14:52:06.350 54255 INFO nova.virt.libvirt.driver [req-85551d96-713d-499d-b7ff-9f911fb0842d bc0ab055427645aca4ed09266e85b1db 1cb457a8302543fea067e5f14b5241e7 - - -] [instance: bd17aeef-240b-489c-8bb6-b37167155174] Deletion of /srv/nova/instances/bd17aeef-240b-489c-8bb6-b37167155174_del complete

So it took 4 minutes to remove the files. Looking at strace, it's mostly a single thread doing (unrelated) stuff in the meantime : https://pastebin.canonical.com/p/w63Z62r4zN/ (sorry, Canonical-only link).

The first line is the unlink, and you can see the syscall finishing at the bottom. Looking at this, I'm fairly convinced that it's the unlink() itself that took 4 minutes. So perhaps a kernel bug ?

I'm now running "perf record sleep 60" in a loop to try and diagnose what the hell the system is doing during that time.