Comment 2 for bug 1259879

Daniel Manrique (roadmr) wrote :

OK, this was a fun one to debug.

The test was designed to download a fresh cloud image every time, which is why we hadn't noticed that the VM is "corrupted" after the first boot. The problem becomes evident when using a local cloud image: it's a writable disk image, so changes persist between runs.

What happens is that our test waits a fixed amount of time and then forcefully terminates qemu, the equivalent of shutting down the VM ungracefully. For some reason (outside the scope of our purposes), this causes it to spit out those errors on the next boot.

This was easily solved by amending our cloud-init file to include a power_state stanza. The VM boots, performs the initialization process, and then power_state halts it (not powering it off; we just want to leave it in a state where yanking the plug won't damage the disk). When we then do exactly that, the disk is left in a state that will boot successfully the next time.
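
For illustration, here is a minimal sketch of how the test could generate such a cloud-init file from Python. The function name build_user_data and the message text are made up for this example and are not taken from the actual test; power_state with "mode: halt" is the standard cloud-init option described above.

```python
def build_user_data(halt_message="Halting after cloud-init finished"):
    """Return a cloud-config document that halts the VM once cloud-init is done.

    The name and the default message are hypothetical; only the
    power_state keys (mode, message) come from cloud-init itself.
    """
    lines = [
        "#cloud-config",
        "power_state:",
        "  mode: halt",  # halt rather than poweroff, as described above
        f"  message: {halt_message}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_user_data(), end="")
```

The resulting text would be written to the user-data file that is handed to the VM, so the guest is already halted by the time the test terminates qemu.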

Once that was fixed, the VM still "failed to boot". The reason is that the script looks for the "END SSH HOST KEYS" string to determine whether the VM booted correctly. However, this string is only produced on the *first* boot, when creating those keys; subsequent boots will NOT have this string. This was fixed by using the "final_message" cloud-init option, so we have a consistent message when boot is completed successfully. I'm still checking for END SSH HOST KEYS to identify the first boot.
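
As a rough sketch of that check, assuming the console output of the VM is available as one string: BOOT_DONE_MARKER stands in for whatever text is configured through final_message (the value here is invented), and classify_boot is a hypothetical helper, not the function used in the real script.

```python
FIRST_BOOT_MARKER = "END SSH HOST KEYS"   # only printed when host keys are created
BOOT_DONE_MARKER = "TEST_BOOT_COMPLETE"   # hypothetical final_message text


def classify_boot(console_output):
    """Classify a boot from the VM's console output.

    Returns "failed" if the final_message text never appeared,
    "first boot" if the SSH host keys were also generated,
    and "subsequent boot" otherwise.
    """
    if BOOT_DONE_MARKER not in console_output:
        return "failed"
    if FIRST_BOOT_MARKER in console_output:
        return "first boot"
    return "subsequent boot"
```

With this split, a reused image no longer counts as a failure just because the host keys already exist.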

With the proposed changes, the VM can be reused many times for testing.

Another option, which I didn't implement, is to simply copy the .img file to a temporary file and use that copy for the testing. That way we always have the "pristine" image as a starting point. But this is just an idea that I didn't really work on, as it felt more complicated.
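
For reference, the copy-based alternative might look something like the sketch below. It was not implemented or tried as part of this fix; scratch_copy and the "kvm-test-" prefix are invented for the example.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_copy(image_path):
    """Yield the path of a throwaway copy of image_path.

    The copy lives in a fresh temporary directory that is removed
    on exit, so the original image is never written to.
    """
    tmp_dir = tempfile.mkdtemp(prefix="kvm-test-")
    try:
        # shutil.copy returns the path of the newly created file
        yield shutil.copy(image_path, tmp_dir)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

The test would then boot the path yielded by scratch_copy instead of the original .img file. The cost is one full copy of the image per run.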