Comment 2 for bug 1827238

Andres Rodriguez (andreserl) wrote : Re: 2.6beta2: many nodes failed deployment with time out

So this is what I see in the logs:

1. In rackd.log on .32, I see the machine PXE boot to start the deployment process:

2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7
2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.244.41.7
2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7

2. In rackd.log on .30, I see it PXE boot post-deployment (and it's told to localboot):

2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7

3. I see that curtin has run the deployment process and hasn't reported any errors - log: https://pastebin.ubuntu.com/p/zMgTttxdSj/ | curtin config: https://pastebin.ubuntu.com/p/Y2ZMX6Rstd/
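For anyone who wants to repeat the comparison in 1 and 2 on the other failed nodes, here is a rough Python sketch. It assumes rackd.log lines have exactly the format shown above; the function names and the three labels it returns are made up for illustration and are not part of MAAS. Note that it only looks at which files were requested: a boot that stops at grub.cfg-<MAC> without fetching a kernel is consistent with localboot, but the log alone doesn't show what grub did afterwards.

```python
import re

# Matches request lines in the format seen in the rackd.log excerpts above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"provisioningserver\.rackdservices\.(?P<proto>tftp|http): "
    r"\[info\] (?P<path>\S+) requested by (?P<client>\S+)$"
)


def parse_requests(lines):
    """Yield (timestamp, protocol, path, client) for each request line."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            yield m["ts"], m["proto"], m["path"], m["client"]


def classify_boot(lines, client):
    """Label one boot sequence for one client (labels are hypothetical)."""
    paths = [p for _, _, p, c in parse_requests(lines) if c == client]
    if any(p.endswith("/boot-kernel") for p in paths):
        return "network boot (kernel fetched)"
    if any(p.startswith("/grub/grub.cfg-") for p in paths):
        return "grub config only (consistent with localboot)"
    return "no per-machine grub config requested"


# Lines copied from the excerpts above (shortened to the relevant ones).
DEPLOY_BOOT = [
    "2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7",
    "2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7",
]
POST_DEPLOY_BOOT = [
    "2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7",
    "2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7",
]

if __name__ == "__main__":
    print(classify_boot(DEPLOY_BOOT, "10.244.41.7"))
    print(classify_boot(POST_DEPLOY_BOOT, "10.244.41.7"))
```

To use it on a real log, split the file into one chunk per boot (for example, starting a new chunk at each bootx64.efi request) and pass each chunk to classify_boot.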

So, from all the information above, I don't think we have enough information to know what the issue is. The possibilities are:

A. The machine was never instructed to localboot.
B. The machine was instructed to localboot, but grub failed.
C. The machine booted onto the disk, but either didn't get network or failed to contact metadata.
D. There is a firmware issue preventing the machine from accessing the deployed environment.

Looking at the logs, it seems that:

A -> The machine did indeed reboot, accessed the grub config, and was instructed to localboot.
B -> We don't know if grub failed, because we have no console logs.
C -> It could be that the network was not configured properly on reboot, either due to cloud-init or a bug in netplan. For this we need console logs.
D -> We need console logs.

So from all the info here, I'm marking this bug as Incomplete, as we would really need console logs to determine what the issue is. That said, this could also be a curtin issue: while the install succeeded, curtin could have misconfigured something so that the machine never really boots into the installed environment. So, I'm adding curtin to see if they can help us.

@Jason, quick q: are the machines that failed to boot all using grub? It seems to me that's the case, but I just want to double check.

Lastly, we would really need console logs. @Jason, you can set up conserver to automatically gather the logs from the console and share those with MAAS.