Nodes stuck at grub menu when attempting to deploy

Bug #1532935 reported by Chris Gregan
This bug affects 4 people
Affects          Status     Importance  Assigned to  Milestone
MAAS             Invalid    Undecided   Unassigned
curtin           Invalid    Undecided   Unassigned
grub             New        Undecided   Unassigned
maas-images      Confirmed  Undecided   Unassigned
grub2 (Ubuntu)   Confirmed  Undecided   Unassigned

Bug Description

Build Version/Date: Current revision: 9592
MAAS Version 1.9.1+bzr4543-0ubuntu1 (trusty1)
Juju-core 1.25.3-0ubuntu1~15.10.1~juju1

Environment used for testing:
Trusty

Summary:
When Autopilot attempts to deploy OpenStack, 1-2 nodes fail to deploy. Investigation reveals that they are booting into a grub menu. It seems the MAAS-directed boot does not complete properly. This failure seems to happen only when Autopilot is deploying the nodes. When deploying nodes using MAAS directly, the issue is not reproducible.

Steps to Reproduce:
1) Provide Landscape with credentials to MAAS with 8 physical nodes
2) Select Ceph/Swift configuration
3) Deploy and monitor MAAS node status

Expected result:
Nodes deploy to a fully working HA OpenStack

Actual result:
A couple of nodes fail to deploy and are stuck at a grub menu.

Tags: cdo-qa
Chris Gregan (cgregan) wrote :
Andreas Hasenack (ahasenack) wrote :

Hm, juju 1.25.1 is known to have issues with MAAS, so much so that it was never released.

In particular:
https://bugs.launchpad.net/juju-core/+bug/1525280
https://bugs.launchpad.net/maas/+bug/1519527
https://bugs.launchpad.net/juju-core/+bug/1520199

Juju 1.25.2 would be a better candidate to try now; it's in juju's proposed PPA. You will have to change the landscape code a tiny bit though, so ping me in #landscape for that (until https://bugs.launchpad.net/landscape/+bug/1531601 is fixed; feel free to add heat to that one).

Before trying with juju 1.25.2, you may have to clean some stale DNS entries in MAAS, but I think you redeploy MAAS every time, right?

Changed in landscape:
status: New → Incomplete
Andreas Hasenack (ahasenack) wrote :

Oh, wait a sec, the cloud deploy was done using 1.25.0, not 1.25.1... I see that in the juju status output...

Back to the drawing board.

Changed in landscape:
status: Incomplete → New
Andreas Hasenack (ahasenack) wrote :

Do you have maas logs corresponding to this failed OSA deploy?

Changed in landscape:
status: New → Incomplete
Chris Gregan (cgregan)
Changed in landscape:
status: Incomplete → Invalid
Chris Gregan (cgregan) wrote :

Started importing boot images.
Mar 7 19:10:27 dratini maas.lease_upload_service: [ERROR] Failed to upload leases: 'str' object has no attribute 'mac'
Mar 7 19:10:42 dratini maas.node: [INFO] lairon: Status transition from DEPLOYING to FAILED_DEPLOYMENT
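
To pull the affected node names out of a larger MAAS log, a filter along these lines may help. It is only a sketch: the function name failed_deploy_nodes is made up here, and the pattern assumes every relevant entry has the same layout as the maas.node line quoted above.

```shell
# Print the unique names of nodes that went from DEPLOYING to FAILED_DEPLOYMENT.
# Hypothetical helper; the pattern assumes the log layout of the line quoted above.
failed_deploy_nodes() {
    sed -n 's/.*maas\.node: \[INFO\] \([^:]*\): Status transition from DEPLOYING to FAILED_DEPLOYMENT.*/\1/p' "$@" |
        sort -u
}
```

Usage, with whichever log file holds these entries: failed_deploy_nodes path/to/maas.log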

Changed in landscape:
status: Invalid → Confirmed
description: updated
Andreas Hasenack (ahasenack) wrote :

That "object has no attribute" error is from maas, correct? Looks like it's a bug there?

Changed in landscape:
status: Confirmed → Incomplete
Chris Gregan (cgregan) wrote :

It does seem that way... moving to MAAS for further investigation.

information type: Proprietary → Public
affects: landscape → maas
Changed in maas:
status: Incomplete → Confirmed
John George (jog) wrote :

Nodes intermittently halt at the grub2 menu on the first boot immediately after the deployment has completed (i.e. the installation step completes). This is seen only when deploying Trusty and seems to be related to the node previously being installed with Xenial. However, the halt at the grub menu is not consistently reproducible.

The simpler reproduction steps are to use MAAS 1.9 with Trusty and Xenial images:
1. Deploy a node with Xenial, from the MAAS UI
2. Release the node installed with Xenial
3. Deploy the same node with Trusty, from the MAAS UI

After the Trusty image is deployed the system will stop at the grub menu. There is no grub menu item to select the freshly installed Trusty image.

ii maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server provisioning libraries

John George (jog) wrote :
Andres Rodriguez (andreserl) wrote :

It seems that curtin in trusty / the PPA for 1.9 is not doing the right thing when deploying xenial!

Changed in maas:
status: Confirmed → Incomplete
Ryan Harper (raharper) wrote :

Can you please attach:

maas [session] node get-curtin-config [system-id]
maas [session] interfaces read [system-id]

And confirm the curtin version installed with the 1.9 maas from ppa?

Changed in curtin:
status: New → Incomplete
Chris Gregan (cgregan)
summary: - Nodes stuck at grub menu when attempting to Autopilot deploy
+ Nodes stuck at grub menu when attempting to deploy
John George (jog) wrote :

The curtin packages installed on the MAAS 1.9 server are:
curtin-common/trusty,now 0.1.0~bzr359-0ubuntu1 all [installed,automatic]
python-curtin/trusty,now 0.1.0~bzr359-0ubuntu1 all [installed,automatic]

John George (jog) wrote :
John George (jog) wrote :
Changed in curtin:
status: Incomplete → Confirmed
Changed in maas:
status: Incomplete → Confirmed
Ryan Harper (raharper) wrote :

Do we have the curtin install log from a run where the install succeeded but the node then hung at boot?

Thanks

John George (jog) wrote :
Ryan Harper (raharper) wrote :

I used the specified storage configuration to run a Xenial install, and then reused the same disks to perform a Trusty install on those disk images.

At least under VMs, I cannot reproduce this issue.

Our next best course of action is to install Xenial to the system as normal, then attempt the Trusty install, which should fail as indicated, and then install Xenial again, but using the second disk (/dev/sdb?), or even the NVMe device, as the boot device instead of sda.

Note that during the final install you'll need to ensure that you don't wipe sda. Ideally, override the curtin storage config so that it does not enumerate the original install disk (curtin will then ignore it); we can then mount it and inspect it.

John George (jog) wrote :

Ran into this issue multiple times while using Landscape Autopilot to deploy OpenStack, which was never able to complete due to at least one machine failing to deploy after getting stuck at the grub menu. Juju 1.25.5 and MAAS 1.9.2 were being used.

Once stopped at the grub menu I was able to bring the system up for inspection by doing the following:

    - Reboot the server and drop into the boot menu (F11 on HP)
    - Select the UEFI module option
    - Use the file explorer to drill down to and run shimx64.efi
    - grub> cat /boot/grub/menu.lst (to see configured kernel and initrd paths)
    - grub> linux /boot/vmlinuz-3.13.0-86-generic root=LABEL=root ro console=ttyS1,38400 1
    - grub> initrd /boot/initrd.img-3.13.0-86-generic
    - grub> boot
    - Boot would stop at run level 1 and drop to a root shell prompt
    - /etc/init.d/networking start
    - service ssh start
    - ifconfig (to find configured IP)
    - ssh into the system as the ubuntu user

During the OpenStack deploy attempts, 4 different machines hit this issue while others installed and booted successfully. Trusty 14.04.4 was being installed on all machines. Two identical HP servers, one failing and one successful, had the following differences:

1. The failing server's disk was configured with a 'gpt' partition table, while the
 successful server was configured with a 'msdos' partition table.

2. The failing server had two partitions defined (partition 1 has efi files) :
  1 1049kB 538MB 537MB fat32 boot
  2 538MB 500GB 500GB ext4

   The successful server had only one partition defined:
   1 1049kB 500GB 500GB primary ext4

3. The failing server's /boot directory is configured for EFI and the successful server is not.

It seems MAAS, curtin or grub is inconsistently making decisions about how the partitions should be defined and whether or not to use EFI. When EFI is used, grub does not boot the server.

John George (jog) wrote :
John George (jog) wrote :
John George (jog) wrote :
John George (jog) wrote :
John George (jog) wrote :
John George (jog) wrote :
John George (jog) wrote :

Failed server:

(parted) print
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End    Size   File system  Name  Flags
 1      1049kB  538MB  537MB  fat32              boot
 2      538MB   500GB  500GB  ext4

Successful server:

(parted) print
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End    Size   Type     File system  Flags
 1      1049kB  500GB  500GB  primary  ext4
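
The difference between the two listings comes down to the "Partition Table:" line and the presence of the fat32 boot partition. A small sketch for checking this across several machines follows. The function names are invented for illustration, and partition_table_type simply reads text in the format parted printed above. The /sys/firmware/efi check relies on Linux exposing that directory only when the kernel was booted through UEFI.

```shell
# Read `parted ... print` output on stdin and print the partition table type
# (for example gpt or msdos). Hypothetical helper name.
partition_table_type() {
    awk -F': *' '$1 == "Partition Table" { print $2; exit }'
}

# Succeed when the running kernel was booted through UEFI firmware.
booted_via_efi() {
    [ -d /sys/firmware/efi ]
}
```

Usage: sudo parted /dev/sda print | partition_table_type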

Ryan Harper (raharper) wrote :

Can you attach from the failed server:

1. /boot/efi/EFI/ubuntu/grub.cfg
2. ls -al /dev/disk/by-uuid/ > ls_disk_by_uuid
3. efibootmgr > efibootmgr.out

John George (jog) wrote :

As requested on IRC:

ubuntu@spinda:~/.ssh$ cat /etc/fstab
UUID=af491a56-82d6-4fe1-8049-ca7f5f1667ef / ext4 defaults 0 0
UUID=0C7A-B252 /boot/efi vfat defaults 0 0
/swap.img none swap sw 0 0

John George (jog) wrote :
John George (jog) wrote :

ubuntu@spinda:/boot$ ls -al /dev/disk/by-uuid/
total 0
drwxr-xr-x 2 root root 80 May 19 23:14 .
drwxr-xr-x 7 root root 140 May 19 23:14 ..
lrwxrwxrwx 1 root root 10 May 19 23:14 0C7A-B252 -> ../../sda1
lrwxrwxrwx 1 root root 10 May 19 23:14 af491a56-82d6-4fe1-8049-ca7f5f1667ef -> ../../sda2
ubuntu@spinda:/boot$ efibootmgr
ubuntu@spinda:/boot$

John George (jog) wrote :
John George (jog) wrote :
John George (jog) wrote :

I've attached the curtin-install-cfg.yaml that's left behind in /root of the installed system.
Note the storage layout differences between failed_deploy_curtin-install-cfg.yaml and successful_deploy_curtin-install-cfg.yaml.

Ryan Harper (raharper) wrote :

These are different systems:

Failed:
  http://10.245.208.27/MAAS/metadata/latest/by-id/node-d469af2a-1d33-11e6-a26c-ecb1d7731018/

Success:
  http://10.245.208.27/MAAS/metadata/latest/by-id/node-d2ecaaf8-1d33-11e6-8546-ecb1d7731018/

So, if that's the exact same system, something is really strange with the disks:

In the failed, it includes the following disks:

sda:
  serial: 600508b1001c9896485a7e5e6cdb7f49
sdb:
  serial: 600508b1001cf9578ccaa1accdfda06c
nvme0n1:
  path: /dev/nvme0n1

On Success the storage is different:

sda:
  serial: 600508b1001c52b43e40b184a8929e2a
sdb:
  serial: 600508b1001c99b5a4632b00b8e7bafc

John George (jog) wrote :

These are not the same system, just identically configured systems, commissioned at the same time.

John George (jog) wrote :

FWIW here is a full console output capture from the time MAAS was asked to deploy a server to the time it booted after installation and stopped at the grub menu.

Ryan Harper (raharper) wrote :

Could you try appending this:

echo 'configfile $prefix/grub.cfg' | sudo tee -a /boot/efi/EFI/ubuntu/grub.cfg

All of the UEFI systems I've got have an EFI grub.cfg that looks like this:

% cat /boot/efi/EFI/ubuntu/grub.cfg
search.fs_uuid 6894d00f-75af-4a05-bdda-530beea1c491 root hd0,gpt2
set prefix=($root)'/grub'
configfile $prefix/grub.cfg

and I noticed that the one from comment #28 does not.

Also, can we confirm that the fs_uuid value in grub.cfg matches the root partition (/dev/sda2)? I think it does.
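
One way to check the fs_uuid value against the root partition, sketched under the assumption that the EFI grub.cfg contains a search.fs_uuid line in the form shown above. The helper names are illustrative only; the second argument exists so the lookup directory can be swapped out for testing.

```shell
# Print the filesystem UUID named on the search.fs_uuid line of a grub.cfg.
grub_cfg_uuid() {
    awk '$1 == "search.fs_uuid" { print $2; exit }' "$1"
}

# Print the device that UUID resolves to, looking it up under a by-uuid
# directory (default /dev/disk/by-uuid). Fails if the UUID is missing there.
grub_cfg_root_device() {
    uuid=$(grub_cfg_uuid "$1")
    dir="${2:-/dev/disk/by-uuid}"
    [ -n "$uuid" ] && [ -e "$dir/$uuid" ] && readlink -f "$dir/$uuid"
}
```

If the UUID matches, grub_cfg_root_device /boot/efi/EFI/ubuntu/grub.cfg would be expected to print the root partition device.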

John George (jog) wrote :

Both servers that are stopping at the grub menu indeed do not have the 'configfile $prefix/grub.cfg' line in their /boot/efi/EFI/ubuntu/grub.cfg file.

Adding that line allows them to boot without stopping at the grub prompt.

I released both of these servers and re-deployed Trusty from MAAS. Again they stopped at grub and did not have the configfile line in grub.cfg. Manually triggering the server to come up with:

grub> linux /boot/vmlinuz-3.13.0-86-generic root=LABEL=root ro console=ttyS1,38400
grub> initrd /boot/initrd.img-3.13.0-86-generic
grub> boot

Once the system was up, the MAAS state changed to deployed. Any reboots would stop at the grub menu until I manually added the 'configfile $prefix/grub.cfg' line to the /boot/efi/EFI/ubuntu/grub.cfg file.

Ryan, you appear to have identified the issue here.
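
Until the underlying cause is fixed, the manual workaround can be made repeatable. The following is a minimal sketch, not a proper fix: ensure_configfile_line is a hypothetical helper that appends the line only when it is missing, so running it twice does not duplicate it. It needs write access to the file (root, for the real path).

```shell
# Append 'configfile $prefix/grub.cfg' to an EFI grub.cfg if it is not there yet.
# Hypothetical helper; the default path is the one used in this report.
ensure_configfile_line() {
    cfg="${1:-/boot/efi/EFI/ubuntu/grub.cfg}"
    # Single quotes keep $prefix literal: grub expands it at boot, not the shell.
    line='configfile $prefix/grub.cfg'
    [ -f "$cfg" ] || { echo "no such file: $cfg" >&2; return 1; }
    if grep -qxF -e "$line" "$cfg"; then
        echo "already present in $cfg"
        return 0
    fi
    # Make sure the file ends with a newline before appending.
    [ -z "$(tail -c 1 "$cfg")" ] || echo >> "$cfg"
    printf '%s\n' "$line" >> "$cfg"
    echo "appended to $cfg"
}
```

Run as root with no argument to act on the path from this report, or pass a copy of the file to try it safely first.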

Ryan Harper (raharper) wrote :

Cool!

Now the question is where grub-install (from the grub-efi package) writes out that file, and why it would not include that last line; it seems like boilerplate to me.

You might want to target this bug against the grub package.


Blake Rouse (blake-rouse) wrote :

This is actually an issue with the grub package that is installed to the disk, not the grub that MAAS uses to UEFI boot, deploy, and chainload.

This is actually an issue in grub in Ubuntu, not MAAS.

no longer affects: grub
Changed in maas:
status: Confirmed → Invalid
Changed in curtin:
status: Confirmed → Invalid
Changed in maas-images:
status: New → Confirmed
Blake Rouse (blake-rouse) wrote :

This will affect maas-images, which will need to be updated once a new grub is placed into main.

Can you test if this is an issue on Xenial as well?

John George (jog) wrote :

Our test environment frequently installs Xenial, but we have only seen this issue with Trusty.

Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
Mathieu Trudel-Lapierre (cyphermox) wrote :

Could you please try a few things?

1) Type 'normal' at the prompt; we should make sure that the grub.cfg config file is read and that the menu can show in this case.

2) Are you installing with the hwe-xenial kernel? If not, could you please try this?

3) If none of this works, we'll get you a debug version of grub2 or just grub-install for testing with some extra logging, so we can see what happens when it attempts to write that last line to grub.cfg in the EFI path.

John George (jog) wrote :

Typing 'normal' at the prompt does boot the system.

So far a Trusty kernel has been used during install. I'll see if there is an hwe-xenial kernel available through MAAS, although this issue is intermittent, so all I could do is run with it for a while and see if the issue reproduces.

Christian Reis (kiko) wrote :

John, the Xenial kernel should be available if you enable the daily stream.

John George (jog) wrote :

When switching to the Xenial kernel the Trusty installation fails even earlier, complaining about 'unrecognised disk label' during the installation phase of the deploy.

John George (jog) wrote :

For comparison here is the console log after switching back to the Trusty kernel.

John George (jog) wrote :

What can be done to further investigate this bug? It causes frequent MAAS deployment failures in our environment. I believe its importance should be Critical or High, since it blocks Landscape Autopilot from successfully installing an OpenStack cloud.

Chris Gregan (cgregan)
tags: removed: cdo-qa-blocker
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grub2 (Ubuntu):
status: New → Confirmed