MAAS

Quanta D52B-1U unable to PXE-boot in EFI mode

Bug #1752687 reported by Rod Smith on 2018-03-01

This bug report is a duplicate of: Bug #1437353: UEFI network boot hangs at grub for adapter 82599ES 10-Gigabit SFI/SFP+.

This bug affects 1 person

Affects                  Status       Importance   Assigned to   Milestone
MAAS                     Incomplete   Undecided    Unassigned
grub2-signed (Ubuntu)    New          Undecided    Unassigned

Bug Description

Certification recently received a Quanta D52B-1U server (jehan). This server enlists, commissions, and deploys fine in BIOS/CSM/legacy mode; in EFI/UEFI mode, however, it fails, hanging at the "grub>" prompt. Here's a capture from an IPMI SOL session:

>>Checking Media Presence......
>>Media Present......
>>Start PXE over IPv4. Press ESC key to abort PXE boot.
Station IP address is 10.1.10.164

  Server IP address is 10.1.10.2
  NBP filename is bootx64.efi
  NBP filesize is 1196736 Bytes

>>Checking Media Presence......
>>Media Present......
Downloading NBP file...

Succeed to download NBP file.
Fetching Netboot Image

GNU GRUB version 2.02~beta2-36ubuntu3.16

   Minimal BASH-like line editing is supported. For the first word, TAB
   lists possible command completions. Anywhere else TAB lists possible
   device or file completions.

grub> ls
(memdisk) (hd0) (hd1) (hd1,msdos1) (hd2) (hd2,gpt1)
grub> ls (memdisk)/
grub.cfg
grub> less (memdisk)/grub.cfg
error: can't find command `less'.
grub> cat (memdisk)/grub.cfg
if [ -e $prefix/x86_64-efi/grub.cfg ]; then
source $prefix/x86_64-efi/grub.cfg
else
source $prefix/grub.cfg
fi

grub>

The interaction at the end ("ls" and other commands at the "grub>" prompt) is me trying to identify the GRUB environment; these commands were not entered automatically.

I've tried dozens of combinations of firmware settings (enabling and disabling various options), with no success; the system always seems to fail at the same point.

I've been unable to enlist the node in EFI mode, and once it is enlisted in BIOS mode, commissioning it in EFI mode fails. Because commissioning sets up partitions, including the critical EFI System Partition (ESP), which are different between BIOS- and EFI-mode boots, I have been unable to test deployment in EFI mode.

Other systems have commissioned and deployed from this MAAS server both shortly before and shortly after I discovered the problem with jehan.

This bug is similar in symptoms to several other "hang-at-grub>" bugs, such as bug #1690878 and bug #1437024; however, I have yet to find a workaround, and this problem occurs 100% of the time.

I'm attaching the contents of /var/log/maas/* to this bug report. Here's the MAAS version information:

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-====================================-============-==================================================
ii maas 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cert-server 0.2.30-0~76~ubuntu16.04.1 all Ubuntu certification support files for MAAS server
ii maas-cli 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Rod Smith (rodsmith) wrote on 2018-03-01:

/var/log/maas directory tree from the server (98.4 MiB, application/x-tar)

tags: added: hwcert-server

Rod Smith (rodsmith) wrote on 2018-03-01:

Jeff Lane has noted that, according to the logs, the system PXE-boots in BIOS mode off a8:1e:84:f2:96:c5 (which MAAS identifies as enp59s0f0), whereas in EFI mode it uses a8:1e:84:f2:96:c6 (enp59s0f1).

Andres Rodriguez (andreserl) wrote on 2018-03-01:

Hi Rod,

Is this with Secure Boot enabled? Could you also please attach a tcpdump of the PXE process for the specific MAC address?

That said, if there are other servers that boot fine in EFI mode, this, to me, would normally indicate that there is a bug in grub itself or in the firmware. I’m opening a task against grub.

Changed in maas:
status:	New → Incomplete

Andres Rodriguez (andreserl) wrote on 2018-03-01:

@Rod,

MAAS doesn’t differentiate based on which PXE interface the machine boots from, provided that we identify the MAC address and send the configuration for it. The raw tcpdump should give us more info.

Please also attach the full console log.

Jeff Lane (bladernr) wrote on 2018-03-01:

Rod said this might be critical, so I'll add it for reference...

From what I could tell in rackd.log, every EFI boot always comes from a8:1e:84:f2:96:c6 and every BIOS PXE boot comes from a8:1e:84:f2:96:c5.

rackd.log:2018-03-01 12:32:26 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6
rackd.log:2018-03-01 12:32:26 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by a8:1e:84:f2:96:c6
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] ldlinux.c32 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/01-a8-1e-84-f2-96-c5 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0A010AA4 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0A010AA requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0A010A requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0A010 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0A01 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0A0 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0A requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/0 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/default requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] ifcpu64.c32 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] libcom32.c32 requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:42 provisioningserver.rackdservices.tftp: [info] ubuntu/amd64/hwe-16.04/xenial/daily/boot-kernel requested by a8:1e:84:f2:96:c5
rackd.log:2018-03-01 12:36:43 provisioningserver.rackdservices.tftp: [info] ubuntu/amd64/hwe-16.04/xenial/daily/boot-initrd requested by a8:1e:84:f2:96:c5
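
For reference, the pxelinux.cfg/... requests in the BIOS-mode portion of this log follow PXELINUX's usual configuration search order: a file named after the client's MAC address (prefixed with the ARP hardware type 01), then the client's IPv4 address in uppercase hex with one trailing digit dropped per attempt, and finally "default". (PXELINUX can also try a client UUID first when the firmware provides one; no such request appears in this log.) A minimal Python sketch that reproduces the names seen above for 10.1.10.164; the function name is hypothetical and not part of MAAS or Syslinux:

```python
import ipaddress


def pxelinux_config_candidates(mac: str, ip: str) -> list[str]:
    """Return config file names in the order PXELINUX would request them.

    Illustrative only: the UUID-based lookup is deliberately left out.
    """
    # MAC-based name: ARP hardware type 01 (Ethernet), then the MAC with dashes.
    mac_name = "01-" + mac.lower().replace(":", "-")
    # IPv4 address as eight uppercase hex digits, e.g. 10.1.10.164 -> 0A010AA4.
    hex_ip = f"{int(ipaddress.IPv4Address(ip)):08X}"
    names = [mac_name]
    # Drop one trailing hex digit per attempt: 0A010AA4, 0A010AA, ..., 0.
    names += [hex_ip[:n] for n in range(8, 0, -1)]
    names.append("default")
    return [f"pxelinux.cfg/{name}" for name in names]


if __name__ == "__main__":
    for path in pxelinux_config_candidates("a8:1e:84:f2:96:c5", "10.1.10.164"):
        print(path)
```

The EFI-mode boots never reach an equivalent stage in the log; the requests stop after grubx64.efi.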

Rod Smith (rodsmith) wrote on 2018-03-01:

tcpdump capture of an EFI-mode commissioning attempt. (18.7 MiB, application/vnd.tcpdump.pcap)

Here's a tcpdump capture of a commissioning attempt. I don't know what you mean by "full console log." The original bug report includes a full capture of the IPMI SOL console, and there is no "logs" tab in the MAAS web UI for the node.

Rod Smith (rodsmith) wrote on 2018-03-01:

Oh, this was all with Secure Boot disabled, as far as I can tell. (I see no Secure Boot options in the server's setup screens.)

Jeff Lane (bladernr) wrote on 2018-03-01:

I got into the system for a bit and tried forcing it to PXE-boot off each NIC. Both cases resulted in the same thing: a PXE request for bootx64.efi is made (and apparently succeeds), then a request for grubx64.efi happens, and we eventually get dumped to a grub prompt.

I did notice this, though: in both cases the request for bootx64.efi happens twice. Is this expected?

2018-03-01 15:33:32 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6
2018-03-01 15:33:32 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6
2018-03-01 15:33:32 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by a8:1e:84:f2:96:c6
2018-03-01 15:36:19 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c5
2018-03-01 15:36:19 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c5
2018-03-01 15:36:19 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by a8:1e:84:f2:96:c5
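
To check this pattern across the whole of rackd.log rather than by eye, the TFTP requests can be grouped by requesting MAC address and the repeats counted. A rough Python sketch, assuming the log line format shown above; the regular expression and function name are illustrative and not part of MAAS:

```python
import re
from collections import Counter, defaultdict

# Matches lines such as:
# 2018-03-01 15:33:32 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6
# search() is used below so that a leading "rackd.log:" prefix from grep is tolerated.
TFTP_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"provisioningserver\.rackdservices\.tftp: \[info\] "
    r"(?P<path>\S+) requested by (?P<mac>[0-9a-f:]{17})"
)


def requests_by_mac(lines):
    """Map each MAC address to a Counter of the files it requested."""
    result = defaultdict(Counter)
    for line in lines:
        match = TFTP_LINE.search(line)
        if match:
            result[match["mac"]][match["path"]] += 1
    return result


if __name__ == "__main__":
    sample = [
        "2018-03-01 15:33:32 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6",
        "2018-03-01 15:33:32 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6",
        "2018-03-01 15:33:32 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by a8:1e:84:f2:96:c6",
    ]
    for mac, files in requests_by_mac(sample).items():
        repeated = {path: n for path, n in files.items() if n > 1}
        print(mac, dict(files), "repeated:", repeated)
```

This only counts requests; it says nothing about whether a doubled bootx64.efi request is normal for this firmware.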

Rod Smith (rodsmith) wrote on 2018-03-01:

I've perused the tcpdump output. It looks like jehan is requesting TFTP data (GRUB) from weavile (the MAAS server), and that goes OK; then there's a LONG string of exchanges like this (copied from the wireshark summary screen):

6399 109.865912 QuantaCo_f2:96:c5 Broadcast ARP 60 Who has 10.1.10.2? Tell 10.1.10.164
6400 109.865935 HewlettP_f5:69:f1 QuantaCo_f2:96:c5 ARP 42 10.1.10.2 is at 48:0f:cf:f5:69:f1

This goes on for about 31 seconds, according to the time stamps. The mapping of IP addresses and MAC addresses looks correct (weavile is 10.1.10.2 and 48:0f:cf:f5:69:f1; jehan is a8:1e:84:f2:96:c5 and a8:1e:84:f2:96:c6, using whatever IP address weavile gives it, which does seem to be 10.1.10.164 on this run).
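
To put numbers on how long the ARP exchange repeats, the Wireshark summary can be exported as text and scanned. A small Python sketch, assuming whitespace-separated summary lines in the column order shown above (number, time, source, destination, protocol, length, info); the function name is made up for illustration:

```python
def arp_span(summary_lines, ip="10.1.10.2"):
    """Return (request_count, reply_count, seconds) for ARP traffic about `ip`."""
    requests = replies = 0
    times = []
    for line in summary_lines:
        # Keep the free-form info column intact as the seventh field.
        fields = line.split(maxsplit=6)
        if len(fields) < 7 or fields[4] != "ARP":
            continue
        info = fields[6]
        if info.startswith(f"Who has {ip}?"):
            requests += 1
        elif info.startswith(f"{ip} is at"):
            replies += 1
        else:
            continue
        times.append(float(fields[1]))
    # Elapsed time between the first and last matching ARP frame.
    seconds = max(times) - min(times) if times else 0.0
    return requests, replies, seconds


if __name__ == "__main__":
    sample = [
        "6399 109.865912 QuantaCo_f2:96:c5 Broadcast ARP 60 Who has 10.1.10.2? Tell 10.1.10.164",
        "6400 109.865935 HewlettP_f5:69:f1 QuantaCo_f2:96:c5 ARP 42 10.1.10.2 is at 48:0f:cf:f5:69:f1",
    ]
    print(arp_span(sample))
```

Roughly equal request and reply counts over the whole span would match what the summary screen shows: the server keeps answering and the node keeps asking.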

Andres Rodriguez (andreserl) wrote on 2018-03-02:

#10

So in that pcap when I look at it with vim, I see things like:
¤Û^V^Fj^B^L*Å^@^C^C<9a>ons^@feature_200_final^@feature_nativedisk_cmd^@feature_timeout_style^@^@Possible commands are:^@Possible devices are:^@Possible files are:^@Possible partitions are:^@Possible arguments are:^@Possible things are:^@ %s^@ ^@
^@%s/x86_64-efi/command.lst^@module isn't loaded^@superusers^@user '%s' not found^@Enter username: ^@^H^@%c^@
^@Enter password: ^@access denied^@^@Check whether user is in USERLIST.^@[USERLIST]^@authenticate^@%s/x86_64-efi/fs.lst^@Warning: syntax error (missing slash) in `%s'
^@Warning: invalid foreground color `%s'

Which eventually:

¤Û^V^Fj^B^L*Å^@^C^C<9b>arning: invalid background color `%s'
^@black^@blue^@green^@cyan^@red^@magenta^@brown^@light-gray^@dark-gray^@light-blue^@light-green^@light-cyan^@light-red^@light-magenta^@yellow^@white^@)^@^@%s,%s^@" ^@' ^@ ^@.^@..^@%s/^@set^@-u^@--help^@--usage^@--%s^@/^@Sunday^@Monday^@Tuesday^@Wednesday^@Thursday^@Friday^@Saturday^@
^@Falling back to `%s'^@

^@
^@Press any key to continue...^@
^@Failed to boot both default and fallback entries.
^@timeout^@%d^@default^@timeout_style^@%d ^@theme^@gfxmenu^@module `%s' isn't loaded^@gfxterm^@fallback^@chosen^@0^@boot^@menu^M]<98>Zyñ^N^@<^@^@^@<^@^@^@H^OÏõiñ¨^^<84>ò<96>Å^H^@E^@^@ d6^@^@@^Qíï

Could this be that the firmware is the issue here?

Andres Rodriguez (andreserl) wrote on 2018-03-02:

#11

Looking at the logs, I also see this:

2018-03-01 12:31:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c5
2018-03-01 12:31:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c5
2018-03-01 12:31:13 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by a8:1e:84:f2:96:c5
2018-03-01 12:32:25 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6
2018-03-01 12:32:26 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by a8:1e:84:f2:96:c6
2018-03-01 12:32:26 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by a8:1e:84:f2:96:c6

I don't know whether that's the firmware doing it automatically (e.g. trying to boot from one interface, and then another) or someone doing it manually, but judging from the logs, I would imagine one of two things:

1. The firmware is taking a long time to download the requested file and times out after 30 seconds.
2. The firmware just doesn't really do anything once it downloads the file.

Assuming that other EFI machines work just fine, I would say this could even be a firmware-related issue. Can you confirm other EFI machines work?

Rod Smith (rodsmith) wrote on 2018-03-02:

#12

Andres, the fragments you're seeing in the pcap file look like GRUB code to me. Keep in mind that's just a raw capture of EVERYTHING that passed over the MAAS server's local interface for the time during which I was booting the node, so that includes the GRUB binary that was passed from the MAAS server to the node, and even a few interactions with unrelated machines. You'd need to use Wireshark or something similar to properly decode it.
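
For a quick look without Wireshark, the classic pcap file layout is simple enough to walk with Python's standard library. The sketch below is illustrative only: it assumes a classic (non-pcapng) capture with Ethernet link type, and it merely counts frames by EtherType so that ARP traffic can be told apart from the IPv4 frames that carry the GRUB binary. The function name is hypothetical.

```python
import struct
from collections import Counter

ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}


def count_ethertypes(data: bytes) -> Counter:
    """Count Ethernet frames per EtherType in a classic pcap byte string."""
    magic = data[:4]
    # The magic number reveals the byte order (microsecond or nanosecond variant).
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    # The link type is the last 4 bytes of the 24-byte global header; 1 is Ethernet.
    (linktype,) = struct.unpack_from(endian + "I", data, 20)
    if linktype != 1:
        raise ValueError(f"unsupported link type {linktype}")
    counts = Counter()
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        # Per-packet header: ts_sec, ts_frac, captured length, original length.
        _, _, incl_len, _ = struct.unpack_from(endian + "IIII", data, offset)
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        if len(frame) >= 14:
            # The EtherType sits after the two 6-byte MAC addresses, in network order.
            (ethertype,) = struct.unpack_from("!H", frame, 12)
            counts[ETHERTYPES.get(ethertype, hex(ethertype))] += 1
    return counts
```

Reading the attached capture into memory (for example with open(path, "rb").read(), where path is wherever the attachment was saved) and passing the bytes to count_ethertypes() would give a rough breakdown. Wireshark remains the better tool for decoding the TFTP and ARP exchanges themselves.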

As noted in my comment #9, above, it looks to me like ARP responses from the MAAS server are getting lost by the node AFTER GRUB has been (presumably successfully, although I've not tried to analyze the tcpdump results to verify this) delivered to the node. This could be a GRUB bug, a Quanta firmware bug, or a bug in how the two interact with each other. Then too, my knowledge of how they all interact is limited, so my conclusion may be in error.

As to the rest, both Jeff and I tried completely disabling each of the machine's two network ports (they're 10Gbps fiber connections, FWIW). In some of my own tests, I tried typing "exit" at the "grub>" prompt. When both ports were enabled, this resulted in a second boot attempt, presumably one from each device. The TFTP request log snippet you've quoted looks consistent with one of those runs.

Other EFI machines do work fine; I deployed two yesterday, for instance. (They were both Quanta machines, but a different model.)

Jeff Lane (bladernr) wrote on 2018-05-03:

#13

Just for grins, Rod, do we have any 1Gb PCIe cards we could throw in jehan (when you're in Lex next time, or maybe sfeole can) to see if we can get it to PXE from SOMETHING in EFI mode?

Also, you flashed the system firmware; how about the firmware for the NICs?

Rod Smith (rodsmith) wrote on 2018-05-04:

#14

This may be a duplicate of bug #1437353.