HVM/SSD Xenial AWS AMI doesn't launch on r3.large

Bug #1582776 reported by Jay Berkenbilt
This bug affects 1 person
Affects          Status         Importance  Assigned to       Milestone
linux (Ubuntu)   Fix Released   High        Joseph Salisbury
Xenial           Fix Committed  High        Joseph Salisbury
Yakkety          Fix Released   High        Joseph Salisbury

Bug Description

**NOTE** The additional information attached to this bug report is for my laptop and does not apply to this bug report. Using ubuntu-bug, I'm not sure how to control this. Using the debian bug tool, I would have just edited the file....

If you attempt to launch an ec2 instance using

IMAGE ami-c1cb23ac 099720109477/ubuntu/images-testing/hvm-ssd/ubuntu-xenial-daily-amd64-server-20160516.1 099720109477 available public x86_64 machine

on an instance of type r3.large, the kernel doesn't boot. You get output such as the attached log (kernel-ami-c1cb23ac.log).

You get essentially the same result if you try to launch an image from ami-840910ee which is the AMI you get from http://cloud-images.ubuntu.com/locator/ with the search string "16.04 hvm us-east ssd" and picking the latest version.

HVM is supposed to work on r3.large. We have used this successfully with Ubuntu 12.04, 14.04, and 15.10 and also with CentOS 5, 6, and 7. r3.large is supposed to work with both HVM and PV. We have no problems with this AMI on m4 or c4 instances which only support HVM.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-22-generic 4.4.0-22.40
ProcVersionSignature: Ubuntu 4.4.0-22.39-generic 4.4.8
Uname: Linux 4.4.0-22-generic x86_64
ApportVersion: 2.20.1-0ubuntu2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ejb 14140 F.... pulseaudio
Date: Tue May 17 11:06:11 2016
HibernationDevice: RESUME=UUID=1c991c18-29ed-4f63-b481-89ef40b91b94
InstallationDate: Installed on 2016-04-22 (24 days ago)
InstallationMedia: Xubuntu 16.04 LTS "Xenial Xerus" - Release amd64 (20160420.1)
IwConfig:
 docker0 no wireless extensions.

 enp0s3 no wireless extensions.

 lo no wireless extensions.
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 002: ID 80ee:0021 VirtualBox USB Tablet
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: innotek GmbH VirtualBox
ProcFB: 0 vboxdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-22-generic root=/dev/mapper/jblin0-root_16_04 ro quiet splash
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-22-generic N/A
 linux-backports-modules-4.4.0-22-generic N/A
 linux-firmware 1.157
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/01/2006
dmi.bios.vendor: innotek GmbH
dmi.bios.version: VirtualBox
dmi.board.name: VirtualBox
dmi.board.vendor: Oracle Corporation
dmi.board.version: 1.2
dmi.chassis.type: 1
dmi.chassis.vendor: Oracle Corporation
dmi.modalias: dmi:bvninnotekGmbH:bvrVirtualBox:bd12/01/2006:svninnotekGmbH:pnVirtualBox:pvr1.2:rvnOracleCorporation:rnVirtualBox:rvr1.2:cvnOracleCorporation:ct1:cvr:
dmi.product.name: VirtualBox
dmi.product.version: 1.2
dmi.sys.vendor: innotek GmbH

Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the 16.04 kernel against other instance types to see if the bug happens with them as well?

The latest 16.04 kernel can be downloaded from:
https://launchpad.net/~canonical-kernel-security-team/+archive/ubuntu/ppa/+build/9734451

With this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

If the bug is tied to the kernel, we could then test other kernel versions to narrow down the last good version and the first bad version. That will allow us to perform a kernel bisect to find the exact commit that caused this.

Thanks in advance!

Jay Berkenbilt (ejb) wrote :

I know it works on c4 and m4 instance types. Those are the only other ones I've tried. To test this kernel on r3, I'd probably have to launch an instance on m4, upgrade the kernel from above, create a new AMI, and try running that on r3. I'm in major crunch right now, so I can't stop what I'm doing to do that, but I can try to squeeze it in while I'm waiting for other stuff to run, etc. We're upgrading our infrastructure to 16.04 and, for now, I'm just staying away from the r3 instances, but it will be important for us to support them at some point.

I believe I may have misspoken about r3 supporting PV. c3 supports both PV and HVM. r3 only supports HVM.

tags: added: kernel-key
Stefan Bader (smb) wrote :

I wonder how AWS manages to get that odd (literally) number of CPUs. From the dmesg:

smpboot: Allowing 15 CPUs, 13 hotplug CPUs

From that it looks like possible=15 and num_processors=2 in prefill_possible_map(). And possible is assigned to nr_cpu_ids after printing the line above. Confirmed in dmesg a little further down:

setup_percpu: NR_CPUS:256 nr_cpumask_bits:256 nr_cpu_ids:15 nr_node_ids:1
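
As a rough illustration of how the two numbers in the smpboot line relate, here is a small user-space sketch. It is not kernel code: the struct and the function parse_smpboot_line() are hypothetical names, and it assumes the hotplug count printed by the kernel is simply possible minus num_processors.

```c
#include <stdio.h>

/* Hypothetical helper (not kernel code): holds the values read from the
 * "smpboot: Allowing N CPUs, M hotplug CPUs" line. */
struct cpu_counts {
    int possible;  /* N in "Allowing N CPUs" */
    int hotplug;   /* M in "M hotplug CPUs" */
    int present;   /* assumed to be N - M, i.e. num_processors */
};

/* Parse the smpboot line and derive the number of CPUs present at boot.
 * Returns 0 on success, -1 if the line does not match the expected text. */
int parse_smpboot_line(const char *line, struct cpu_counts *out)
{
    if (sscanf(line, "smpboot: Allowing %d CPUs, %d hotplug CPUs",
               &out->possible, &out->hotplug) != 2)
        return -1;
    out->present = out->possible - out->hotplug;
    return 0;
}
```

Fed the line quoted above, this yields possible=15 and present=2, which matches the possible=15 and num_processors=2 reading and the nr_cpu_ids:15 value in the setup_percpu line.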

The divide error is in smp_init_package_map() which is inlined from smp_store_boot_cpu_info(). This would print "Max logical packages: ..." at some point which is not in dmesg. The only places there that look like they could cause a divide by 0 would be using ncpus. The first instance here:

ncpus = boot_cpu_data.x86_max_cores;
__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);

Have not yet dug down into where x86_max_cores gets set.

Stefan Bader (smb) wrote :

OK, looks like we need at least:

commit 56402d63eefe22179f7311a51ff2094731420406
Author: Thomas Gleixner <email address hidden>
Date: Fri May 6 20:48:16 2016 +0200

    x86/topology: Handle CPUID bogosity gracefully

    Joseph reported that a XEN guest dies with a division by 0 in the package
    topology setup code. This happens if cpu_info.x86_max_cores is zero.

That at least will avoid the crash. But the change only sets ncpus to 1 if it is 0. That still might lead to an unexpected number of available CPUs...
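
To make the concern concrete, here is a minimal user-space sketch of the guarded computation. It is an illustration under assumptions, not the kernel patch itself: max_logical_packages() is a hypothetical wrapper, DIV_ROUND_UP is re-defined locally in its usual form, and the input values are made up.

```c
/* Local copy of the usual rounding-up division macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical wrapper (not the kernel function): compute the number of
 * logical packages from the total CPU count and the cores-per-package
 * value. A zero max_cores is replaced by 1 so the division cannot trap. */
unsigned int max_logical_packages(unsigned int total_cpus,
                                  unsigned int max_cores)
{
    unsigned int ncpus = max_cores;

    if (ncpus == 0)
        ncpus = 1;  /* fallback described above: treat 0 as 1 */

    return DIV_ROUND_UP(total_cpus, ncpus);
}
```

With the fallback, a call such as max_logical_packages(15, 0) no longer divides by zero, but it counts every CPU as its own package, which is the kind of unexpected topology worried about here.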

Joseph Salisbury (jsalisbury) wrote :

That commit also fixes bug 1573231. I have a test kernel already built for that bug if you want to test it. It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1573231/

Jay Berkenbilt (ejb) wrote :

I will test it and post results here.

Jay Berkenbilt (ejb) wrote :

Yes, I am able to boot an r3.large instance using this kernel, and I see the correct number of CPUs. My procedure, in case anyone else wants to reproduce it, was to boot up from the latest test AMI (ami-c1cb23ac), run sudo dpkg -i ./linux-headers-4.4.0-21* linux-image-* on the packages downloaded from your link, and manually edit /boot/grub/grub.cfg to put this kernel first, since it is an earlier version than 4.4.0-22, which is what was previously installed. I rebooted to make sure I got the new kernel and that the m4.large still worked. Then I stopped the instance, changed its instance type to r3.large, and started it. It came up fine, and /proc/cpuinfo shows the expected number of CPUs.

Changed in linux (Ubuntu):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
tags: removed: kernel-key