CPU lockup on HP Proliant DL380 Gen9 servers
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Won't Fix | High | Unassigned | | |
Bug Description
Over the past three weeks or so, three separate HP ProLiant DL380 Gen9 servers have locked up with a similar-looking CPU lockup bug. All three are nova-compute nodes in an OpenStack cluster and carry a reasonable amount of load. The symptoms are that the load average shoots up into the hundreds and ps stops returning.
I've attached the output of lspci -vnvn and three sets of syslog traces, one captured at each of the three crashes.
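For anyone trying to catch this before the box becomes unusable, here is a minimal sketch of a check for the "load in the hundreds" symptom. It is not something we ran on the affected nodes. The function names and the threshold of 100 are assumptions chosen for illustration; the only thing taken as given is the standard /proc/loadavg layout, where the first three fields are the 1, 5 and 15 minute load averages.

```python
def parse_loadavg(text):
    """Return the (1min, 5min, 15min) load averages from /proc/loadavg content."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def load_is_runaway(text, threshold=100.0):
    """True if the 1-minute load average is at or above the threshold.

    The default threshold is a hypothetical value picked to match the
    "load in the hundreds" symptom; tune it for the machine in question.
    """
    one_minute = parse_loadavg(text)[0]
    return one_minute >= threshold


def read_loadavg(path="/proc/loadavg"):
    """Read the raw load average line from the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a live node this would be called as load_is_runaway(read_loadavg()). Reading /proc/loadavg directly avoids depending on ps, which is one of the things that stops returning during the lockup, though there is no guarantee any userspace check still gets scheduled once the CPUs are stuck.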
$ cat /proc/version_
Ubuntu 3.16.0-
Please let us know if you need any further information.
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Sep 29 06:47 seq
crw-rw---- 1 root audio 116, 33 Sep 29 06:47 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.15
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=
MachineType: HP ProLiant DL380 Gen9
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=xterm
PATH=(custom, no user)
XDG_RUNTIME_
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 1.127.15
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.16.0-49-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dialout libvirtd lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 05/06/2015
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:
dmi.product.name: ProLiant DL380 Gen9
dmi.sys.vendor: HP
tags: added: kernel-key removed: kernel-da-key
tags: added: kernel-da-key removed: kernel-key
The first of the lockups. Each of the lockups required us to hard reset the server via the iLO.