arm64 AWS host hangs during modprobe nvidia on lunar and mantic
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux-aws (Ubuntu) | New | Undecided | Unassigned |
nvidia-graphics-drivers-525 (Ubuntu) | New | Undecided | Unassigned |
nvidia-graphics-drivers-525-server (Ubuntu) | New | Undecided | Unassigned |
nvidia-graphics-drivers-535 (Ubuntu) | New | Undecided | Unassigned |
nvidia-graphics-drivers-535-server (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Loading the nvidia driver DKMS modules with "modprobe nvidia" results in the host hanging and becoming completely unusable. This was reproduced with both the generic and linux-aws kernels on lunar and mantic, using an AWS g5g.xlarge instance.
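Before following the steps below, a quick sanity check can confirm the host matches the affected configuration. A minimal sketch using standard tools (nothing here is specific to this report):
$ uname -rm           # kernel release and architecture; expect aarch64
$ lsb_release -cs     # expect lunar or mantic
$ lspci | grep -i nvidia   # the GPU should be visible (g5g instances expose an NVIDIA T4G)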
To reproduce using the generic kernel:
# Deploy an arm64 host with an NVIDIA GPU, such as an AWS g5g.xlarge.
# Install the linux-generic kernel from lunar-updates:
$ sudo DEBIAN_
# Boot into the linux-generic kernel (this can be accomplished by removing the existing kernel; in this case that was the linux-aws 6.2.0-1008-aws kernel)
$ sudo DEBIAN_
$ reboot
# Install the Nvidia 535-server driver DKMS package:
$ sudo DEBIAN_
# Load the driver module:
$ sudo modprobe nvidia
# At this point the system will hang and never return.
# Rebooting instead of running modprobe results in a system that never boots all the way up. I was able to recover the console logs from such a system (see the console-capture sketch after the log below) and found the following (the full captured log is attached):
[ 1.964942] nvidia: loading out-of-tree module taints kernel.
[ 1.965475] nvidia: module license 'NVIDIA' taints kernel.
[ 1.965905] Disabling lock debugging due to kernel taint
[ 1.980905] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.012067] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 2.012715]
[ 62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 62.025807] rcu: 3-...0: (14 ticks this GP) idle=c04c/
[ 62.026516] (detected by 0, t=15003 jiffies, g=-699, q=216 ncpus=4)
[ 62.027018] Task dump for CPU 3:
[ 62.027290] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000e
[ 62.028066] Call trace:
[ 62.028273] __switch_
[ 62.028567] 0x228
Timed out for waiting the udev queue being empty.
Timed out for waiting the udev queue being empty.
[ 242.045143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 242.045655] rcu: 3-...0: (14 ticks this GP) idle=c04c/
[ 242.046373] (detected by 1, t=60008 jiffies, g=-699, q=937 ncpus=4)
[ 242.046874] Task dump for CPU 3:
[ 242.047146] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000f
[ 242.047922] Call trace:
[ 242.048128] __switch_
[ 242.048417] 0x228
Timed out for waiting the udev queue being empty.
Begin: Loading essential drivers ... [ 384.001142] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [modprobe:215]
[ 384.001738] Modules linked in: nvidia(POE+) crct10dif_ce video polyval_ce polyval_generic drm_kms_helper ghash_ce syscopyarea sm4 sysfillrect sha2_ce sysimgblt sha256_arm64 sha1_ce drm nvme nvme_core ena nvme_common aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 384.003513] CPU: 2 PID: 215 Comm: modprobe Tainted: P OE 6.2.0-26-generic #26-Ubuntu
[ 384.004210] Hardware name: Amazon EC2 g5g.xlarge/, BIOS 1.0 11/1/2018
[ 384.004715] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 384.005259] pc : smp_call_
[ 384.005683] lr : smp_call_
[ 384.006108] sp : ffff8000089a3a70
[ 384.006381] x29: ffff8000089a3a70 x28: 0000000000000003 x27: ffff00056d1fafa0
[ 384.006954] x26: ffff00056d1d76c8 x25: ffffc87cf18bdd10 x24: 0000000000000003
[ 384.007527] x23: 0000000000000001 x22: ffff00056d1d76c8 x21: ffffc87cf18c2690
[ 384.008086] x20: ffff00056d1fafa0 x19: ffff00056d1d76c0 x18: ffff80000896d058
[ 384.008645] x17: 0000000000000000 x16: 0000000000000000 x15: 617362755f5f0073
[ 384.009209] x14: 0000000000000001 x13: 0000000000000006 x12: 4630354535323145
[ 384.009779] x11: 0101010101010101 x10: ffffb78318e9c0e0 x9 : ffffc87ceeac7da4
[ 384.010339] x8 : ffff00056d1d76f0 x7 : 0000000000000000 x6 : 0000000000000000
[ 384.010894] x5 : 0000000000000004 x4 : 0000000000000000 x3 : ffff00056d1fafa8
[ 384.011464] x2 : 0000000000000003 x1 : 0000000000000011 x0 : 0000000000000000
[ 384.012030] Call trace:
[ 384.012241] smp_call_
[ 384.012635] kick_all_
[ 384.012961] flush_module_
[ 384.013294] load_module+
[ 384.013588] __do_sys_
[ 384.013944] __arm64_
[ 384.014306] invoke_
[ 384.014613] el0_svc_
[ 384.015000] do_el0_
[ 384.015280] el0_svc+0x30/0xe0
[ 384.015540] el0t_64_
[ 384.015896] el0t_64_
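For anyone reproducing this: one way to capture a console log like the one above from a hung instance, without needing guest access, is the EC2 console-output API. A sketch with the standard AWS CLI (the instance ID is a placeholder):
$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --latest --output text > console.log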
This same procedure impacts the 525, 525-server, 535 and 535-server drivers. It does *not* hang a similarly configured host running focal or jammy.
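If a host has already been left in the never-booting state, one possible recovery path (a sketch, not verified here, relying on kmod honoring modprobe.blacklist= on the kernel command line) is to keep the module from loading and then remove the DKMS package:
# At the GRUB menu (or from a rescue instance), append to the kernel command line:
#     modprobe.blacklist=nvidia
# After booting, remove the driver so udev does not trigger the load again
# (package names should match whichever driver was installed, e.g. 535-server):
$ sudo apt-get purge nvidia-dkms-535-server nvidia-driver-535-server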
Although nvidia seems to be the trigger here, the crashing code appears to be pure generic Linux: arch/arm64/kernel/syscall.c
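To help confirm that, the frames in the trace can be resolved to file:line against a kernel with debug symbols. A rough sketch using the in-tree scripts/decode_stacktrace.sh helper, run from a kernel source tree matching 6.2.0-26-generic, and assuming the Ubuntu dbgsym vmlinux from ddebs.ubuntu.com is installed (the exact dbgsym package name varies by release); this needs a freshly captured log, since the copy above has the symbol offsets truncated:
$ ./scripts/decode_stacktrace.sh /usr/lib/debug/boot/vmlinux-6.2.0-26-generic < console.log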