amdgpu crash on Mantic

Bug #2036742 reported by Paolo Gentili
This bug affects 3 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

[Impact]

When booting the latest Mantic Desktop daily image (2023-09-20) from USB, nothing is displayed on screen after the initial boot logs. The system is still alive, since _autoinstall_ works as intended. The problem persists once the system is provisioned.

It seems related to https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/2029396 .

dmesg attached.

[Test Case]

Live boot Ubuntu Mantic Desktop canary (2023-09-19)

[Where Problems Could Occur]

Dell Optiplex 5090
- Intel Core(TM) i7-11700
- Advanced Micro Devices, Inc. [AMD/ATI] - 1002:699f

Tags: mantic
Paolo Gentili (pgentili) wrote :

Linux 6.3

Paolo Gentili (pgentili) wrote :
description: updated
Changed in linux-firmware (Ubuntu):
milestone: none → ubuntu-23.10
Dimitri John Ledkov (xnox) wrote :

[ 4.918050] kernel: UBSAN: array-index-out-of-bounds in /build/linux-IPoq5q/linux-6.5.0/drivers/gpu/drm/amd/amdgpu/../pm/powerplay/hwmgr/smu7_hwmgr.c:3669:4

is not good

Mario Limonciello (superm1) wrote :

> It seems related to https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/2029396 .

I don't believe these are related. That issue is specific to navi3x dGPUs; your system has a much older dGPU.

Your 6.3 and 6.5 logs both appear to crash similarly. Is there a point in time at which this system worked correctly?

Can you try a few of the mainline kernels to narrow it down? These are the ones I think would be the best candidates:

https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.6-rc2/
https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.54/
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.19.17/
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.132/

affects: linux-firmware (Ubuntu) → linux (Ubuntu)
Juerg Haefliger (juergh) wrote :

UBSAN warnings could be a red herring. They've added a compiler flag that complains about flexible arrays if they're declared incorrectly (false positive). Will take a look tomorrow.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2036742

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Mario Limonciello (superm1) wrote :

> [ 5.134271] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.
> [ 5.322247] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.
> [ 5.510230] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.

Is this connected to a KVM? The failure to read the EDID is concerning.

> UBSAN warnings could be a red herring. They've added a compiler flag that complains about flexible arrays if they're declared incorrectly (false positive). Will take a look tomorrow.

Yeah, I agree they're probably a red herring. The actual issue is that the UVD IP block fails to initialize due to a timeout (-110).

[ 6.025262] kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
[ 6.025511] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v6_0> failed -110
[ 6.025661] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.025663] kernel: amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 6.025737] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
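For anyone comparing several dmesg captures, a minimal Python sketch like the one below can pull out which IP block failed and at which stage. The function name, the regular expression and the small errno table are illustrative assumptions, not part of any kernel or Ubuntu tooling; the errno names are the Linux values for 2 and 110.

```python
import re

# Linux errno values that appear in these logs (assumed, Linux-specific mapping).
ERRNO_NAMES = {2: "ENOENT", 110: "ETIMEDOUT"}

# Matches e.g. "*ERROR* hw_init of IP block <uvd_v6_0> failed -110"
FAILURE_RE = re.compile(r"(hw_init|sw_init) of IP block <([^>]+)> failed -(\d+)")


def find_ip_block_failures(dmesg_text):
    """Return (stage, block_name, errno_name) for each IP block init failure."""
    failures = []
    for match in FAILURE_RE.finditer(dmesg_text):
        stage, block, code = match.group(1), match.group(2), int(match.group(3))
        # Fall back to the raw number when the errno is not in the small table.
        failures.append((stage, block, ERRNO_NAMES.get(code, str(code))))
    return failures


sample = """\
[ 6.025262] kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
[ 6.025511] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v6_0> failed -110
"""
print(find_ip_block_failures(sample))
```

Running it over a full log (for example the saved output of `journalctl -k`) should list every block that failed, which makes it easier to see whether different kernels fail in the same place.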

As a potential workaround (this isn't a solution), you might be able to skip the uvd_v6_0 IP block init.
To do this, you need to look up its IP block number, which is listed in your logs:

[ 4.836457] kernel: [drm] add ip block number 0 <vi_common>
[ 4.836458] kernel: [drm] add ip block number 1 <gmc_v8_0>
[ 4.836459] kernel: [drm] add ip block number 2 <tonga_ih>
[ 4.836459] kernel: [drm] add ip block number 3 <gfx_v8_0>
[ 4.836460] kernel: [drm] add ip block number 4 <sdma_v3_0>
[ 4.836461] kernel: [drm] add ip block number 5 <powerplay>
[ 4.836462] kernel: [drm] add ip block number 6 <dm>
[ 4.836462] kernel: [drm] add ip block number 7 <uvd_v6_0>
[ 4.836463] kernel: [drm] add ip block number 8 <vce_v3_0>

Then you can add "amdgpu.ip_block_mask=0xffffff7f" to your kernel command line to skip IP block 7 (uvd_v6_0).
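The mask is just a 32-bit value with one bit cleared per IP block to skip, where the bit position is the block number from the "add ip block number" lines. The following Python sketch derives it from a log excerpt; the helper name and regular expression are assumptions made for illustration, and only the `amdgpu.ip_block_mask` parameter itself comes from the comment above.

```python
import re

# Matches e.g. "[drm] add ip block number 7 <uvd_v6_0>"
BLOCK_RE = re.compile(r"add ip block number (\d+) <([^>]+)>")


def ip_block_mask(dmesg_text, skip_names):
    """Build the kernel parameter with the bits of the named IP blocks cleared."""
    numbers = {name: int(num) for num, name in BLOCK_RE.findall(dmesg_text)}
    mask = 0xFFFFFFFF
    for name in skip_names:
        # Clear the bit whose position equals the block number.
        mask &= ~(1 << numbers[name])
    return f"amdgpu.ip_block_mask=0x{mask & 0xFFFFFFFF:08x}"


sample = """\
[ 4.836462] kernel: [drm] add ip block number 6 <dm>
[ 4.836462] kernel: [drm] add ip block number 7 <uvd_v6_0>
[ 4.836463] kernel: [drm] add ip block number 8 <vce_v3_0>
"""
print(ip_block_mask(sample, ["uvd_v6_0"]))              # amdgpu.ip_block_mask=0xffffff7f
print(ip_block_mask(sample, ["uvd_v6_0", "vce_v3_0"]))  # amdgpu.ip_block_mask=0xfffffe7f
```

A name that does not appear in the log raises a KeyError, which is probably the safest behaviour here since block numbering differs between GPUs.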

If that helps, it confirms that the out-of-bounds check is a red herring and that the real issue is in the UVD init. I'd still like to see data points for the other kernels I suggested, to narrow down when this problem started.

Timo Aaltonen (tjaalton) wrote :

Yes, it's a KVM of sorts. But apparently this did work before, so a rough bisect with mainline builds will be conducted.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Paolo Gentili (pgentili) wrote :

The setup does indeed involve a custom KVM; I'll attach the EDID file used in this setup.

I tested everything as requested, with no luck unfortunately. The working configuration, which I've now replicated, is the OEM image for Ubuntu 20.04 with which the device was certified.

Please find attached every collected dmesg and the EDID file.

Mario Limonciello (superm1) wrote :

> dmesg-ip-block-mask

[ 6.150330] kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vce0 test failed (-110)
[ 6.150581] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <vce_v3_0> failed -110
[ 6.150726] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.150728] kernel: amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 6.150730] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.

It looks like the vce IP block also fails if you skip uvd. We can play whack-a-mole on the others (by also clearing bit 8), but I suspect this means the issue is not in the IP blocks themselves but in higher-level code.

> dmesg-5.15

This one fails differently from all the newer ones. It's failing because of a missing firmware file:

[ 2.842833] kernel: amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris12_k_mc.bin failed with error -2
[ 2.842836] kernel: amdgpu: mc: Failed to load firmware "amdgpu/polaris12_k_mc.bin"
[ 2.842841] kernel: [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
[ 2.843033] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -2

Any idea why that's missing?

I noticed that the firmware load also failed for i915; it looks like the firmware package is missing entirely.

[ 2.743220] kernel: i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=mem
[ 2.744329] kernel: i915 0000:00:02.0: Direct firmware load for i915/rkl_dmc_ver2_03.bin failed with error -2
[ 2.744333] kernel: i915 0000:00:02.0: [drm] Failed to load DMC firmware i915/rkl_dmc_ver2_03.bin. Disabling runtime power management.
[ 2.744334] kernel: i915 0000:00:02.0: [drm] DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915

> dmesg 5.10.0-1034-oem

Can you see if it keeps working with the latest 5.10 OEM kernel? It looks like 1057 is the newest.

Paolo Gentili (pgentili) wrote :

> Can you see if it keeps working with the latest 5.10 OEM kernel? It looks like 1057 is the newest.

Yes, it still works with that version.

Mario Limonciello (superm1) wrote :

Thanks for checking. So good to know it hasn't regressed from stable in 5.10. Can you redo your 5.15 test with the firmware in place so we can see if we're OK there or not?

Mario Limonciello (superm1) wrote :

I don't expect it helps your boot issue, but the UBSAN issue will be fixed by this commit.

https://<email address hidden>/T/#me31ff6b88640b03be1a8edfc6fc8878ac78ca6bb

Please redo the test with 5.15.

Paolo Gentili (pgentili) wrote :

> Thanks for checking. So good to know it hasn't regressed from stable in 5.10. Can you redo your 5.15 test with the firmware in place so we can see if we're OK there or not?

Unfortunately still no luck. I booted Focal with 5.10 OEM and then rebooted to Mantic with 5.15.134. The screen freezes at the end of boot logs. Attached dmesg.

Mario Limonciello (superm1) wrote :

Something is really fishy here: the 5.15 test is again missing firmware for i915, amdgpu and ath10k:

[ 2.345119] kernel: i915 0000:00:02.0: Direct firmware load for i915/rkl_dmc_ver2_03.bin failed with error -2
[ 2.345122] kernel: i915 0000:00:02.0: [drm] Failed to load DMC firmware i915/rkl_dmc_ver2_03.bin. Disabling runtime power management.
[ 2.345123] kernel: i915 0000:00:02.0: [drm] DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915

[ 2.361656] kernel: amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris12_k_mc.bin failed with error -2
[ 2.361658] kernel: amdgpu: mc: Failed to load firmware "amdgpu/polaris12_k_mc.bin"
[ 2.361662] kernel: [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
[ 2.361797] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -2

[ 2.545915] kernel: ath10k_pci 0000:03:00.0: Failed to find firmware-N.bin (N between 2 and 6) from ath10k/QCA6174/hw3.0: -2
[ 2.545920] kernel: ath10k_pci 0000:03:00.0: could not fetch firmware files (-2)
[ 2.545922] kernel: ath10k_pci 0000:03:00.0: could not probe fw (-2)

I think you might be missing another firmware package. Did it get split up in mantic?
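To collect the missing files from a capture in one go, a short Python sketch such as the following can list every firmware path whose direct load failed with -2 (ENOENT). The function name and regular expression are illustrative assumptions. Note that it only matches the generic "Direct firmware load" message, so driver-specific messages like the ath10k lines above are not picked up.

```python
import re

# Matches e.g. "Direct firmware load for amdgpu/polaris12_k_mc.bin failed with error -2"
DIRECT_LOAD_RE = re.compile(r"Direct firmware load for (\S+) failed with error (-\d+)")


def missing_firmware(dmesg_text):
    """Return sorted unique firmware paths whose direct load failed with -2."""
    return sorted(
        {name for name, err in DIRECT_LOAD_RE.findall(dmesg_text) if err == "-2"}
    )


sample = """\
[ 2.345119] kernel: i915 0000:00:02.0: Direct firmware load for i915/rkl_dmc_ver2_03.bin failed with error -2
[ 2.361656] kernel: amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris12_k_mc.bin failed with error -2
[ 2.545920] kernel: ath10k_pci 0000:03:00.0: could not fetch firmware files (-2)
"""
print(missing_firmware(sample))
```

When several unrelated drivers show up in the list at once, that points at the firmware packaging rather than at any single driver.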

Paolo Gentili (pgentili) wrote :

On Focal, it also works with 5.14.0-1059. dmesg attached.

Kristijan Žic (kristijan-zic) wrote :

I think I have the same issue, but I'm not sure. Please advise whether I can test anything, and how.

GPU: AMD Radeon RX Vega 64 Liquid
CPU: AMD Threadripper 1900x
DS: Wayland

With the new installer:
The new installer crashes the entire session when it opens the “Connectivity” screen.

---------------------------------------------------

Having installed the OS using the legacy installer:

About 10 to 30 seconds after launching and using any app, the screen freezes, turns off, and then comes back showing colourful artifacts while still frozen.

The attachment shows an example of what happens when I start using App Store.

Kristijan Žic (kristijan-zic) wrote :

The attachment shows an example of what happens when I start using the Brave browser.

With X, the session crashes, gdm restarts, and I am brought back to the login screen.
In any case it's unusable, as it crashes maybe 10 seconds after opening any app.

Mario Limonciello (superm1) wrote :

This seems like a different bug to me. Can you please open a different report and attach the relevant logs from the journal to it?

Timo Aaltonen (tjaalton) wrote :

The firmware on mantic is zstd-compressed, so older mainline builds can't load it.

Mario Limonciello (superm1) wrote :

> The firmware on mantic is zstd-compressed, so older mainline builds can't load it.

As a hack, then, maybe just clone https://gitlab.com/kernel-firmware/linux-firmware and put everything (uncompressed) in /lib/firmware/updates/ to get around that.

Or run the check on Jammy instead.
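
Before retesting an older kernel, it may help to confirm which form of a given firmware file is actually on disk. The Python sketch below checks a few candidate locations under a firmware root; the helper name and labels are hypothetical, and it covers only a subset of the paths the kernel searches (for instance, it ignores the per-kernel-release subdirectories).

```python
from pathlib import Path
import tempfile


def firmware_forms(name, root="/lib/firmware"):
    """Report which on-disk forms of a firmware file exist under root."""
    root = Path(root)
    candidates = {
        "updates (uncompressed)": root / "updates" / name,
        "uncompressed": root / name,
        "zstd-compressed": root / (name + ".zst"),
    }
    return {label: path.is_file() for label, path in candidates.items()}


# Demo against a throwaway directory that mimics a zstd-only firmware tree.
with tempfile.TemporaryDirectory() as tmp:
    fw_dir = Path(tmp) / "amdgpu"
    fw_dir.mkdir()
    (fw_dir / "polaris12_k_mc.bin.zst").write_bytes(b"")
    demo = firmware_forms("amdgpu/polaris12_k_mc.bin", root=tmp)
print(demo)
```

On a real system, calling `firmware_forms("amdgpu/polaris12_k_mc.bin")` with the default root would show whether only the `.zst` variant is present, which is the situation a kernel without zstd firmware support cannot handle.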
