amdgpu reset during usage of firefox
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Linux | Unknown | Unknown | |
linux (Ubuntu) | Confirmed | Undecided | Unassigned |
mesa (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Running Nightly on 23.10 (since Monday), I have experienced a few amdgpu resets in the past few hours.
ProblemType: Bug
DistroRelease: Ubuntu 23.10
Package: linux-image-
ProcVersionSign
Uname: Linux 6.5.0-9-generic x86_64
ApportVersion: 2.27.0-0ubuntu5
Architecture: amd64
CasperMD5CheckR
CurrentDesktop: ubuntu:GNOME
Date: Thu Oct 19 18:26:43 2023
HibernationDevice: RESUME=
InstallationDate: Installed on 2022-07-04 (472 days ago)
InstallationMedia: Ubuntu 22.04 LTS "Jammy Jellyfish" - Release amd64 (20220419)
MachineType: {report[
ProcEnviron:
LANG=fr_FR.UTF-8
PATH=(custom, no user)
SHELL=/bin/bash
TERM=xterm-
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=
RelatedPackageV
linux-
linux-
linux-firmware 20230919.
SourcePackage: linux
UpgradeStatus: Upgraded to mantic on 2023-10-16 (3 days ago)
dmi.bios.date: 05/15/2023
dmi.bios.release: 1.24
dmi.bios.vendor: LENOVO
dmi.bios.version: R1MET54W (1.24 )
dmi.board.
dmi.board.name: 21A0CTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: Not Defined
dmi.chassis.
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.
dmi.ec.
dmi.modalias: dmi:bvnLENOVO:
dmi.product.family: ThinkPad P14s Gen 2a
dmi.product.name: 21A0CTO1WW
dmi.product.sku: LENOVO_
dmi.product.
dmi.sys.vendor: LENOVO

|
#6 |
This is more likely a mesa issue than a kernel issue.

|
#7 |
I will try to test with amdgpu-pro sometime this week with the kernel that I mentioned above. If the application works as expected, it could be a Mesa OpenGL bug.

|
#8 |
(In reply to Alex Deucher from comment #1)
> This is more likely a mesa issue than a kernel issue.
No, the 4.14 kernel with the latest Mesa libs works very well without any stalls,
but from 4.20.4 and in all later kernels (including 5.0) the OS freezes every 30 s to 1 min, for about 30 s each time, when browsing YouTube with HW acceleration enabled (UVD) or playing a game. RX550, Arch, vanilla kernel.
[ 365.021164] amdgpu: [powerplay]
[ 365.045198] [drm:amdgpu_
[ 365.570667] amdgpu: [powerplay]
[ 366.115228] [drm:amdgpu_
[ 366.115377] [drm:amdgpu_
[ 366.115388] [drm] Timeout, but no hardware hang detected.
[ 366.689407] amdgpu: [powerplay]
[ 367.232287] amdgpu: [powerplay]
[ 367.787043] amdgpu: [powerplay]
[ 368.320138] amdgpu: [powerplay]
[ 369.367739] amdgpu: [powerplay]
[ 369.907559] amdgpu: [powerplay]
[ 370.994478] amdgpu: [powerplay]
[ 371.538753] amdgpu: [powerplay]
[ 372.075079] amdgpu: [powerplay]
[ 372.598565] amdgpu: [powerplay]
[ 373.657188] amdgpu: [powerplay]
[ 374.198637] amdgpu: [powerplay]
[ 375.075076] [drm:amdgpu_
[ 375.284948] amdgpu: [powerplay]
[ 375.830347] amdgpu: [powerplay]
[ 376.138428] [drm:amdgpu_
[ 376.138783] [drm:amdgpu_
[ 376.138797] [drm] IP block:sdma_v3_0 is hung!
[ 376.138809] [drm] GPU recovery disabled.
[ 376.394657] amdgpu: [powerplay]
[ 376.934375] amdgpu: [powerplay]
[ 377.463230] amdgpu: [powerplay]
[ 377.977725] amdgpu: [powerplay]
[ 378.518406] amdgpu: [powerplay]
[ 379.060098] amdgpu: [powerplay]
[ 379.556880] amdgpu: [powerplay]
[ 380.075217] amdgpu: [powerp...

|
#9 |
Can you bisect?
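For readers unfamiliar with the request: bisecting means binary-searching the range between a known-good and a known-bad version for the first one that shows the bug. The following is only a minimal sketch of that idea in Python. The function `first_bad`, the predicate `is_bad`, the release list and the "first bad" release used in the demo are all hypothetical and chosen for illustration; they are not results from this report. In practice `is_bad` would mean building or installing that kernel and trying to reproduce the freeze.

```python
from typing import Callable, List, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad() is true.

    Assumes versions are ordered oldest to newest, versions[0] is known
    good, versions[-1] is known bad, and the bug never disappears again
    once it has been introduced.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return versions[hi]


# Demo with made-up data: pretend the regression first shows up in "4.19".
releases = ["4.14", "4.15", "4.16", "4.17", "4.18", "4.19", "4.20"]
tested: List[str] = []


def is_bad(version: str) -> bool:
    """Hypothetical stand-in for 'boot this kernel and try to reproduce'."""
    tested.append(version)
    return releases.index(version) >= releases.index("4.19")


result = first_bad(releases, is_bad)
print(result, "found after testing", tested)
```

The point of the approach is that the number of kernels to test grows with the logarithm of the range, so even a long range of releases needs only a handful of test boots.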

|
#10 |
I'm having a very similar issue, running Linux Mint 19.1. The issue has persisted from at least 4.15, I'm currently running 5.0.1 and the issue remains.
Here is the latest syslog of the error:
[37258.615599] gmc_v9_
[37258.615608] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615615] amdgpu 0000:06:00.0: in page starting at address 0x0000800107805000 from 27
[37258.615619] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615629] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615633] amdgpu 0000:06:00.0: in page starting at address 0x0000800107807000 from 27
[37258.615636] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615645] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615648] amdgpu 0000:06:00.0: in page starting at address 0x0000800107801000 from 27
[37258.615651] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615660] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615663] amdgpu 0000:06:00.0: in page starting at address 0x0000800107803000 from 27
[37258.615666] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615675] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615678] amdgpu 0000:06:00.0: in page starting at address 0x0000800107809000 from 27
[37258.615681] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615689] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615692] amdgpu 0000:06:00.0: in page starting at address 0x000080010780b000 from 27
[37258.615695] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615704] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615707] amdgpu 0000:06:00.0: in page starting at address 0x0000800107805000 from 27
[37258.615710] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615740] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615743] amdgpu 0000:06:00.0: in page starting at address 0x0000800107807000 from 27
[37258.615746] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615756] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615759] amdgpu 0000:06:00.0: in page starting at address 0x0000800107801000 from 27
[37258.615762] amdgpu 0000:06:00.0: VM_L2_PROTECTIO
[37258.615771] amdgpu 0000:06:00.0: [gfxhub] VMC page fau...

|
#11 |
tried linux-amd-
Apr 01 21:01:03 kernel: amdgpu 0000:03:00.0: [drm:amdgpu_
Apr 01 21:01:03 kernel: [drm:amdgpu_
Apr 01 21:01:03 kernel: [drm:amdgpu_
Apr 01 20:26:59 kernel: [drm] amdgpu kernel modesetting enabled.
Apr 01 20:26:59 kernel: vga_switcheroo: detected switching method \_SB_.PCI0.
Apr 01 20:26:59 kernel: [drm] initializing kernel modesetting (CARRIZO 0x1002:0x9874 0x1025:0x1201 0xCA).
Apr 01 20:26:59 kernel: [drm] register mmio base: 0xD1500000
Apr 01 20:26:59 kernel: [drm] register mmio size: 262144
Apr 01 20:26:59 kernel: [drm] add ip block number 0 <vi_common>
Apr 01 20:26:59 kernel: [drm] add ip block number 1 <gmc_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 2 <cz_ih>
Apr 01 20:26:59 kernel: [drm] add ip block number 3 <gfx_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 4 <sdma_v3_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 5 <powerplay>
Apr 01 20:26:59 kernel: [drm] add ip block number 6 <dm>
Apr 01 20:26:59 kernel: [drm] add ip block number 7 <uvd_v6_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 8 <vce_v3_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 9 <acp_ip>
Apr 01 20:26:59 kernel: [drm] UVD is enabled in physical mode
Apr 01 20:26:59 kernel: [drm] VCE enabled in physical mode
Apr 01 20:26:59 kernel: ATOM BIOS: 113-C91400-007
Apr 01 20:26:59 kernel: [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
Apr 01 20:26:59 kernel: [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
Apr 01 20:26:59 kernel: amdgpu 0000:00:01.0: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
Apr 01 20:26:59 kernel: amdgpu 0000:00:01.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Apr 01 20:26:59 kernel: [drm] Detected VRAM RAM=512M, BAR=512M
Apr 01 20:26:59 kernel: [drm] RAM width 64bits UNKNOWN
Apr 01 20:26:59 kernel: [TTM] Zone kernel: Available graphics memory: 3804974 KiB
Apr 01 20:26:59 kernel: [TTM] Zone dma32: Available graphics memory: 2097152 KiB
Apr 01 20:26:59 kernel: [TTM] Initializing pool allocator
Apr 01 20:26:59 kernel: [TTM] Initializing DMA pool allocator
Apr 01 20:26:59 kernel: [drm] amdgpu: 512M of VRAM memory ready
Apr 01 20:26:59 kernel: [drm] amdgpu: 3072M of GTT memory ready.
Apr 01 20:26:59 kernel: [drm] GART: num cpu pages 262144, num gpu pages 262144
Apr 01 20:26:59 kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F4007E9
Apr 01 20:26:59 kernel: [drm] Found UVD firmware Version: 1.91 Family ID: 11
Apr 01 20:26:59 kernel: [drm] UVD ENC is disabled
Apr 01 20:26:59 kernel: [drm] Found VCE firmware Version: 52.4 Binary ID: 3
Apr 01 20:26:59 kernel: smu version 27.17.00
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: values for Engine clock
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: 30000...

|
#12 |
(In reply to Alex Deucher from comment #4)
> Can you bisect?
Unfortunately this is not possible, as all recent kernels now ship with Display Core enabled by default. As I said, the 4.14 vanilla kernel works like a charm on the same HW and with the same Mesa libs: no lags, stalls or freezes, and no warnings like those listed above. So it makes no sense to run "git bisect", as it is not a single commit that works incorrectly with the GPU. DC is completely new functionality that replaces the old amdgpu code.

|
#13 |
Hi, I have a very similar problem. My system works with 4.15 and with 5.1.16, but not with other 5.x kernels:
The system does not boot with the other 5.x kernels. With 5.1.16 the GUI sometimes freezes, but sshd and the mouse still work.
CPU: Ryzen 5 2400g, BOARD: AORUS B450 I PRO WIFI, X Server 1.19.6
Kernel 5.0.x is not working (blank screen after boot)
Kernel 5.2.x (x <= 9) is not working (blank screen after boot)
but Kernel 5.1.16 is working (mostly)!
Error LOG with 5.1.16:
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: VM_L2_PROTECTIO
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1848 thread Xorg:cs0 pid 1849)
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: in page starting at address 0x000080010c205000 from 27
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: VM_L2_PROTECTIO
[Mi Aug 14 14:22:31 2019] [drm:amdgpu_
[Mi Aug 14 14:22:31 2019] [drm:amdgpu_
[Mi Aug 14 14:22:31 2019] [drm] GPU recovery disabled.

|
#14 |
Just got something similar while playing Left 4 Dead. The system simply froze with altered colors on the screen and the sound just looping over the last second or so. Cannot confirm SSH access.
journalctl -b -1 ends with
[drm:gfx_
[drm:amdgpu_
[drm:amdgpu_
OS: Ubuntu 19.04 on
Kernel: 5.0.0-27-generic
GPU: Radeon RX580
CPU: Ryzen 5 1600x
Thanks!

|
#15 |
(In reply to Ungureanu Alexandru from comment #9)
> Just got something similar while playing Left 4 Dead. The system simply
> froze with altered colors on the screen and the sound just looping over the
> last second or so. Cannot confirm SSH access.
> Kernel: 5.0.0-27-generic
> GPU: Radeon RX580
> CPU: Ryzen 5 1600x
5.0 is a very outdated kernel; use the latest from kernel.org.
For me everything works perfectly in 5.3 (Polaris chip, RX540).
I finally no longer get any errors like these:
- ERROR* resume of IP block <uvd_v6_0> failed -110
- [drm] Fence fallback timer expired on ring sdma0
- last message was failed ret is **
- [drm:amdgpu_
- IP block:sdma_v3_0 is hung!
- Timeout, but no hardware hang detected.
Tested on youtube with HW accelerated video and in several games
Many thanks to the guys from AMD; I had to wait over a year to get these bugs fixed.

|
#16 |
Same problem here. It happens when I run looking-glass [1], but not every time. I tried downgrading my kernel from 5.3.1 to 5.2.11 (I'm pretty sure it worked then), downgrading mesa from 19.2.0 to 19.1.7 (I'm sure it worked with 19.2.0-rc) and downgrading my firmware to 2019-09-23 (oldest in repo).
When it happens, looking-glass starts blinking and sometimes my other monitor gets stuck so that I can only move the cursor on it.
Spec:
Gentoo ~amd64
Ryzen 1600 (others have Ryzen too, coincidence?)
Linux GPU: R7 240 (with radeon driver)
Windows GPU: RX580
ASRock X370 Gaming X

|
#17 |
Hi,
I think I have the same bug and opened https:/
At first it looked a bit different, because in newer kernels the error message has changed. But as you can see I did some testing and this seems to go way back. Sadly I couldn't test a 4.18 kernel.
Can somebody mark my report as duplicate? Because I think it is.
And would some more debug info help?

|
#18 |
*** Bug 204683 has been marked as a duplicate of this bug. ***

|
#19 |
Also experiencing this with Radeon RX 5700 XT and amdgpu 19.1.0+
Didn't have any heavy load for the GPU to do.
First, some artifacts appeared on the Plasma Hard Disk Monitor widget and the CPU Load widget (here is a screenshot: https:/
I checked the logs for the period when this could've happened, but the only logs from that period are from KScreen that start like this:
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.
Oct 24 16:34:58 perk11-home org.kde.KScreen...

|
#20 |
My kernel version is 5.3.7-050307-

|
#21 |
Created attachment 285665
5 second video clip that triggers a crash
Hi,
I think I'm having the same problem as you guys. I run a mythbackend where I record cable television and those recordings often crash my system when hardware decoding is enabled. Usually it's just the screen that freezes and I can still ssh to it.
Kernel 5.1.6 was an exception for me too, with that kernel I'm able to restart the display manager and recover without having to reboot.
Attached is a short video that crashes my system. I can trigger the alert by running:
mpv --vo=vaapi out.ts
I'm wondering if it crashes your systems too and if it's related.

|
#22 |
(In reply to shallowaloe from comment #16)
> Created attachment 285665 [details]
> 5 second video clip that triggers a crash
>
> Hi,
>
> I think I'm having the same problem as you guys. I run a mythbackend where
> I record cable television and those recordings often crash my system when
> hardware decoding is enabled. Usually it's just the screen that freezes and
> I can still ssh to it.
>
> Kernel 5.1.6 was an exception for me too, with that kernel I'm able to
> restart the display manager and recover without having to reboot.
>
> Attached is a short video that crashes my system. I can trigger the alert
> by running:
>
> mpv --vo=vaapi out.ts
>
> I'm wondering if it crashes your systems too and if it's related.
Just to add a data point, I tried running `mpv --vo=vaapi out.ts` against your file, and while it crashed the application, it did not freeze the system.
My hardware is a Ryzen 3700X with a Radeon RX 5700, running Ubuntu 19.10 with default kernel (5.3.0-19-generic).
The command did result in the following lines in /var/log/syslog repeated every 5 seconds:
Nov 10 07:04:23 redacted kernel: [ 2266.802162] gmc_v10_
Nov 10 07:04:23 redacted kernel: [ 2266.802166] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802170] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802171] amdgpu 0000:0b:00.0: VM_L2_PROTECTIO
Nov 10 07:04:23 redacted kernel: [ 2266.802176] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802178] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802179] amdgpu 0000:0b:00.0: VM_L2_PROTECTIO
Nov 10 07:04:23 redacted kernel: [ 2266.802566] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802568] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802569] amdgpu 0000:0b:00.0: VM_L2_PROTECTIO
Nov 10 07:04:23 redacted kernel: [ 2266.802573] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802575] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802576] amdgpu 0000:0b:00.0: VM_L2_PROTECTIO
Nov 10 07:04:23 redacted kernel: [ 2266.802984] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802985] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802987] amdgpu 0000:0b:00.0: VM_L2_PROTECTIO
Nov 10 07:04:23 redacted kernel: [ 2266.802993] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802994] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802995] amdg...

|
#23 |
Hi,
I recently built a 5.4.0-rc7 from drm-next (my HEAD was 17eee668b3cad42
Since then I didn't get any crashes. I have tested this for a few hours now, but it's entirely possible that I just didn't run into the bug for some reason, although it usually appeared after half an hour.
If possible please try this setup and see if it is fixed.

|
#24 |
Hi,
This issue is still present in the latest kernels:
5.4.1, 5.4, 5.3.14
Last usable kernel for me is 4.20.17
System Specs
- Gigabyte b450-ds3h
- Ryzen 5 3400G (with RX Vega 11)
- Mesa 19.1.2 - padoka PPA (Stable)
- Ubuntu 18.04.3 LTS

|
#25 |
Dear j.cordoba,
could you try building 5.4.0-rc7 from drm-next and testing it, as I mentioned in Comment 18?
I'm running on this for some time now and the bug should have appeared by now, so I'm getting more confident that it is fixed.
Best regards
Matthias

|
#26 |
Same is happening to me on 5.4.1. No issue with 4.9.
[ 44.172714] [drm:amdgpu_
[ 49.292694] [drm:amdgpu_
[ 58.469316] [drm:amdgpu_
[ 63.586055] [drm:amdgpu_
[ 156.606591] [drm:amdgpu_
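The spacing between these timeouts can be read off the dmesg timestamps. A minimal Python sketch, assuming each line starts with the usual `[ seconds.micros]` dmesg prefix; the helper names `timestamps` and `gaps` are made up for illustration, and the truncated message text is kept exactly as pasted above.

```python
import re

# dmesg prefix such as "[   44.172714]": seconds since boot.
TS = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamps(lines):
    """Extract the leading dmesg timestamp (in seconds) from each line."""
    found = []
    for line in lines:
        match = TS.match(line.strip())
        if match:
            found.append(float(match.group(1)))
    return found


def gaps(stamps):
    """Seconds elapsed between consecutive timestamps."""
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]


# The five lines pasted in the comment above (message text is truncated).
log = """[ 44.172714] [drm:amdgpu_
[ 49.292694] [drm:amdgpu_
[ 58.469316] [drm:amdgpu_
[ 63.586055] [drm:amdgpu_
[ 156.606591] [drm:amdgpu_""".splitlines()

stamps = timestamps(log)
intervals = gaps(stamps)
print(intervals)
```

For the lines above, the shortest intervals are a little over 5 seconds, which is similar to the roughly 5 second spacing visible in other logs in this thread.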

|
#27 |
(In reply to shallowaloe from comment #16)
> Created attachment 285665 [details]
> 5 second video clip that triggers a crash
>
> Hi,
>
> I think I'm having the same problem as you guys. I run a mythbackend where
> I record cable television and those recordings often crash my system when
> hardware decoding is enabled. Usually it's just the screen that freezes and
> I can still ssh to it.
>
> Kernel 5.1.6 was an exception for me too, with that kernel I'm able to
> restart the display manager and recover without having to reboot.
>
> Attached is a short video that crashes my system. I can trigger the alert
> by running:
>
> mpv --vo=vaapi out.ts
>
> I'm wondering if it crashes your systems too and if it's related.
This one is probably a Mesa issue, see https:/
What Mesa version are you using?

|
#28 |
Created attachment 286227
attachment-
Thanks for the link to the bug. I'm running an ubuntu based system and am
using the oibaf ppa. The current version is 20.0.

|
#29 |
Hi everyone,
I have the same issue with a Fiji Nano GPU: UVD6 and VCE3 timeout in ring buffer test @ boot with the AMDGPU driver. Other rings seem to work correctly.
To make sure the hardware functions like it should, and it's not a HW error, where (in the amdgpu driver) can I increase the timeout value?

|
#30 |
Created attachment 286575
kernel config 5.4.7 Fiji
Some additional info for my case:
- Running kernel 5.4.7 (vanilla), firmware 20191108 on gentoo
- dmesg | grep -E "(drm)|(amdgpu)":
[ 3.930023] [drm] amdgpu kernel modesetting enabled.
[ 3.930217] amdgpu 0000:0a:00.0: remove_
[ 3.930219] amdgpu 0000:0a:00.0: remove_
[ 3.930221] amdgpu 0000:0a:00.0: remove_
[ 3.930224] fb0: switching to amdgpudrmfb from EFI VGA
[ 3.930475] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xCA).
[ 3.930486] [drm] register mmio base: 0xFCE00000
[ 3.930486] [drm] register mmio size: 262144
[ 3.930495] [drm] add ip block number 0 <vi_common>
[ 3.930495] [drm] add ip block number 1 <gmc_v8_0>
[ 3.930496] [drm] add ip block number 2 <tonga_ih>
[ 3.930497] [drm] add ip block number 3 <gfx_v8_0>
[ 3.930498] [drm] add ip block number 4 <sdma_v3_0>
[ 3.930498] [drm] add ip block number 5 <powerplay>
[ 3.930499] [drm] add ip block number 6 <dm>
[ 3.930500] [drm] add ip block number 7 <uvd_v6_0>
[ 3.930500] [drm] add ip block number 8 <vce_v3_0>
[ 3.930715] [drm] UVD is enabled in physical mode
[ 3.930715] [drm] VCE enabled in physical mode
[ 3.930743] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 3.930751] amdgpu 0000:0a:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[ 3.930753] amdgpu 0000:0a:00.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[ 3.930758] [drm] Detected VRAM RAM=4096M, BAR=256M
[ 3.930759] [drm] RAM width 512bits HBM
[ 3.930838] [drm] amdgpu: 4096M of VRAM memory ready
[ 3.930841] [drm] amdgpu: 4096M of GTT memory ready.
[ 3.930860] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 3.930928] [drm] PCIE GART of 1024M enabled (table at 0x000000F4001D5
[ 3.934174] [drm] Chained IB support enabled!
[ 3.940198] amdgpu: [powerplay] hwmgr_sw_init smu backed is fiji_smu
[ 3.941748] [drm] Found UVD firmware Version: 1.91 Family ID: 12
[ 3.941752] [drm] UVD ENC is disabled
[ 3.943542] [drm] Found VCE firmware Version: 55.2 Binary ID: 3
[ 4.009146] [drm] dce110_
[ 4.040084] [drm] Display Core initialized with v3.2.48!
[ 4.040542] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 4.040543] [drm] Driver supports precise vblank timestamp query.
[ 4.067774] [drm] UVD initialized successfully.
[ 4.168780] [drm] VCE initialized successfully.
[ 4.170163] [drm] Cannot find any crtc or sizes
[ 4.171948] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:0a:00.0 on minor 0
[ 7.280062] amdgpu 0000:0a:00.0: [drm:amdgpu_
[ 8.400365] amdgpu 0000:0a:00.0: [drm:amdgpu_
[ 8.400370] [drm:process_

|
#31 |
Hello, I have the same problem on a Huawei Matebook D laptop; the processor is an AMD Ryzen 5 with an integrated Radeon Vega Mobile GPU.
I use Fedora 31. The problem appeared when upgrading from the 5.3.16 kernel to the 5.4.6 kernel. Reverting to 5.3.16 solved the issue.
At some moments the UI (XFCE) freezes for about 5 seconds; I can move the mouse cursor but I can't get any keyboard input (not in X, not by switching console). Each time the freeze occurs dmesg shows the messages
[ 45.530374] [drm:amdgpu_
[ 50.139408] [drm:amdgpu_
I include /proc/cpuinfo and lspci outputs.

|
#32 |
Created attachment 286899
/proc/cpuinfo

|
#33 |
Created attachment 286901
lspci output

|
#34 |
Hi. This bug is already reported here by me https:/
If possible try a 5.5-rc kernel and see if it's fixed there. It's fixed - at least for me - in the drm-tree.
Best regards
Matthias

|
#35 |
I'm seeing the same issue on Ubuntu 18.04 with
Upstream PPA "sudo add-apt-repository ppa:oibaf/
[ 321.412530] [drm:amdgpu_
[ 326.286306] [drm:amdgpu_
[ 326.286395] [drm:amdgpu_
AMDGPUPRO driver 19.50-967956
[20913.330563] [drm:amdgpu_
[20918.450513] [drm:amdgpu_
[20923.570306] [drm:amdgpu_
[20928.690699] [drm:amdgpu_

|
#36 |
Hi,
for me this bug is fixed with a 5.5 kernel. And I'm wondering if this is fixed for all of you, too.
Best
Matthias

|
#37 |
I agree. Fixed for me too

|
#38 |
I still see them on 5.6.13:
[191571.372560] sd 11:0:0:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
[205796.424607] [drm:amdgpu_
[205796.424637] [drm:amdgpu_
[205796.424640] amdgpu 0000:0a:00.0: GPU reset begin!
[205800.840504] [drm:amdgpu_
[205800.937565] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[205800.938060] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900
[205800.938849] [drm] PSP is resuming...
[205800.958729] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
[205800.972414] [drm] psp command (0x5) failed and response status is (0xFFFF0007)
[205801.176411] amdgpu 0000:0a:00.0: RAS: ras ta ucode is not available
[205801.460775] [drm] kiq ring mec 2 pipe 1 q 0
[205801.460986] amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0002 address=0x800002300 flags=0x0000]
[205801.516698] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[205801.516709] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[205801.516713] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[205801.516717] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[205801.516720] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[205801.516724] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[205801.516727] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[205801.516730] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[205801.516733] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[205801.516736] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[205801.516740] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[205801.516743] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[205801.516746] amdgpu 0000:0a:00.0: ring vcn_dec uses VM inv eng 1 on hub 1
[205801.516749] amdgpu 0000:0a:00.0: ring vcn_enc0 uses VM inv eng 4 on hub 1
[205801.516752] amdgpu 0000:0a:00.0: ring vcn_enc1 uses VM inv eng 5 on hub 1
[205801.516755] amdgpu 0000:0a:00.0: ring jpeg_dec uses VM inv eng 6 on hub 1
[205801.525996] [drm] recover vram bo from shadow start
[205801.525998] [drm] recover vram bo from shadow done
[205801.526008] [drm] Skip scheduling IBs!
[205801.526051] amdgpu 0000:0a:00.0: GPU reset(1) succeeded!
[205802.536444] [drm:amdgpu_
[205802.536523] [drm:amdgpu_
[205802.536531] amdgpu 0000:0a:00.0: GPU reset begin!
[205806.728558] [drm:amdgpu_
[205806.821326] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[205806.821578] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900
[205806.821899] [drm] PSP is...
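To see how long each reset takes, the "GPU reset begin!" lines can be paired with the following "GPU reset(N) succeeded!" lines. A minimal Python sketch under the assumption that the log uses exactly these two message texts, as in the excerpt above; the function name `reset_durations` is hypothetical, and only lines copied from the excerpt are used as input.

```python
import re

TS = re.compile(r"^\[\s*(\d+\.\d+)\]")            # dmesg timestamp prefix
BEGIN = "GPU reset begin!"                          # start of a reset
DONE = re.compile(r"GPU reset\(\d+\) succeeded!")  # final success message


def reset_durations(lines):
    """Return (durations, pending).

    durations: seconds from each 'GPU reset begin!' to the next
    'GPU reset(N) succeeded!'. pending is True if the log ends while a
    reset is still in progress (for example because it was cut off).
    """
    durations = []
    started = None
    for line in lines:
        match = TS.match(line)
        if not match:
            continue
        stamp = float(match.group(1))
        if BEGIN in line:
            started = stamp
        elif DONE.search(line) and started is not None:
            durations.append(stamp - started)
            started = None
    return durations, started is not None


# Lines copied from the log excerpt above.
log = [
    "[205796.424640] amdgpu 0000:0a:00.0: GPU reset begin!",
    "[205800.937565] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume",
    "[205801.526051] amdgpu 0000:0a:00.0: GPU reset(1) succeeded!",
    "[205802.536531] amdgpu 0000:0a:00.0: GPU reset begin!",
    "[205806.821326] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume",
]

durations, pending = reset_durations(log)
print(durations, pending)
```

In the excerpt the first reset completes about 5 seconds after it begins and a second one starts roughly a second later; the pasted log is cut off before that second reset finishes, which the sketch reports through `pending`.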

|
#39 |
The problem still exists with Linux Kernel 5.8-rc1 from git. (My graphics card is Radeon 5600XT)
[20581.087159] [drm:amdgpu_
[20581.087212] [drm:amdgpu_
[20581.087217] amdgpu 0000:29:00.0: amdgpu: GPU reset begin!
[20583.381257] [drm:amdgpu_
[20585.087232] amdgpu 0000:29:00.0: amdgpu: failed to suspend display audio
[20585.156036] snd_hda_codec_hdmi hdaudioC0D0: HDMI: ELD buf size is 0, force 128
[20585.156052] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 0
[20585.463157] amdgpu 0000:29:00.0: [drm:amdgpu_
[20585.463205] [drm:gfx_
[20585.694999] amdgpu 0000:29:00.0: [drm:amdgpu_
[20585.695047] [drm:gfx_
[20585.926951] [drm:gfx_
[20588.045497] amdgpu 0000:29:00.0: amdgpu: GPU reset succeeded, trying to resume
[20588.045605] [drm] PCIE GART of 512M enabled (table at 0x0000008000E10
[20588.045682] [drm] VRAM is lost due to GPU reset!
[20588.048023] [drm] PSP is resuming...
[20588.218089] [drm] reserve 0x900000 from 0x817e400000 for PSP TMR
[20588.287093] amdgpu 0000:29:00.0: amdgpu: RAS: optional ras ta ucode is not available
[20588.293101] amdgpu: SMU is resuming...
[20588.295088] amdgpu: SMU is resumed successfully!
[20588.413155] [drm] kiq ring mec 2 pipe 1 q 0
[20588.417493] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[20588.417632] [drm] JPEG decode initialized successfully.
[20588.417690] amdgpu 0000:29:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[20588.417693] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[20588.417697] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[20588.417700] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[20588.417703] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[20588.417707] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[20588.417709] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[20588.417713] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[20588.417716] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[20588.417719] amdgpu 0000:29:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[20588.417721] amdgpu 0000:29:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[20588.417724] amdgpu 0000:29:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[20588.417726] amdgpu 0000:29:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[20588.417728] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[20588.417730] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on h...

|
#40 |
I've been getting "ring gfx timeouts" for some time, mostly when the computer has not had any input for a while (while I'm away from it). When it freezes I can SSH into it, but when I run "shutdown -h now" it kicks me out of SSH as it should, yet the computer never seems to actually shut down. The screen stays frozen with whatever was on the display when it froze. Any help would be greatly appreciated; here is my info:
Mobo: AsRock AB350 Pro4 UEFI: 5.80
Video card: Sapphire Nitro+ RX580 (8GB)
Distro: Manjaro
Kernel: 5.7.9-1-MANJARO
Aug 09 21:33:06.054857 kernel: pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
Aug 09 21:33:06.068305 kernel: pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=
Aug 09 21:33:06.068636 kernel: pcieport 0000:00:03.1: AER: device [1022:1453] error status/
Aug 09 21:33:06.068863 kernel: pcieport 0000:00:03.1: AER: [21] ACSViol (First)
Aug 09 21:33:06.069137 kernel: amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
Aug 09 21:33:06.069421 kernel: snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
Aug 09 21:33:06.069633 kernel: pcieport 0000:00:03.1: AER: device recovery failed
Aug 09 21:33:16.258283 kernel: [drm:amdgpu_
Aug 09 21:33:16.258412 kernel: [drm:amdgpu_
Aug 09 21:33:16.258446 kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Aug 09 21:33:16.258741 kernel: [drm:amdgpu_
Aug 09 21:33:16.258773 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.258803 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.258835 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.258869 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.258896 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.258925 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.258951 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.258977 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.259009 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.259035 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.259060 kernel: amdgpu: [powerplay]
Aug 09 21:33:16.259084 kernel: amdgpu: [powerplay]

|
#75 |
My Ubuntu 20.04 desktop is crashing several times per day due to this bug since I've upgraded my computer from an old Intel Xeon to an AMD Ryzen 9 5900X on a B550 mainboard. I've had the same AMD RX Vega 56 graphics card in both computers, so I assume this is probably more related to the mainboard/CPU than to the graphics card.
The crashes from today:
```
martin@martin ~ % grep amdgpu /var/log/syslog | grep ERROR | grep -v 'Failed to initialize parser'
Jun 11 03:15:33 martin kernel: [21494.642889] [drm:amdgpu_
Jun 11 03:15:33 martin kernel: [21494.643055] [drm:amdgpu_
Jun 11 03:15:50 martin kernel: [21511.795007] [drm:amdgpu_
Jun 11 03:15:50 martin kernel: [21511.795174] [drm:amdgpu_
Jun 11 15:56:07 martin kernel: [ 1477.069969] [drm:amdgpu_
Jun 11 15:56:07 martin kernel: [ 1477.070140] [drm:amdgpu_
Jun 11 15:56:22 martin kernel: [ 1492.174077] [drm:amdgpu_
Jun 11 15:56:22 martin kernel: [ 1492.174248] [drm:amdgpu_
Jun 11 16:03:28 martin kernel: [ 1918.161101] [drm:amdgpu_
Jun 11 16:03:28 martin kernel: [ 1918.161271] [drm:amdgpu_
Jun 11 16:03:49 martin kernel: [ 1938.385307] [drm:amdgpu_
Jun 11 16:03:49 martin kernel: [ 1938.385479] [drm:amdgpu_
Jun 11 23:28:12 martin kernel: [25491.854294] [drm:amdgpu_
Jun 11 23:28:12 martin kernel: [25491.854460] [drm:amdgpu_
Jun 11 23:28:28 martin kernel: [25507.982446] [drm:amdgpu_
Jun 11 23:28:28 martin kernel: [25507.982613] [drm:amdgpu_
Jun 11 23:29:51 martin kernel: [25591.333483] amdgpu 0000:2d:00.0: amdgpu: WALKER_ERROR: 0x0
Jun 11 23:29:51 martin kernel: [25591.333485] amdgpu 0000:2d:00.0: amdgpu: MAPPING_ERROR: 0x0
Jun 11 23:30:01 martin kernel: [25601.412838] [drm:amdgpu_
```

|
#76 |
(In reply to Martin von Wittich from comment #70)
> My Ubuntu 20.04 desktop is crashing several times per day due to this bug
> since I've upgraded my computer from an old Intel Xeon to an AMD Ryzen 9
> 5900X on a B550 mainboard. I've had the same AMD RX Vega 56 graphics card in
> both computers, so I assume this is probably more related to the
> mainboard/CPU than to the graphics card.
>
> The crashes from today:
>
> ```
> martin@martin ~ % grep amdgpu /var/log/syslog | grep ERROR | grep -v 'Failed
> to initialize parser'
> Jun 11 03:15:33 martin kernel: [21494.642889] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750601, emitted seq=1750603
> Jun 11 03:15:33 martin kernel: [21494.643055] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread
> firefox:cs0 pid 5123
> Jun 11 03:15:50 martin kernel: [21511.795007] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750605, emitted seq=1750608
> Jun 11 03:15:50 martin kernel: [21511.795174] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread
> firefox:cs0 pid 5123
> Jun 11 15:56:07 martin kernel: [ 1477.069969] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216293, emitted seq=216295
> Jun 11 15:56:07 martin kernel: [ 1477.070140] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process firefox pid 5237 thread
> firefox:cs0 pid 5302
> Jun 11 15:56:22 martin kernel: [ 1492.174077] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216297, emitted seq=216300
> Jun 11 15:56:22 martin kernel: [ 1492.174248] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
> Jun 11 16:03:28 martin kernel: [ 1918.161101] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264406, emitted seq=264408
> Jun 11 16:03:28 martin kernel: [ 1918.161271] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread
> firefox:cs0 pid 10633
> Jun 11 16:03:49 martin kernel: [ 1938.385307] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264410, emitted seq=264413
> Jun 11 16:03:49 martin kernel: [ 1938.385479] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread
> firefox:cs0 pid 10633
> Jun 11 23:28:12 martin kernel: [25491.854294] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390985, emitted seq=2390987
> Jun 11 23:28:12 martin kernel: [25491.854460] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process firefox pid 4922 thread
> firefox:cs0 pid 4989
> Jun 11 23:28:28 martin kernel: [25507.982446] [drm:amdgpu_
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390989, emitted seq=2390992
> Jun 11 23:28:28 martin kernel: [25507.982613] [drm:amdgpu_
> [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
> Jun 11 23:29:51 martin kernel: [25591.333483] amdgpu 0000:2d:00.0: amdgpu:
> WALKER_ERROR: 0x0
> Jun 11 23:29:51 martin kernel: [25591.333485] am...
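The timeout messages quoted above carry two counters, "signaled seq" and "emitted seq". As a rough sketch (assuming the difference between the two is the number of submissions the ring had accepted but not yet completed when the timeout fired, which is my reading and not something stated in this thread), they can be pulled out of a log line like this. The function name and the returned keys are made up for illustration.

```python
import re

# Matches the message text as it appears in the quoted log lines, e.g.
# "ring gfx timeout, signaled seq=1750601, emitted seq=1750603"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeout(message):
    """Return ring name and sequence counters, or None if the line does not match."""
    match = TIMEOUT_RE.search(message)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Assumption: emitted - signaled = submissions still outstanding.
        "pending": emitted - signaled,
    }


# Message text copied from the first quoted timeout line.
sample = "[amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750601, emitted seq=1750603"
print(parse_timeout(sample))
```

For the lines in this comment the gap is always 2 or 3, which would fit a small number of stuck submissions rather than a long backlog.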

|
#77 |
I can confirm that adding "amdgpu.dpm=0" to the kernel command line seems to resolve this issue - I enabled that option on 2022-06-12 13:24, and my system didn't crash at all on 2022-06-12 - 2022-06-14 (I was on vacation from 2022-06-15 on and didn't use my computer from then on).
I don't use Linux for gaming and therefore can't comment how badly this affects gaming performance, but I did notice mpv could no longer play 1080p x264 video without stuttering when it defaults to --vo=gpu. Using another --vo like sdl seems to be a viable workaround.
> Did you try with the latest Linux Kernel? I had a lot of gpu lockups like this. Also try these kernel parameters : "amdgpu.
I'll try these next.
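For anyone comparing setups, a quick way to double-check which amdgpu options actually made it onto the kernel command line is to split it into tokens. This is only a sketch: the function name is made up, and the sample command line below is hypothetical (on a running system the real one is in /proc/cmdline).

```python
def amdgpu_params(cmdline):
    """Return the amdgpu.* options found in a kernel command line string."""
    params = {}
    for token in cmdline.split():
        # Only "amdgpu.<name>=<value>" tokens are of interest here.
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token.split("=", 1)
            params[key[len("amdgpu."):]] = value
    return params


# Hypothetical command line for illustration only.
example_cmdline = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet amdgpu.dpm=0"
print(amdgpu_params(example_cmdline))
```

Passing the contents of /proc/cmdline instead of the example string shows what the running kernel was booted with.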

|
#78 |
Sorry, forgot to mention in my last post and now can't edit: interestingly enough, the attached video "5 second video clip that triggers a crash" still successfully triggers the crash.
Seems to me like the root issue isn't actually in the dynamic power management code, but somewhere else, and the DPM is just one of several things that can trigger it?

|
#79 |
> Did you try with the latest Linux Kernel? I had a lot of gpu lockups like this. Also try these kernel parameters : "amdgpu.
I can confirm that at least on the current Ubuntu linux-image-
```
martin@martin ~ % uname -a
Linux martin 5.14.0-1042-oem #47-Ubuntu SMP Fri Jun 3 18:17:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
martin@martin ~ % cat /proc/cmdline
BOOT_IMAGE=
martin@martin ~ % dmesg -T | grep 'ring gfx timeout'
[Mi Jun 22 14:48:07 2022] [drm:amdgpu_
[Mi Jun 22 14:48:18 2022] [drm:amdgpu_
```
I had enabled these options on 2022-06-20 14:14 UTC+2, this is the first crash I've encountered since then.
I have no idea how to build the latest kernel and therefore haven't tested that yet.
I'll now revert back to amdgpu.dpm=0.

|
#80 |
> Did you try with the latest Linux Kernel? I had a lot of gpu lockups like
> this. Also try these kernel parameters : "amdgpu.
> amdgpu.noretry=0 amdgpu.
> amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt"" ( you might also
> try with amdgpu.
I tried.
my kernel:
"Linux 5.17.4-1-default #1 SMP PREEMPT Wed Apr 20 07:43:03 UTC 2022 (75e9961) x86_64 x86_64 x86_64 GNU/Linux"
(The video linked above was not able to freeze the integrated AMD GPU for me; I mean earlier, when I tested with no kernel parameters.)
The result is surprising: no crash or freeze for 4+ hours so far, even though I launched lots of apps that used to cause freezes for me.
As I described above - https:/
Full kernel boot option now: "splash=silent quiet amdgpu.
Now, after booting with these options, I see:
Just after boot everything works (OpenGL/Vulkan acceleration on the integrated GPU) with the expected performance.
After trying to trigger the bug (opening multiple OpenGL apps along with Vulkan and WebGL, and playing many videos), OpenGL and Vulkan drop to 20 FPS (constant, even for a single fullscreen triangle) and WebGL2 no longer works in the web browser (even after a browser restart), but video still plays at 60 fps with no lag, and the system UI does not lag either.
So it looks like GPU graphics acceleration just drops to a very low performance mode, while everything else works fine. (Launching native graphics apps on the Nvidia GPU also works at 60 fps as expected.)
Interestingly, since the FPS dropped to 20 I can no longer launch anything in Wine (any version, including Proton), although it worked right after boot. I launched a few apps after boot and checked them again once the GPU FPS had dropped; Wine always crashes with:
"wine: Unhandled page fault on execute access to 00007F894E200460 at address 00007F894E200460 (thread 0070), starting debugger..."
(not being able to use Wine is a big disadvantage)

|
#81 |
Wine problem - this happened because (how/why/when) '/usr/share/
so fix for wine gonna be - "VK_ICD_
super weird, so wine problem fixed I think

|
#82 |
but even creating nvidia_icd.json
{
"file_
"ICD": {
}
}
does not help Wine; it still crashes with the same error when trying to use/initialize Nvidia.
But I can use Nvidia outside of Wine from native apps (and Vulkan works), so it must be related to the AMD GPU driver somehow. (This did not happen before; it is the first time I have seen Wine crash this way, compared with previous times when I tested the integrated AMD GPU.)
P.S. I have a second PC with the same AMD Vega 8 integrated GPU, and there it works fine (it never crashed or froze even once). The other PC has a different motherboard, which is why I originally thought this was a motherboard problem, but the current boot options help make the integrated GPU stable on this PC.

|
#83 |
(I made a small mistake in organizing my files; creating nvidia_icd.json with the content listed above is enough to fix Wine for me. Everything works now.)

|
#84 |
Updated to kernel 5.18.4-1-default #1 SMP PREEMPT_DYNAMIC Wed Jun 15 06:00:33 UTC 2022 (ed6345d) x86_64 x86_64 x86_64 GNU/Linux (the latest openSUSE kernel for now).
It seems my integrated AMD GPU freeze is completely fixed even without the previous boot options (in 5.17 it froze without them). The integrated GPU also no longer goes into "low performance mode forever" (as it did with the boot options before); it keeps working for hours at max performance, without the earlier slowdown.
... but now the Nvidia GPU no longer works from AMD (when the integrated GPU is the main GPU), with the Nvidia 515.48.07 driver (the latest now), in both X11 and Wayland. The Nvidia driver is correctly installed and the device is visible (nvidia-smi works and vulkaninfo --summary lists the Nvidia GPU correctly), but any application crashes when creating a Vulkan surface on the Nvidia device. (Just tested: with the AMD integrated GPU disabled and booting on Nvidia, everything works there, Vulkan etc.)
So fixing the integrated AMD GPU results in Nvidia no longer working... okay (I'm back to using the discrete Nvidia GPU only).

|
#85 |
Same issue here (with the LTS kernel as well):
Linux archlinux 5.18.7-262-tkg-pds #1 TKG SMP PREEMPT_DYNAMIC Mon, 27 Jun 2022 15:50:06 +0000 x86_64 GNU/Linux
[11090.086287] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11090.086296] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11090.086302] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11090.195133] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11090.195139] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11090.195143] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11090.195150] [drm] Cannot get clockgating state when UVD is powergated.
[11090.195152] [drm] Cannot get clockgating state when VCE is powergated.
[11090.695288] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11090.699331] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11091.194893] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11091.194898] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11091.194901] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11091.194908] [drm] Cannot get clockgating state when UVD is powergated.
[11091.194909] [drm] Cannot get clockgating state when VCE is powergated.
[11091.695473] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11092.194965] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11092.194969] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11092.194973] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11092.194979] [drm] Cannot get clockgating state when UVD is powergated.
[11092.194980] [drm] Cannot get clockgating state when VCE is powergated.
[11092.695749] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11093.195046] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11093.195050] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11093.195053] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11093.195060] [drm] Cannot get clockgating state when UVD is powergated.
[11093.195061] [drm] Cannot get clockgating state when VCE is powergated.
[11093.695004] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11094.195065] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11094.195070] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11094.195074] amdgpu 0000:02:00.0: amdgpu:
last message was failed ret is 65535
[11094.195082] [drm] Cannot get clockgating state when UVD is powergated.
[11094.195083] [drm] Cannot get clockgating state when VCE is powergated.
[11094.695286] amdgpu 0000:02:00.0: amdgpu:
last mess...

|
#86 |
Nvidia released 515.57 drivers that fix "Nvidia being broken when used as second GPU in Linux", my bug above.
The Nvidia GPU works again when the AMD GPU is the main one.

|
#87 |
After using this PC for a few days with the AMD Vega 8 (integrated) as the main GPU, I see no freezes at all. (Back in 2021 it froze every 10-20 minutes, so I had to use Nvidia as the main GPU.)
(It works with and without the kernel boot options listed above.)
I use the openSUSE kernel 5.18.4-1-default (I'm not going to update for some time, because it works).
Maybe it is only fixed for my motherboard+CPU combination. My hardware:
Ryzen3 3200 CPU (Vega8 integrated) on A320M-DVS R4.0 motherboard.
microcode: CPU: patch_level=
microcode: Microcode Update Driver: v2.2.
Wayland and X11 both work, with Nvidia as the second GPU.
Wayland slowdown(to like 1-2FPS whole UI performance) once after few hours of using, but it fixed just by switching to system-
The integrated GPU's performance still drops (randomly, after 2-6 hours of PC use) and never recovers, but that is fine, since I have the Nvidia GPU as a second GPU for complex graphics. Vega 8 performance only drops for complex shaders: FPS falls from 60 in fullscreen (1080p) to 10-20 on complex raymarching shaders, but this is not noticeable in the system UI (Wayland/X11 GNOME 42), and video plays at 60 fps as expected. (Sleep mode also works, not every time (because of Nvidia) but most of the time, same as when I used Nvidia as the main GPU.)

|
#88 |
Log from what I described above - "fixed just by switching to system-
Logs:
Jul 17 22:54:04 home-danil kernel: amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7.
Jul 17 22:54:09 home-danil kernel: amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7.
Jul 17 22:54:12 home-danil kernel: ------------[ cut here ]------------
Jul 17 22:54:12 home-danil kernel: WARNING: CPU: 1 PID: 1100 at drivers/
Jul 17 22:54:12 home-danil kernel: Modules linked in: dm_crypt essiv authenc trusted asn1_encoder tee nvidia_uvm(POE) nvidia_modeset(POE) nvidia(POE) snd_seq_dummy snd_hrtimer snd_seq snd_seq_device af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set iscsi_ibft iscsi_boot_sysfs nfnetlink rfkill ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter qrtr vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dmi_sysfs joydev intel_rapl_msr intel_rapl_common snd_hda_codec_hdmi snd_hda_
Jul 17 22:54:12 home-danil kernel: libphy irqbypass snd soundcore efi_pstore i2c_piix4 gpio_amdpt gpio_generic acpi_cpufreq k10temp tiny_power_button nls_iso8859_1 squashfs nls_cp437 loop ext4 mbcache vfat jbd2 fat fuse configfs ip_tables x_tables hid_generic usbhid uas usb_storage amdgpu crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_dp_helper drm_kms_helper aesni_intel crypto_simd syscopyarea sysfillrect sysimgblt fb_sys_fops cryptd drm cec xhci_pci xhci_pci_renesas sp5100_tco ccp rc_core xhci_hcd usbcore wmi video button btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
Jul 17 22:54:12 home-danil kernel: CPU: 1 PID: 1100 Comm: systemd-logind Tainted: P OE 5.18.4-1-default #1 openSUSE Tumbleweed 59778fa2462c9ee
Jul 17 22:54:12 home-danil kernel: Hardware name: To Be Filled By O.E.M. A320M-DVS R4.0/A320M-DVS R4.0, BIOS P7.10 12/23/2021
Jul 17 22:54:12 home-danil kernel: RIP: 0010:rv1_
Jul 17 22:54:12 home-danil kernel: Code: 62 01 00 e8 8f 4e f5 ff 85 c0 74 d8 83 f8 01 75 19 48 8b 7d 00 5b be 93 62 01 00 48 c7 c2 00 99 cd c0 5d 41 5c e9 6d 4e f5 ff <0f> 0b eb e3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 81 c6 e7 03
Jul 17 22:54:12 home-danil kernel: RSP: 0018:ffff9f0a00
Jul 17 22:54:12 home-danil kernel: RAX: 00007570227d95d8 RBX: 00000000000000...

|
#89 |
This is an AMD driver problem. You can contact me and I'll give you the final solution; email <email address hidden>. Communication may be more efficient within China.

|
#90 |
[67760.805903] [drm:amdgpu_
[67760.806285] [drm:amdgpu_
[67760.806667] amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
[67761.257012] amdgpu 0000:0d:00.0: [drm:amdgpu_
[67761.257232] [drm:gfx_
[67761.307862] [drm:amdgpu_
[67761.516374] [drm:gfx_
[67761.542980] [drm] free PSP TMR buffer
[67761.587266] amdgpu 0000:0d:00.0: amdgpu: MODE1 reset
[67761.587269] amdgpu 0000:0d:00.0: amdgpu: GPU mode1 reset
[67761.587329] amdgpu 0000:0d:00.0: amdgpu: GPU smu mode1 reset
[67762.091974] amdgpu 0000:0d:00.0: amdgpu: GPU reset succeeded, trying to resume
[67762.092156] [drm] PCIE GART of 512M enabled (table at 0x0000008000300
[67762.092219] [drm] VRAM is lost due to GPU reset!
[67762.092220] [drm] PSP is resuming...
[67762.168492] [drm] reserve 0xa00000 from 0x8001000000 for PSP TMR
[67762.269801] amdgpu 0000:0d:00.0: amdgpu: RAS: optional ras ta ucode is not available
[67762.283510] amdgpu 0000:0d:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[67762.283513] amdgpu 0000:0d:00.0: amdgpu: SMU is resuming...
[67762.283516] amdgpu 0000:0d:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413900 (65.57.0)
[67762.283519] amdgpu 0000:0d:00.0: amdgpu: SMU driver if version not matched
[67762.283549] amdgpu 0000:0d:00.0: amdgpu: use vbios provided pptable
[67762.343739] amdgpu 0000:0d:00.0: amdgpu: SMU is resumed successfully!
[67762.345104] [drm] DMUB hardware initialized: version=0x02020017
[67762.615558] [drm] kiq ring mec 2 pipe 1 q 0
[67762.618728] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[67762.618910] [drm] JPEG decode initialized successfully.
[67762.618918] amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[67762.618921] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[67762.618922] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[67762.618924] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[67762.618925] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[67762.618926] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[67762.618927] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[67762.618929] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[67762.618930] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[67762.618931] amdgpu 0000:0d:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[67762.618933] amdgpu 0000:0d:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[67762.618934] amdgpu 0000:0d:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[67762.618936] amd...

|
#91 |
Created attachment 304307
Started testing kernel 6.4-rc3 and got the same problem.

|
#92 |
Is it worth the effort of bisecting this, given that it seems to affect a lot of kernel versions?
thanks

|
#93 |
Status = NEW after nearly 5 years?
I have the same problem
Aug 15 14:18:19 nb-tz kernel: [drm:amdgpu_
Aug 15 14:18:19 nb-tz kernel: [drm:amdgpu_

|
#94 |
AMD Vega 64 (vega10 chip)
kernel: 6.4.9
linux-firmware: 20230724
# graphical session died and had to log in again, computer didn't boot though...
aug 20 02:11:06 Zen kernel: [drm:amdgpu_
aug 20 02:11:06 Zen kernel: [drm:amdgpu_
linux-firmware: 20230810 (upgraded it... although there were no "vega10" changes in between)
# just froze for about 30s and then got unstuck again.
aug 23 23:09:24 Zen kernel: [drm:amdgpu_
aug 23 23:09:34 Zen kernel: [drm:amdgpu_
aug 23 23:09:44 Zen kernel: [drm:amdgpu_

|
#95 |
AMD Ryzen 3700U APU (Vega 10)
This issue has recently started happening, mostly when firing up games or graphically intensive tasks. One case of lockup during normal desktop use.
Worked fine on 6.4.X series (currently running on 6.4.12). However, all kernels in the 6.5 series cause the following:
[ 112.727138] [drm:amdgpu_
[ 112.728214] [drm:amdgpu_
[ 112.729270] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[ 112.885652] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[ 112.885709] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 112.886024] [drm] PCIE GART of 1024M enabled.
[ 112.886027] [drm] PTB located at 0x000000F400A00000
[ 112.886143] [drm] PSP is resuming...
[ 112.906168] [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
[ 112.985033] amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 112.992320] amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 113.733685] [drm] kiq ring mec 2 pipe 1 q 0
[ 113.998619] amdgpu 0000:04:00.0: [drm:amdgpu_
[ 113.999249] [drm:amdgpu_
[ 113.999957] amdgpu 0000:04:00.0: amdgpu: GPU reset(2) failed
[ 114.000006] amdgpu 0000:04:00.0: amdgpu: GPU reset end with ret = -110
[ 114.000010] [drm:amdgpu_

|
#96 |
I can confirm this bug
Experiencing it on an AMD Ryzen 5 3500U (Vega 8), Fedora 39 beta, kernel 6.5.2.
Also on Arch (kernel 6.5.2).
No problems on Fedora 38 (kernel 6.2.x).
In my case it happens frequently with normal desktop use on Fedora and Arch.
Sep 23 03:39:34 jackdaw kernel: [drm:amdgpu_
Sep 23 03:39:34 jackdaw kernel: [drm:amdgpu_
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Sep 23 03:39:34 jackdaw kernel: [drm] PCIE GART of 1024M enabled.
Sep 23 03:39:34 jackdaw kernel: [drm] PTB located at 0x000000F400A00000
Sep 23 03:39:34 jackdaw kernel: [drm] PSP is resuming...
Sep 23 03:39:34 jackdaw kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Sep 23 03:39:34 jackdaw kernel: [drm] kiq ring mec 2 pipe 1 q 0
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: [drm:amdgpu_
Sep 23 03:39:35 jackdaw kernel: [drm:amdgpu_
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(2) failed
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
Sep 23 03:39:35 jackdaw kernel: [drm:amdgpu_
Sep 23 03:39:35 jackdaw kernel: [drm] Skip scheduling IBs!
Sep 23 03:39:45 jackdaw kernel: [drm:amdgpu_
Sep 23 03:39:45 jackdaw kernel: [drm:amdgpu_
Sep 23 03:39:45 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
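To compare logs like the one above across machines, it can help to tally how many resets were started and how they ended. Below is a minimal sketch that matches only the message strings visible in this thread ("GPU reset begin!", "GPU reset succeeded", "GPU reset(N) failed", "GPU reset end with ret = N"); the function name is made up, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter


def summarize_resets(lines):
    """Count reset begin/succeeded/failed messages and collect return codes."""
    counts = Counter()
    ret_codes = []
    for line in lines:
        if "GPU reset begin!" in line:
            counts["begin"] += 1
        elif "GPU reset succeeded" in line:
            counts["succeeded"] += 1
        elif re.search(r"GPU reset\(\d+\) failed", line):
            counts["failed"] += 1
        match = re.search(r"GPU reset end with ret = (-?\d+)", line)
        if match:
            ret_codes.append(int(match.group(1)))
    return dict(counts), ret_codes


# Message texts copied from the log above (timestamps and host name dropped).
sample = [
    "amdgpu 0000:05:00.0: amdgpu: GPU reset begin!",
    "amdgpu 0000:05:00.0: amdgpu: MODE2 reset",
    "amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume",
    "amdgpu 0000:05:00.0: amdgpu: GPU reset(2) failed",
    "amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110",
    "amdgpu 0000:05:00.0: amdgpu: GPU reset begin!",
]
print(summarize_resets(sample))
```

On Linux, error number 110 is ETIMEDOUT, so "ret = -110" suggests that something timed out during the resume after the reset, rather than the reset being refused outright.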

|
#97 |
AMDGPU development is on its own bug tracker:
https:/
If you're still affected, check for existing bug reports and if there are none, please repost over there.

|
#98 |
I have also been having this issue. It started occurring recently (last 2-3 months). No other changes.
Mostly lockups while gaming (yuzu), one lockup because of chrome.
I was able to fix this issue by switching from HDMI to DP or DVI.

|
#99 |
Created attachment 305165
attachment-
In my case the fix was adding amdgpu.mcbp=0 to the kernel parameters.

|
#100 |
(In reply to KC from comment #94)
Did you have it set to 1 previously? If not, I'm not sure if that was the silver bullet, because it looks like it defaults to 0. https:/
mcbp (int)
It is used to enable mid command buffer preemption. (0 = disabled (default), 1 = enabled)

|
#101 |
Created attachment 305166
attachment-
The default is now -1.
https:/
https:/
I set it to zero and I haven't had a single crash since (Fedora 39 beta, Linux 6.5.5).
The change to this one parameter's default had made my system entirely unusable (it would crash very quickly after booting).
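Since the default changed between kernel versions, it may be worth checking what the loaded driver is actually using instead of relying on the documentation. The sketch below assumes the amdgpu module exposes the parameter as a readable file under /sys/module/amdgpu/parameters/ (the usual place for module parameters, but not confirmed in this thread), and it assumes -1 means "auto"; the function names are made up.

```python
from pathlib import Path

# Assumed meanings: 0 and 1 come from the documentation quoted in this
# thread; reading -1 as "auto" is an assumption.
MCBP_MEANING = {-1: "auto (driver decides)", 0: "disabled", 1: "enabled"}


def read_module_param(name, params_dir="/sys/module/amdgpu/parameters"):
    """Return the raw parameter value as a string, or None if unreadable."""
    try:
        return (Path(params_dir) / name).read_text().strip()
    except OSError:
        return None


def describe_mcbp(raw):
    """Turn a raw mcbp value into a short human-readable description."""
    if raw is None:
        return "unknown (parameter not readable)"
    try:
        value = int(raw)
    except ValueError:
        return f"unrecognised value {raw!r}"
    return MCBP_MEANING.get(value, f"unrecognised value {value}")


print(describe_mcbp(read_module_param("mcbp")))
```

On a machine without the amdgpu module loaded this simply reports the value as unknown.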

Pirouette Cacahuète (lissyx) wrote : | #1 |
- AlsaInfo.txt (91.4 KiB, text/plain; charset="utf-8")
- AudioDevicesInUse.txt (669 bytes, text/plain; charset="utf-8")
- CRDA.txt (5.8 KiB, text/plain; charset="utf-8")
- CurrentDmesg.txt (156.1 KiB, text/plain; charset="utf-8")
- Dependencies.txt (3.3 KiB, text/plain; charset="utf-8")
- IwConfig.txt (733 bytes, text/plain; charset="utf-8")
- Lspci.txt (84.9 KiB, text/plain; charset="utf-8")
- Lspci-vt.txt (2.6 KiB, text/plain; charset="utf-8")
- Lsusb.txt (1.5 KiB, text/plain; charset="utf-8")
- Lsusb-t.txt (3.0 KiB, text/plain; charset="utf-8")
- Lsusb-v.txt (143.6 KiB, text/plain; charset="utf-8")
- ProcCpuinfo.txt (24.6 KiB, text/plain; charset="utf-8")
- ProcCpuinfoMinimal.txt (1.5 KiB, text/plain; charset="utf-8")
- ProcInterrupts.txt (23.2 KiB, text/plain; charset="utf-8")
- ProcModules.txt (11.0 KiB, text/plain; charset="utf-8")
- RfKill.txt (250 bytes, text/plain; charset="utf-8")
- UdevDb.txt (454.8 KiB, text/plain; charset="utf-8")
- WifiSyslog.txt (230.4 KiB, text/plain; charset="utf-8")
- acpidump.txt (1.0 MiB, text/plain; charset="utf-8")

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed | #2 |
This change was made by a bot.
Changed in linux (Ubuntu): | |
status: | New → Confirmed |

Pirouette Cacahuète (lissyx) wrote : | #3 |

Erich Eickmeyer (eeickmeyer) wrote (last edit ): | #4 |
Working with Pirouette on IRC, we determined this may be related to https:/
They also found mentions of https:/

Mario Limonciello (superm1) wrote : | #102 |
6.5.6 has the fix for the preemption issue; it should get fixed in Mantic when the stable updates come in.
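Going only by the reports in this thread (6.4.x fine in #95, the 6.5 series affected, 6.5.6 carrying the fix), a rough check of whether an upstream kernel release falls in the reported window could look like the sketch below. The function names are made up, and the range is an assumption drawn from these comments, not from the kernel changelog.

```python
import re


def parse_release(release):
    """Parse the leading 'X.Y[.Z]' of a kernel release string into a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"cannot parse kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_reported_affected_range(release):
    """True for upstream versions from 6.5.0 up to, but not including, 6.5.6."""
    return (6, 5, 0) <= parse_release(release) < (6, 5, 6)


for release in ("6.4.12", "6.5.2", "6.5.6"):
    print(release, in_reported_affected_range(release))
```

This only makes sense for upstream version numbers. Distribution kernels backport fixes without changing the upstream part of the version: per #109, Ubuntu's 6.5.0-17 contains the fix, yet its release string still starts with 6.5.0 and would be flagged here.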

Pirouette Cacahuète (lissyx) wrote : | #103 |
Thanks, I'll try and keep you updated, however I am also facing bug 2039958 (probably a dupe of bug 2034619), so I might still need GNOME 45.1 to be released.

|
#104 |
Hello, I'm having this same issue with my ThinkPad Z16 laptop, with a Ryzen 6850H and a Radeon RX 6500M graphics card.
I do not use the laptop for gaming but for audio and video editing. I have not had trouble with any video editing software, but I can easily reproduce the issue by loading up Ardour or Mixbus32C and either leaving it alone or working. After 15 minutes the screen freezes, although audio continues for a time. At this point Ardour or Mixbus closes and I can continue using the machine. If I load up either program again it fails again, usually within a couple of minutes, and the whole laptop freezes until I press Ctrl-Alt-F2 to get to a terminal prompt.
The issue always happens when I'm recording audio with an HDMI device attached, and 90% of the time without HDMI.
I will try setting the kernel parameter amdgpu.mcbp=0 and report back.

|
#105 |
(In reply to jeremy boyd from comment #97)
I can confirm that this did not solve my problem. I tested my system for several hours with no issue and thought that perhaps it had been solved, but while doing a LibreOffice presentation with my audio software running it happened again. Here is the error from journalctl:
Oct 22 09:40:01 fedora kernel: [drm:amdgpu_
Oct 22 09:40:01 fedora kernel: [drm:amdgpu_
Oct 22 09:40:01 fedora kernel: amdgpu 0000:67:00.0: amdgpu: GPU reset begin!
Oct 22 09:40:02 fedora kernel: amdgpu 0000:67:00.0: amdgpu: MODE2 reset
Oct 22 09:40:02 fedora kernel: amdgpu 0000:67:00.0: amdgpu: GPU reset succeeded, trying to resume

|
#106 |
#98
amdgpu.mcbp=0 will only help GFX9 products. On GFX10 this is a different problem; please open an issue on the AMD GitLab.
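If you are unsure which graphics generation your GPU belongs to, the driver's boot messages name the graphics IP block, as in the "add ip block number 6 <gfx_v10_0>" line that appears in a dmesg excerpt in this thread. A small sketch that extracts it follows; the function names are made up, the message format is assumed to match that excerpt (other kernel versions may print it differently), and the gfx_v9_0 line is a hypothetical example not taken from this thread.

```python
import re

# Format as seen in this thread: "[drm] add ip block number 6 <gfx_v10_0>"
GFX_BLOCK_RE = re.compile(r"add ip block number \d+ <gfx_v(\d+)_(\d+)>")


def gfx_ip_version(dmesg_lines):
    """Return (major, minor) of the first gfx IP block found, else None."""
    for line in dmesg_lines:
        match = GFX_BLOCK_RE.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def mcbp_workaround_may_apply(version):
    """Per the comment above, amdgpu.mcbp=0 is only expected to help GFX9."""
    return version is not None and version[0] == 9


seen_in_thread = ["[   26.435986] [drm] add ip block number 6 <gfx_v10_0>"]
hypothetical_gfx9 = ["[    1.000000] [drm] add ip block number 6 <gfx_v9_0>"]

print(gfx_ip_version(seen_in_thread), mcbp_workaround_may_apply(gfx_ip_version(seen_in_thread)))
print(gfx_ip_version(hypothetical_gfx9), mcbp_workaround_may_apply(gfx_ip_version(hypothetical_gfx9)))
```

Feeding it the output of dmesg, one line per list entry, gives the generation for the local machine.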

Launchpad Janitor (janitor) wrote : | #107 |
Status changed to 'Confirmed' because the bug affects multiple users.
Changed in mesa (Ubuntu): | |
status: | New → Confirmed |

Pirouette Cacahuète (lissyx) wrote : | #108 |
There's a 6.5.0-15 package incoming on mantic-update; does it contain the fix?

Timo Aaltonen (tjaalton) wrote : | #109 |
no, -17 does

|
#110 |
I am pretty sure I have amdgpu.mcbp=0 set, and after moving to Ubuntu 24.04 LTS, doing just about anything crashes the GPU.
Opening a web browser crashes it; then I have to SSH in and restart the desktop session.
GL_VENDOR: AMD
GL_RENDERER: AMD Radeon RX 6800 XT (radeonsi, navi21, LLVM 15.0.7, DRM 3.57, 6.8.0-31-generic)
GL_VERSION: 4.6 (Compatibility Profile) Mesa 24.2~git2406010
6.8.0-31-generic
[ 26.417827] [drm] amdgpu kernel modesetting enabled.
[ 26.431708] amdgpu: Virtual CRAT table created for CPU
[ 26.431727] amdgpu: Topology: Add CPU node
[ 26.431934] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1043:0x04F0 0xC1).
[ 26.431949] [drm] register mmio base: 0xFC900000
[ 26.431951] [drm] register mmio size: 1048576
[ 26.435975] [drm] add ip block number 0 <nv_common>
[ 26.435978] [drm] add ip block number 1 <gmc_v10_0>
[ 26.435980] [drm] add ip block number 2 <navi10_ih>
[ 26.435982] [drm] add ip block number 3 <psp>
[ 26.435983] [drm] add ip block number 4 <smu>
[ 26.435985] [drm] add ip block number 5 <dm>
[ 26.435986] [drm] add ip block number 6 <gfx_v10_0>
[ 26.435988] [drm] add ip block number 7 <sdma_v5_2>
[ 26.435990] [drm] add ip block number 8 <vcn_v3_0>
[ 26.435996] [drm] add ip block number 9 <jpeg_v3_0>
[ 26.436013] amdgpu 0000:0e:00.0: No more image in the PCI ROM
[ 26.436028] amdgpu 0000:0e:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 26.436031] amdgpu: ATOM BIOS: 115-D412BS0-101
[ 26.473962] [drm] VCN(0) decode is enabled in VM mode
[ 26.473965] [drm] VCN(1) decode is enabled in VM mode
[ 26.473967] [drm] VCN(0) encode is enabled in VM mode
[ 26.473968] [drm] VCN(1) encode is enabled in VM mode
[ 26.477565] [drm] JPEG decode is enabled in VM mode
[ 26.477596] amdgpu 0000:0e:00.0: vgaarb: deactivate vga console
[ 26.478479] Console: switching to colour dummy device 80x25
[ 26.478490] amdgpu 0000:0e:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 26.478548] amdgpu 0000:0e:00.0: amdgpu: MEM ECC is not presented.
[ 26.478550] amdgpu 0000:0e:00.0: amdgpu: SRAM ECC is not presented.
[ 26.478570] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 26.478577] amdgpu 0000:0e:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[ 26.478580] amdgpu 0000:0e:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 26.478588] [drm] Detected VRAM RAM=16368M, BAR=256M
[ 26.478589] [drm] RAM width 256bits GDDR6
[ 26.478734] [drm] amdgpu: 16368M of VRAM memory ready
[ 26.478739] [drm] amdgpu: 64363M of GTT memory ready.
[ 26.478768] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 26.478919] [drm] PCIE GART of 512M enabled (table at 0x0000008000900
[ 27.968739] amdgpu 0000:0e:00.0: amdgpu: STB initialized to 2048 entries
[ 27.969354] [drm] Loading DMUB firmware via PSP: version=0x02020020
[ 27.969777] [drm] use_doorbell being set to: [true]
[ 27.969791] [drm] use_doorbell being set to: [true]
[ 27.969803] [drm] use_doorbell being set to: [true]
[ ...

|
#111 |
#100:
You have a GFX10 product, which is not affected by amdgpu.mcbp=0/1; that parameter only matters for GFX9. Please open your own issue for it. Also, in the kernel bug tracker please only report issues with mainline kernels; 6.8 is already EOL.
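
The GFX generation can be read from the `add ip block` lines in a dmesg excerpt such as the one quoted above, where the graphics block is named `gfx_v10_0`. A minimal Python sketch under that assumption follows; the sample lines are copied from that log and the function names are illustrative.

```python
import re

# A few of the "add ip block" lines from the dmesg excerpt above.
SAMPLE = """\
[ 26.435975] [drm] add ip block number 0 <nv_common>
[ 26.435986] [drm] add ip block number 6 <gfx_v10_0>
[ 26.435988] [drm] add ip block number 7 <sdma_v5_2>
"""

IP_BLOCK_RE = re.compile(r"add ip block number (\d+) <([a-z0-9_]+)>")


def ip_blocks(text):
    """Map each IP block name to the block number printed in the log."""
    return {name: int(number) for number, name in IP_BLOCK_RE.findall(text)}


def gfx_major(text):
    """Return the major version of the gfx IP block (10 for gfx_v10_0), or None."""
    for name in ip_blocks(text):
        match = re.match(r"gfx_v(\d+)_", name)
        if match:
            return int(match.group(1))
    return None
```

With the sample above, `gfx_major(SAMPLE)` gives 10, which matches the GFX10 classification in this comment.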

|
#112 |
The issue seems to occur only with Xorg; I used Wayland today and could not trigger it.

|
#113 |
and 6.9.3 also crashed

Pirouette Cacahuète (lissyx) wrote : | #114 |
I've been on 6.11 since I moved to 24.10, but I had not experienced the issue for quite some time, even when on 24.04.
Error message: PROTECTION_FAULT_ADDR 0x00000000 PROTECTION_FAULT_STATUS 0x0604800C job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=37241, emitted seq=37244
[Dec 5 22:08] amdgpu 0000:23:00.0: GPU fault detected: 146 0x0000480c for process yuzu pid 2920 thread yuzu:cs0 pid 2935
[ +0.000005] amdgpu 0000:23:00.0: VM_CONTEXT1_
[ +0.000002] amdgpu 0000:23:00.0: VM_CONTEXT1_
[ +0.000003] amdgpu 0000:23:00.0: VM fault (0x0c, vmid 3, pasid 32770) at page 0, read from 'TC4' (0x54433400) (72)
[ +10.053011] [drm:amdgpu_
[ +0.000007] [drm] GPU recovery disabled.
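
The fault line prints the memory client both as text ('TC4') and as a hex word (0x54433400); the hex word is just the ASCII bytes of that tag. The timeout line also carries two fence sequence numbers. Here is a small Python sketch, assuming the difference between the emitted and signaled numbers can be read as the number of submissions still outstanding when the timeout fired (the function names are illustrative):

```python
import re

SEQ_RE = re.compile(r"signaled seq=(\d+), emitted seq=(\d+)")


def decode_client_tag(value):
    """Turn a 32-bit client word such as 0x54433400 into its ASCII tag."""
    raw = value.to_bytes(4, "big")
    return raw.rstrip(b"\x00").decode("ascii")


def pending_fences(line):
    """Return emitted minus signaled sequence number, or None if the line has neither."""
    match = SEQ_RE.search(line)
    if match is None:
        return None
    signaled, emitted = (int(group) for group in match.groups())
    return emitted - signaled
```

For the values in this report, `decode_client_tag(0x54433400)` yields `TC4` and the timeout line yields a gap of 3.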
How to reproduce the issue:
1. Start the yuzu emulator.
2. Load Super Mario Odyssey.
3. Start a new game.
4. When Mario is about to jump for the first time after being woken up by Cappy, the bug occurs.
During the issue, the following occurred:
1. Graphics locked up.
2. The system could still be accessed through SSH.
System specification:
Debian Sid
Radeon RX 580
I have tried the following combinations:
1. Kernel 4.17, 4.18, 4.19, 4.20, drm-next-4.21.wip
2. Mesa 18.2, 18.3, 19.0-development branch
None of the above combinations fixes the issue. Let me know if you need more information or more testing from me.