linux-firmware 1.197 causes kernel to report error "amdgpu: [gfxhub0] retry page fault"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
AMD | Fix Released | Undecided | Unassigned |
linux-firmware (Ubuntu) | Fix Released | High | Unassigned |
Focal | New | Undecided | Juerg Haefliger |
Hirsute | Won't Fix | Undecided | Juerg Haefliger |
Bug Description
After upgrading linux-firmware from 1.190.5 to 1.197 (as part of the upgrade from Ubuntu 20.10 to 21.04), I started experiencing frequent and severe GPU instability. When this happens, I see this error in dmesg:
[20061.061069] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg pid 1141 thread Xorg:cs0 pid 1236)
[20061.061103] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x800000401000 from client 27
[20061.061135] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTIO
[20061.061147] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[20061.061157] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
[20061.061167] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
[20061.061174] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[20061.061183] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
[20061.061189] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
I'll attach a couple of full dmesgs that I collected.
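For anyone comparing their own logs against this report, the fault header line above can be picked apart with a regular expression. This is only a minimal sketch: the pattern is inferred from the single log line quoted in this report, the kernel's message format may differ between versions, and `parse_retry_page_fault` is a hypothetical helper name, not part of any existing tool.

```python
import re

# Pattern inferred from the one "retry page fault" line quoted in this report.
# Assumption: other kernel versions may format this message differently.
FAULT_RE = re.compile(
    r"\[\s*(?P<timestamp>\d+\.\d+)\] amdgpu (?P<device>[0-9a-f:.]+): amdgpu: "
    r"\[(?P<hub>\w+)\] retry page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), for process (?P<process>.+?) "
    r"pid (?P<pid>\d+) thread (?P<thread>.+?) pid (?P<thread_pid>\d+)\)"
)


def parse_retry_page_fault(line):
    """Return the fields of a retry page fault line as a dict, or None if no match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert numeric fields so they can be compared or sorted directly.
    fields["timestamp"] = float(fields["timestamp"])
    for key in ("src_id", "ring", "vmid", "pasid", "pid", "thread_pid"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = (
        "[20061.061069] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] retry page fault "
        "(src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg pid 1141 "
        "thread Xorg:cs0 pid 1236)"
    )
    print(parse_retry_page_fault(sample))
```

Feeding it each line of a saved dmesg and keeping the non-`None` results gives a quick list of which processes were hit and when.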
In many cases when this happens, the screen and keyboard freeze irreversibly (I tried waiting for more than 30 minutes, but it didn't help). I can still log in via ssh, though. When there's no freeze, I can continue using the computer normally, but the laptop fans keep running and the battery depletes fast. There's probably something stuck in a permanent loop, either in the kernel or in the GPU.
This bug happens several times a day, rendering the machine so unstable as to be almost unusable. It is a severe regression and I'm aghast that it passed AMD's Quality Assurance.
After downgrading back to linux-firmware 1.190.5, the machine is back to its previous, mostly reliable state. That is to say, this bug is gone, and I'm just left with the other amdgpu suspend bug I've learned to live with since I bought this computer.
Please revert the amdgpu firmware in this package as soon as possible. This is unbearable.
Relevant information:
Ubuntu version: 21.04
Linux kernel: 5.11.0-17-generic x86_64
CPU model: AMD Ryzen 7 3700U with Radeon Vega Mobile Gfx
GPU: 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c1)
Laptop model: Lenovo Ideapad S145
no longer affects: mesa (Ubuntu Hirsute)
Changed in mesa (Ubuntu):
status: Confirmed → Invalid
Changed in linux-firmware (Ubuntu Hirsute):
status: New → Confirmed
Changed in linux-firmware (Ubuntu):
status: Confirmed → Invalid
Changed in linux-firmware (Ubuntu Hirsute):
assignee: nobody → Juerg Haefliger (juergh)
Changed in linux-firmware (Ubuntu):
assignee: Seth Forshee (sforshee) → nobody
This is the dmesg of an instance where I was able to continue using the laptop despite the GPU bug (in the case of the dmesg I attached previously, I had to ssh into the machine to turn it off).
Notice that there are two instances of the retry page fault, one of them within 15 minutes of the machine being turned on. There was no suspend/resume event this time; the laptop was on the whole time.
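To check how soon after boot each fault occurred in a saved dmesg, something like the following could be used. It is a rough sketch that assumes the default dmesg format, where the bracketed number at the start of each line is seconds since boot; `fault_uptimes` and `format_uptime` are hypothetical helper names chosen for illustration.

```python
import re

# Assumption: default dmesg output, with "[seconds.microseconds]" at line start.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def fault_uptimes(dmesg_text):
    """Return the timestamps (seconds since boot) of all retry page fault lines."""
    uptimes = []
    for line in dmesg_text.splitlines():
        if "retry page fault" not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if match:
            uptimes.append(float(match.group(1)))
    return uptimes


def format_uptime(seconds):
    """Format a number of seconds as hours, minutes and whole seconds."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for uptime in fault_uptimes(sys.stdin.read()):
        print(format_uptime(uptime))
```

For the fault quoted in the bug description, timestamp 20061.061069 corresponds to roughly 5h 34m after boot.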