Gpu watchdog segfault and video+kbd+mouse freeze on optiplex 7060 intel gpu
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Incomplete | Undecided | Unassigned | | |
Bug Description
Running up-to-date Ubuntu-18.04.3 with kernel 5.3.0-26 on a Dell Optiplex 7060 with an i7-8700 CPU and Intel UHD Graphics 630 (Coffeelake 3x8 GT2).
I had Chrome, Slack and vmware-player running in GNOME. While doing some git clone, screen+
kernel: show_signal_msg: 2 callbacks suppressed
kernel: GpuWatchdog[20399]: segfault at 0 ip 0000556fd1665ded sp 00007efbf17e46c0 error 6 in chrome[
kernel: Code: 48 c1 c9 03 48 81 f9 af 00 00 00 0f 87 c9 00 00 00 48 8d 15 a9 5a 9c fb f6 04 11 20 0f 84 b8 00 00 00 be 01 00 00 00 ff 50 30 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 6d
kernel: nvme nvme0: I/O 202 QID 6 timeout, aborting
kernel: nvme nvme0: I/O 203 QID 6 timeout, aborting
kernel: nvme nvme0: I/O 204 QID 6 timeout, aborting
kernel: nvme nvme0: I/O 205 QID 6 timeout, aborting
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: I/O 202 QID 6 timeout, reset controller
kernel: nvme nvme0: 12/0/0 default/read/poll queues
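To compare the NVMe timeouts above across freezes, a minimal sketch like the following could tally them per controller, queue ID and action. The function name `summarize_nvme_timeouts` and the regular expression are illustrative choices of mine, written against the message format shown in the log above. They are not part of any existing tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   kernel: nvme nvme0: I/O 202 QID 6 timeout, aborting
# The pattern is an assumption based only on the lines quoted in this report.
NVME_TIMEOUT = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, (?P<action>.+)$"
)


def summarize_nvme_timeouts(lines):
    """Count timeout events per (controller, queue ID, action)."""
    counts = Counter()
    for line in lines:
        match = NVME_TIMEOUT.search(line.strip())
        if match:
            key = (match["ctrl"], int(match["qid"]), match["action"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: nvme nvme0: I/O 202 QID 6 timeout, aborting",
        "kernel: nvme nvme0: I/O 203 QID 6 timeout, aborting",
        "kernel: nvme nvme0: I/O 204 QID 6 timeout, aborting",
        "kernel: nvme nvme0: I/O 205 QID 6 timeout, aborting",
        "kernel: nvme nvme0: Abort status: 0x0",
        "kernel: nvme nvme0: I/O 202 QID 6 timeout, reset controller",
    ]
    for key, count in summarize_nvme_timeouts(sample).items():
        print(key, count)
```

The input could be the output of `journalctl -k -b -1` (kernel messages from the previous boot), assuming the journal is persistent on the affected machine.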
While writing this bug report, the system froze again, and this time it didn't recover. After a cold reset I didn't see any other GpuWatchdog messages in journalctl.
Ubuntu applied a BIOS firmware update before the first freeze, so my BIOS was updated as part of the cold reset I did. I'm not sure whether this is relevant to reproducing the freeze.
The issue occurred again after the BIOS update, during make -j12. I also had Chrome and vmplayer running. Kernel errors from journalctl:
kernel: pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
kernel: pcieport 0000:00:1b.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
kernel: pcieport 0000:00:1b.0: AER: device [8086:a340] error status/mask=00001000/00002000
kernel: pcieport 0000:00:1b.0: AER: [12] Timeout
kernel: nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
kernel: nvme 0000:01:00.0: AER: device [1344:5410] error status/mask=00000040/00002000
kernel: nvme 0000:01:00.0: AER: [ 6] BadTLP
kernel: nvme 0000:01:00.0: AER: Error of this Agent is reported first
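For anyone cross-checking the status/mask values in the AER lines above against the bit numbers the kernel prints ([12] and [ 6]), here is a minimal sketch that decodes a correctable error value into bit names. The bit table follows the PCIe AER Correctable Error Status register layout. The function name `decode_correctable` is my own, and the kernel's short labels (for example "Timeout", "BadTLP") differ slightly from the long names used here.

```python
# Bit positions of the PCIe AER Correctable Error Status/Mask registers.
AER_CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(value):
    """Return a list of (bit, name) pairs for every bit set in value."""
    decoded = []
    for bit in range(32):
        if value & (1 << bit):
            name = AER_CORRECTABLE_BITS.get(bit, "reserved/unknown")
            decoded.append((bit, name))
    return decoded


if __name__ == "__main__":
    # Values copied from the log in this report.
    print("pcieport status:", decode_correctable(0x00001000))
    print("nvme status:    ", decode_correctable(0x00000040))
    print("mask (both):    ", decode_correctable(0x00002000))
```

Decoded this way, the root port (8086:a340) reports a replay timer timeout while the NVMe device (1344:5410) reports a bad TLP, both correctable data link layer errors on the same link. The mask value only shows that advisory non-fatal errors are masked.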