We have a server running Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-88-generic x86_64) with eight Samsung 980 Pro PCIe 4.0 NVMe SSDs (model MZ-V8P1T0BW). At least 5 times over the past month, one of the SSDs has stopped responding: I/O requests time out and the controller then fails to reset. It is not always the same SSD. We first saw it on 5.4.0-81. Some samples from dmesg are below.
This is a production system that runs a set of virtual desktop instances. Thankfully the SSDs are in a ZFS pool made up of four mirrored (RAID 1) pairs, so the only outage we've had so far was when the problem hit both members of a mirrored pair. After a reboot the SSDs come back up.
[Mon Sep 6 12:58:36 2021] nvme nvme5: I/O 132 QID 46 timeout, aborting
[Mon Sep 6 12:58:37 2021] nvme nvme5: I/O 133 QID 46 timeout, aborting
[Mon Sep 6 12:58:39 2021] nvme nvme5: I/O 134 QID 46 timeout, aborting
[Mon Sep 6 12:58:40 2021] nvme nvme5: I/O 135 QID 46 timeout, aborting
[Mon Sep 6 12:58:40 2021] nvme nvme5: I/O 784 QID 48 timeout, aborting
[Mon Sep 6 12:58:41 2021] nvme nvme5: I/O 136 QID 46 timeout, aborting
[Mon Sep 6 12:58:41 2021] nvme nvme5: I/O 137 QID 46 timeout, aborting
[Mon Sep 6 12:58:42 2021] nvme nvme5: I/O 492 QID 28 timeout, aborting
[Mon Sep 6 12:59:07 2021] nvme nvme5: I/O 132 QID 46 timeout, reset controller
[Mon Sep 6 12:59:38 2021] nvme nvme5: I/O 24 QID 0 timeout, reset controller
[Mon Sep 6 13:00:29 2021] nvme nvme5: Device not ready; aborting reset
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:33 2021] INFO: task txg_quiesce:2172 blocked for more than 120 seconds.
[Mon Sep 6 13:00:33 2021] Tainted: P OE 5.4.0-81-generic #91-Ubuntu
[Tue Sep 21 21:18:36 2021] nvme nvme2: I/O 175 QID 38 timeout, aborting
[Tue Sep 21 21:18:37 2021] nvme nvme2: I/O 240 QID 26 timeout, aborting
[Tue Sep 21 21:18:47 2021] nvme nvme2: I/O 718 QID 23 timeout, aborting
[Tue Sep 21 21:18:56 2021] nvme nvme2: I/O 719 QID 23 timeout, aborting
[Tue Sep 21 21:19:06 2021] nvme nvme2: I/O 175 QID 38 timeout, reset controller
[Tue Sep 21 21:19:37 2021] nvme nvme2: I/O 17 QID 0 timeout, reset controller
[Tue Sep 21 21:20:27 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:47 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:47 2021] nvme nvme2: Removing after probe failure status: -19
[Tue Sep 21 21:21:08 2021] nvme nvme2: Device not ready; aborting reset
[Tue Oct 5 16:54:59 2021] nvme nvme6: I/O 1013 QID 38 timeout, aborting
[Tue Oct 5 16:54:59 2021] nvme nvme6: I/O 727 QID 39 timeout, aborting
[Tue Oct 5 16:55:03 2021] nvme nvme6: I/O 1014 QID 38 timeout, aborting
[Tue Oct 5 16:55:05 2021] nvme nvme6: I/O 1015 QID 38 timeout, aborting
[Tue Oct 5 16:55:25 2021] nvme nvme6: I/O 15 QID 21 timeout, aborting
[Tue Oct 5 16:55:25 2021] nvme nvme6: I/O 408 QID 37 timeout, aborting
[Tue Oct 5 16:55:29 2021] nvme nvme6: I/O 1013 QID 38 timeout, reset controller
[Tue Oct 5 16:55:59 2021] nvme nvme6: I/O 11 QID 0 timeout, reset controller
[Tue Oct 5 16:56:51 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:57:11 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct 5 16:57:11 2021] nvme nvme6: Removing after probe failure status: -19
[Tue Oct 5 16:57:32 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct 5 16:57:32 2021] blk_update_request: I/O error, dev nvme6n1, sector 842198232 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
[Mon Oct 11 12:14:38 2021] nvme nvme2: I/O 306 QID 48 timeout, aborting
[Mon Oct 11 12:14:39 2021] nvme nvme2: I/O 827 QID 14 timeout, aborting
[Mon Oct 11 12:15:01 2021] nvme nvme2: I/O 828 QID 14 timeout, aborting
[Mon Oct 11 12:15:05 2021] nvme nvme2: I/O 829 QID 14 timeout, aborting
[Mon Oct 11 12:15:07 2021] nvme nvme2: I/O 830 QID 14 timeout, aborting
[Mon Oct 11 12:15:08 2021] nvme nvme2: I/O 306 QID 48 timeout, reset controller
[Mon Oct 11 12:15:38 2021] nvme nvme2: I/O 20 QID 0 timeout, reset controller
[Mon Oct 11 12:16:29 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:50 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:16:50 2021] nvme nvme2: Removing after probe failure status: -19
[Mon Oct 11 12:17:10 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:17:10 2021] blk_update_request: I/O error, dev nvme2n1, sector 1159355592 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Mon Oct 11 12:17:10 2021] blk_update_request: I/O error, dev nvme2n1, sector 992254136 op 0x1:(WRITE) flags 0x0 phys_seg 3 prio class 0
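To keep track of which drive is affected in each incident, excerpts like the ones above can be tallied per device. The following is only a minimal sketch, assuming the timestamped line format shown in the samples (as printed by `dmesg -T`); the function name `summarize_nvme_events` and the event labels are hypothetical and not part of any existing tool.

```python
import re
from collections import Counter, defaultdict

# Matches nvme driver lines such as:
# [Mon Sep 6 12:58:36 2021] nvme nvme5: I/O 132 QID 46 timeout, aborting
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+nvme (?P<dev>nvme\d+): (?P<msg>.*)$")


def classify(msg):
    """Map a driver message to a coarse event label (labels are our own)."""
    if msg.endswith("timeout, aborting"):
        return "io_timeout"
    if msg.endswith("timeout, reset controller"):
        return "controller_reset"
    if msg.startswith("Device not ready"):
        return "reset_failed"
    if msg.startswith("Removing after probe failure"):
        return "removed"
    if msg.startswith("Abort status"):
        return "abort_status"
    return "other"


def summarize_nvme_events(lines):
    """Count event labels per NVMe controller; non-nvme lines are ignored."""
    summary = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            summary[match.group("dev")][classify(match.group("msg"))] += 1
    return {dev: dict(counts) for dev, counts in summary.items()}


# A few lines copied from the Sep 21 sample above.
sample = [
    "[Tue Sep 21 21:18:36 2021] nvme nvme2: I/O 175 QID 38 timeout, aborting",
    "[Tue Sep 21 21:19:06 2021] nvme nvme2: I/O 175 QID 38 timeout, reset controller",
    "[Tue Sep 21 21:20:27 2021] nvme nvme2: Device not ready; aborting reset",
    "[Tue Sep 21 21:20:47 2021] nvme nvme2: Removing after probe failure status: -19",
]

for dev, counts in sorted(summarize_nvme_events(sample).items()):
    print(dev, counts)
```

The same function can be fed the lines of a saved dmesg capture instead of the inline sample. Lines that do not come from the nvme driver, such as the `blk_update_request` errors, are skipped by the pattern.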