Linux/ext4/nvme crashes during high IO

2

I get random crashes during mvn compilation.

The problem seems related to high IO; in kern.log I see entries like:

kernel: [158430.895045] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
kernel: [158430.951331] blk_update_request: I/O error, dev nvme0n1, sector 819134096 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
kernel: [158430.995307] nvme nvme1: Removing after probe failure status: -19
kernel: [158431.035065] blk_update_request: I/O error, dev nvme0n1, sector 253382656 op 0x1:(WRITE) flags 0x4000 phys_seg 127 prio class 0
kernel: [158431.035083] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:309: I/O error 10 writing to inode 3933601 (offset 16777216 size 2101248 starting block 31672832)
kernel: [158431.035085] Buffer I/O error on device nvme0n1p1, logical block 31672320
kernel: [158431.035090] ecryptfs_write_inode_size_to_header: Error writing file size to header; rc = [-5]
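
To see which device and which kind of operation the block layer is complaining about, the log can be tallied with a few lines of Python. This is only a rough sketch: the regular expression is modelled on the blk_update_request lines quoted above, and count_io_errors is a made-up helper name, not part of any tool.

```python
import re
from collections import Counter

# Modelled on lines such as:
#   blk_update_request: I/O error, dev nvme0n1, sector 819134096 op 0x0:(READ) ...
IO_ERROR_RE = re.compile(
    r"I/O error, dev (?P<dev>[^,\s]+), sector \d+ op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)

def count_io_errors(lines):
    """Return a Counter keyed by (device, operation) for block-layer I/O errors."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts
```

To run it against the real log, pass an open file, for example count_io_errors(open("/var/log/kern.log", errors="replace")). The path is the usual Ubuntu location and may differ on other systems.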

To replicate the error, I use:

stress-ng --all 8  --timeout 60s --metrics-brief --tz

I've tried some boot options, like adding acpiphp.disable=1 pcie_aspm=off to /etc/default/grub. This seemed to help with the stress-ng test, but not with my compilation.
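
For reference, boot options like these normally go into the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub. A minimal sketch of that edit, assuming the line uses double quotes as in the stock Ubuntu file (add_kernel_params is a hypothetical helper, and it only transforms text, it does not write the file):

```python
import re

def add_kernel_params(grub_text, params):
    """Return grub_text with params appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Parameters already present are left alone, so repeating the edit is harmless.
    """
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)

    def merge(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, replaced = pattern.subn(merge, grub_text, count=1)
    if replaced == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text
```

After the file is changed, update-grub has to be run and the machine rebooted; /proc/cmdline then shows whether the options were actually picked up.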

  • Distribution: Ubuntu 19.10
  • Kernel: 5.3.0-45-generic #37-Ubuntu SMP Thu Mar 26 20:41:27 UTC 2020

nvme list shows:

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     28FF72PTFQAS         KXG50ZNV256G NVMe TOSHIBA 256GB          1         256,06  GB / 256,06  GB    512   B +  0 B   AADA4102
/dev/nvme1n1     37DS103NTEQT         THNSN5512GPU7 NVMe TOSHIBA 512GB         1         512,11  GB / 512,11  GB    512   B +  0 B   57DC4102
linux
ssd
ext4
io
nvme
asked on Server Fault Apr 6, 2020 by Brimstedt • edited Apr 6, 2020 by Brimstedt

3 Answers

2

I can't tell you exactly where the problem is, as this is just a "generic failure" somewhere in the NVMe subsystem. But I can suggest things to try to pinpoint it.

  1. Try adding the nvme_core.default_ps_max_latency_us=5500 kernel boot option.
  2. Install the nvme-cli package (or, even better, build the most recent version from source) and check the various logs with it, such as smart-log and error-log. That might help diagnose the error further.
  3. Try booting some other distros (live) and stress-testing under them to see whether this is related to the kernel version or the distro. SystemRescueCd might be a good starting point.
  4. If that doesn't help, try updating your motherboard firmware (the "BIOS", which is actually UEFI these days) to the most recent version. This may not seem obvious, and the release notes might not mention anything directly related to the NVMe/PCIe subsystems, but in practice it sometimes helps.
  5. Update your NVMe drive firmware. Look for vendor-supplied tools and a manual for this.
  6. If none of the above helps or gives any clues, you may have hit an as-yet-unknown bug or a hardware failure.
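
For step 2, the smart-log counters can also be pulled out programmatically. The following is a rough sketch, not a definitive recipe: it assumes nvme-cli is installed, that it accepts -o json for smart-log, and that the JSON uses the field names listed in FIELDS_OF_INTEREST (these names are an assumption and may differ between nvme-cli versions). Both function names are made up for the example.

```python
import json
import subprocess

# Assumed field names in nvme-cli's JSON smart-log output; verify on your version.
FIELDS_OF_INTEREST = ("critical_warning", "media_errors", "num_err_log_entries")

def summarize_smart_log(raw_json):
    """Pick the error-related counters out of a smart-log JSON document.

    Fields missing from the document are reported as None.
    """
    data = json.loads(raw_json)
    return {name: data.get(name) for name in FIELDS_OF_INTEREST}

def read_smart_log(device):
    """Run `nvme smart-log <device> -o json` (usually needs root) and summarize it."""
    result = subprocess.run(
        ["nvme", "smart-log", device, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return summarize_smart_log(result.stdout)
```

Calling read_smart_log("/dev/nvme0") and read_smart_log("/dev/nvme1") before and after a stress run shows whether either controller's error counters are moving.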
answered on Server Fault Apr 7, 2020 by NStorm
0

I noticed that the errors only occurred on one of the SSDs, the one containing /home.

I moved /home to the other disk in the machine, and so far it seems to be working much better.

answered on Server Fault Apr 6, 2020 by Brimstedt • edited Apr 8, 2020 by Brimstedt
-2

A quick thing to try is to hot-swap the hard drive driver.

For high-performance IO you can't go cheap, though. Check the max latency and see how far you are going over it. Maybe you are just attempting something that demands a better driver in the kernel.

Look for a cmake setting or compiler argument that limits the build to one thread or reduces IO, to slow it down somehow. If you are very desperate and can pause the process manually from the terminal, you might be able to simulate a compile.

The only other quick option is to make a VM of that machine, compile in the VM, and debug it live.

answered on Server Fault Apr 7, 2020 by Georgiy Chipunov

User contributions licensed under CC BY-SA 3.0