Samsung NVMe disappears when server is under average-to-high load


Hello to the community!

Let me get into the details of the problem itself.

We have a total of 6 SuperMicro servers in a cluster. The problem occurs only on 3 of them, which are the newest generation.

The same thing happened on all three nodes (at different times, on the same day). I strongly doubt it is an actual hardware problem, although I am not excluding that as a cause.

Problem description and occurrences: seemingly random (the only connection so far is that the nodes seem to be under average-to-high load when it happens). What it looks like: all of a sudden one NVMe disappears until the server is restarted. If we do not restart, high lag is experienced across the whole cluster. A second workaround for the lag is to bring the affected machine's gluster node down.

In kern.log you first find messages about the failed NVMe, followed by lots of soft lockup messages: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s!. For some reason there are also lots of errors concerning the i40e driver.

A detailed kern.log excerpt (Call Trace) is included below; I filtered out some of the content and can provide an even more detailed one if needed.
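In case it helps, this is roughly how we filter kern.log for the relevant messages. A minimal sketch: the nvme line is copied verbatim from our dump, the soft-lockup line is the message quoted above with an illustrative timestamp, and the /tmp path is just for demonstration.

```shell
# Sample kern.log lines (nvme line is verbatim from our logs; the soft-lockup
# timestamp is illustrative; the i40e line is there to show what gets dropped).
cat > /tmp/kern_sample.log <<'EOF'
[810899.851097] nvme 0000:1c:00.0: Failed status: 0xffffffff, reset controller.
[810914.112233] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s!
[811176.155839] i40e 0000:b7:00.1: Error I40E_AQ_RC_EINVAL adding RX filters on PF, promiscuous mode
EOF
# Keep only the NVMe / lockup / APEI lines:
grep -E 'nvme|soft lockup|Hardware Error' /tmp/kern_sample.log
```

On the real machines we run the same grep against /var/log/kern.log.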

Error recorded in the IPMI event log (on all three servers): Critical Interrupt, PCI PERR @Bus16 (Dev0, Func0) - Assertion

The "failing" ones:

  • Supermicro Super Server/X11SDV-16C-TP8F; Intel(R) Xeon(R) D-2183IT - node 1
  • Supermicro Super Server/X11SDV-16C-TP8F; Intel(R) Xeon(R) D-2183IT - node 2
  • Supermicro Super Server/X11SDV-8C-TP8F; Intel(R) Xeon(R) D-2146NT - node 3
  • All with the same NVMe on PCIe - SAMSUNG MZ1LB3T8HMLA-00007 - 4TB
  • all of the machines are running Debian 9
  • node 1 - (which is currently free of any VMs) - 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3
  • node 2 and 3 - kernel 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1
  • all of them are running xen-hypervisor-4.8.5, drbd, and gluster.

Currently node 1 is the "backupNode" - it has no running VMs, but it holds all resources from nodes 2 and 3 in case either of them dies on us.

Some errors from dump:

[810899.851097] nvme 0000:1c:00.0: Failed status: 0xffffffff, reset controller.

[810972.208480] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[810972.208482]  [<ffffffff81513fe6>] ? net_rx_action+0x246/0x380
[810972.208485] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[810972.208488]  [<ffffffff81621ead>] ? __do_softirq+0x10d/0x2b0
[810972.208489] {1}[Hardware Error]: event severity: corrected
[810972.208492]  [<ffffffff81081fa2>] ? irq_exit+0xc2/0xd0
[810972.208496] {1}[Hardware Error]:  Error 0, type: corrected
[810972.208499]  [<ffffffff814152c1>] ? xen_evtchn_do_upcall+0x31/0x50
[810972.208502] {1}[Hardware Error]:   section_type: PCIe error
[810972.208504]  [<ffffffff8161f1de>] ? xen_do_hypervisor_callback+0x1e/0x40
[810972.208507] {1}[Hardware Error]:   port_type: 4, root port
[810972.208510]  <EOI> 
[810972.208510] {1}[Hardware Error]:   version: 3.0
[810972.208512]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[810972.208515] {1}[Hardware Error]:   command: 0x0547, status: 0x4010
[810972.208518]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[810972.208521] {1}[Hardware Error]:   device_id: 0000:16:00.0
[810972.208523]  [<ffffffff8101be5c>] ? xen_safe_halt+0xc/0x20
[810972.208525] {1}[Hardware Error]:   slot: 0
[810972.208527]  [<ffffffff8161d74a>] ? default_idle+0x1a/0xd0
[810972.208529] {1}[Hardware Error]:   secondary_bus: 0x17
[810972.208533]  [<ffffffff810bf57a>] ? cpu_startup_entry+0x1ca/0x240
[810972.208536] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2030
[810972.208537] Task dump for CPU 10:
[810972.208539] {1}[Hardware Error]:   class_code: 060400
[810972.208543] ksoftirqd/10    R
[810972.208543] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0013

[811176.155839] i40e 0000:b7:00.1: Error I40E_AQ_RC_EINVAL adding RX filters on PF, promiscuous mode
[811177.445098] nvme 0000:1c:00.0: Refused to change power state, currently in D3
[811178.748864] xen: registering gsi 38 triggering 0 polarity 1
[811178.748889] Already setup the GSI :38

Investigating the problem led me to some old threads on Launchpad and Ask Ubuntu, but I am not sure they cover our exact scenario.

What is done so far:

  • We added GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off" on node 1 and 3
  • Simultaneously moved all working VMs away from node 1 (pcie_aspm=off)
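For completeness, applying the parameter was the standard Debian GRUB edit. A sketch against a throwaway copy of /etc/default/grub (the sed pattern assumes the stock "quiet"-only default line, as on our Debian 9 installs):

```shell
# Throwaway copy of /etc/default/grub for illustration.
cat > /tmp/grub_default <<'EOF'
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
EOF
# Append pcie_aspm=off to the default kernel command line.
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="quiet"$/GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"/' /tmp/grub_default
grep GRUB_CMDLINE_LINUX_DEFAULT /tmp/grub_default
# On the real node: edit /etc/default/grub the same way, then run
# update-grub and reboot for the parameter to take effect.
```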

So far we have had a working environment for 2 days straight without any hangs or missing NVMes, but we are unable to figure out where the problem lies.
Bear in mind that node 2 does not have pcie_aspm set to off, is currently under average load, and is running more VMs than usual (since node 1's VMs were divided among nodes 2 and 3).
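To double-check which nodes actually have ASPM disabled, we look at the booted kernel's command line. Sketch below simulates /proc/cmdline with a sample file whose contents are illustrative:

```shell
# Simulated /proc/cmdline (contents illustrative; kernel version matches node 1).
cat > /tmp/cmdline_sample <<'EOF'
BOOT_IMAGE=/boot/vmlinuz-4.9.0-11-amd64 root=/dev/mapper/vg0-root ro quiet pcie_aspm=off
EOF
# Confirm the parameter is present on the running kernel.
grep -o 'pcie_aspm=off' /tmp/cmdline_sample
# On the real machines: cat /proc/cmdline, and lspci -vv | grep -i aspm
# to see the per-device ASPM state.
```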

I have a strange feeling that completely unloading node 1 is somehow connected to our temporary success, but I cannot find an actual reason for it. At this point we've been banging our heads against this for several days and are kind of stuck.

We are looking for some help or opinions. This thing has become ridiculous. Let me know if you need more information.

Your help is appreciated! Thanks!

asked on Server Fault Mar 10, 2020 by Stan

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0