What causes SCSI command timeout over 30 seconds?

System description:
CentOS 7.
SuperMicro X9DRW-iF board.
Latest BIOS and latest SCU option ROM; the SCU option ROM is in use.
Four 300GB WD HDDs connected to the board's SCU.
Six HDDs connected to the six individual SATA ports.
Our normal business software running on top.

I set up this system, and twice in two weeks I found that all four drives connected to the SCU had gone "missing": running ls -l on their mount points showed

ls: cannot access disk1: Input/output error
ls: cannot access disk2: Input/output error
ls: cannot access disk3: Input/output error
ls: cannot access disk4: Input/output error
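
For what it's worth, here's the quick check I now run after an event to confirm that the kernel has taken the devices offline. This is just my own sysfs poke, on the assumption that any state other than "running" means the error handler has given up on the device:

    # Show the SCSI state of each disk; healthy devices report "running",
    # drives abandoned by the error handler report "offline".
    for dev in /sys/block/sd*/device/state; do
        printf '%s: %s\n' "$dev" "$(cat "$dev")"
    done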

I saw from dmesg that isci was abandoning commands, attempting and failing to reset devices, etc. (dmesg output is at the bottom.)

From Red Hat's "Controlling the SCSI Command Timer and Device Status" page, I believe this is the "expected" error-handling escalation when a SCSI command times out and the device can't be reset. I'm not sure whether it's the bus or the host that ends up being reset, but whatever it is, it takes out all four drives on the SCU.
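
To gather some data before the next failure, I've started sampling the per-device SCSI counters from sysfs. My assumption is that ioerr_cnt creeping up on the SCU drives between failures would point at the link or controller rather than the drives themselves (the values are reported in hex):

    # Sample the per-device SCSI command counters (hex values).
    # ioerr_cnt counts commands that completed with an error;
    # iodone_cnt counts completed commands overall.
    for d in sda sdb sdc sdd; do
        printf '%s: ioerr=%s iodone=%s\n' "$d" \
            "$(cat /sys/block/$d/device/ioerr_cnt)" \
            "$(cat /sys/block/$d/device/iodone_cnt)"
    done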

I read a Server Fault question that I think identifies this problem as a "loop-reset", but I haven't been able to find more information about loop resets.

I verified that the SCSI command timeout for those four drives is 30 seconds, the RHEL/CentOS 7 default (cat /sys/block/sda/device/timeout). Lowering the timeout to 1 second or 0 seconds quickly gets a single drive disabled, but I have rarely, if ever, been able to make one drive's failure take down all four drives on the SCU.
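
For the record, this is how I checked, and how I've been adjusting the timeout at runtime for those experiments (sysfs writes like this don't survive a reboot or device re-discovery):

    # Read the current SCSI command timeout (in seconds) for the SCU drives.
    cat /sys/block/sd{a,b,c,d}/device/timeout

    # Raise it to 60 s at runtime as an experiment; 60 is an arbitrary
    # value I picked, not a recommendation from any vendor doc.
    for d in sda sdb sdc sdd; do
        echo 60 > /sys/block/$d/device/timeout
    done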

So here are my questions. The end goal is to configure the drives so that this loop-reset / "drop all SCU drives at once" failure stops happening.

  1. What causes a SCSI command to take more than 30 seconds? That seems awfully long.
  2. What would cause all four drives on the SCU to be affected? Per the Red Hat article, this implies that the device reset failed and the error handler escalated to a bus/host reset. What would cause the device reset to fail?
  3. What kinds of things might make this happen rarely, on the order of once per week, like I've observed?
  4. Are SCSI command timeouts acceptable, or are they to be avoided by increasing the timeout time? Of maximum importance to me is keeping the drives online; waiting a bit for a response from the drive isn't a huge problem.
  5. Have I correctly diagnosed this as an "expected" response to a SCSI command timeout? (i.e. Is this a SCSI command timeout?)
  6. What would make the drives on the SCU more/less responsive than the drives connected to the individual SATA ports? I haven't had any problems with the SATA-port drives.
  7. Does command queueing have anything to do with this? The queue depth is 31 for all four drives.
  8. What are recommended values for SCSI command timeouts for drives on an SCU? (A sketch for setting these persistently follows this list.)
  9. At the end of the day, how can I reconfigure to avoid this issue?
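
Regarding questions 4, 7, and 8: if raising the timeout or throttling NCQ turns out to help, my understanding is that a udev rule is the usual way to make such settings persist on RHEL/CentOS 7. Here's a sketch of what I have in mind; the filename, the 60-second value, and the queue-depth experiment are all my own guesses, not recommendations I've found anywhere:

    # /etc/udev/rules.d/99-scu-timeout.rules  (hypothetical filename)
    # Set the SCSI command timer to 60 s for every direct-access (type 0)
    # SCSI device as it is added.
    ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="60"

And for question 7, lowering the queue depth at runtime would show whether NCQ is implicated:

    # Reduce the NCQ depth from 31 to 1 (effectively unqueued) as a test.
    echo 1 > /sys/block/sda/device/queue_depth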

Thanks for any and all help. Dmesg output (edited for brevity) from the first failure is copied below. Note that the four SCU drives are sda/sdb/sdc/sdd, which correspond to ata7/ata8/ata9/ata10.

17:40:41: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964490 (STP/SATA), task = ffff881fcfc95900, old_request == ffff881fcea13000
17:40:51: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single
17:41:01: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490
17:41:01: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF
17:41:01: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964490, task = ffff881fcfc95900 , old_request == ffff881fcea13000
17:41:01: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964310 (STP/SATA), task = ffff881fd0ce97c0, old_request == ffff881fcf125000
17:41:11: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single
17:41:21: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310
17:41:21: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF
17:41:21: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964310, task = ffff881fd0ce97c0 , old_request == ffff881fcf125000
17:43:50: INFO: task app:20227 blocked for more than 120 seconds.
17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:43:50: app           D ffff881fffcd3680     0 20227  13315 0x00000080
17:43:50: ffff883fcb15f8e0 0000000000000082 ffff883fcb15ffd8 0000000000013680
17:43:50: ffff883fcb15ffd8 0000000000013680 ffff881fd1434fa0 ffff883fc03d13b0
17:43:50: 7fffffffffffffff ffff881fd1434fa0 0000000000000002 0000000000000000
17:43:50: Call Trace:
17:43:50: [<ffffffff81609e39>] schedule+0x29/0x70
17:43:50: [<ffffffff81607d79>] schedule_timeout+0x209/0x2d0
17:43:50: [<ffffffff81609312>] __down_common+0xd2/0x14a
17:43:50: [<ffffffffa02292dd>] ? _xfs_buf_find+0x16d/0x2c0 [xfs]
17:43:50: [<ffffffff816093a7>] __down+0x1d/0x1f
17:43:50: [<ffffffff8109d311>] down+0x41/0x50
17:43:50: [<ffffffffa02290dc>] xfs_buf_lock+0x3c/0xd0 [xfs]
17:43:50: [<ffffffffa02292dd>] _xfs_buf_find+0x16d/0x2c0 [xfs]
17:43:50: [<ffffffffa022945a>] xfs_buf_get_map+0x2a/0x180 [xfs]
17:43:50: [<ffffffffa0229eec>] xfs_buf_read_map+0x2c/0x140 [xfs]
17:43:50: [<ffffffffa028d809>] xfs_trans_read_buf_map+0x2d9/0x4a0 [xfs]
17:43:50: [<ffffffffa02723cd>] xfs_read_agi+0x9d/0x110 [xfs]
17:43:50: [<ffffffffa0272474>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
17:43:50: [<ffffffffa0273138>] xfs_dialloc+0xe8/0x270 [xfs]
17:43:50: [<ffffffffa02757c1>] xfs_ialloc+0x71/0x6a0 [xfs]
17:43:50: [<ffffffffa0242b77>] ? kmem_zone_alloc+0x77/0x100 [xfs]
17:43:50: [<ffffffffa0275e6a>] xfs_dir_ialloc+0x7a/0x280 [xfs]
17:43:50: [<ffffffff81609222>] ? down_write+0x12/0x30
17:43:50: [<ffffffffa027656a>] xfs_create+0x48a/0x680 [xfs]
17:43:50: [<ffffffffa0237abb>] xfs_vn_mknod+0xbb/0x1e0 [xfs]
17:43:50: [<ffffffffa0237bf6>] xfs_vn_mkdir+0x16/0x20 [xfs]
17:43:50: [<ffffffff811d26c7>] vfs_mkdir+0xb7/0x160
17:43:50: [<ffffffff811d853f>] SyS_mkdirat+0x6f/0xe0
17:43:50: [<ffffffff811d85c9>] SyS_mkdir+0x19/0x20
17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b
17:43:50: INFO: task app:20228 blocked for more than 120 seconds.
17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:43:50: app           D ffff881fffc93680     0 20228  13315 0x00000080
17:43:50: ffff883fcef7b9e8 0000000000000082 ffff883fcef7bfd8 0000000000013680
17:43:50: ffff883fcef7bfd8 0000000000013680 ffff883fcc1d0b60 ffff881fffc93f48
17:43:50: ffff881fd07ea800 ffff883fcc1d0b60 0000000000000000 0000000000000000
17:43:50: Call Trace:
17:43:50: [<ffffffff8160a13d>] io_schedule+0x9d/0x140
17:43:50: [<ffffffff81204593>] do_blockdev_direct_IO+0xc03/0x2620
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffff81206005>] __blockdev_direct_IO+0x55/0x60
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs]
17:43:50: [<ffffffffa0221b2a>] xfs_vm_direct_IO+0xda/0x180 [xfs]
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs]
17:43:50: [<ffffffff8115872d>] generic_file_direct_write+0xcd/0x190
17:43:50: [<ffffffffa0299636>] xfs_file_dio_aio_write+0x215/0x254 [xfs]
17:43:50: [<ffffffffa022ee0d>] xfs_file_aio_write+0x13d/0x150 [xfs]
17:43:50: [<ffffffff811c64cd>] do_sync_write+0x8d/0xd0
17:43:50: [<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0
17:43:50: [<ffffffff811c76b8>] SyS_write+0x58/0xb0
17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b

... <Many more sequences like the above.>...

17:54:21: ata7.00: exception Emask 0x0 SAct 0x7e0000ff SErr 0x0 action 0x6 frozen
17:54:21: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
17:54:21: ata8.00: failed command: FLUSH CACHE EXT
17:54:21: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 30
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata8.00: status: { DRDY }
17:54:21: ata8: hard resetting link
17:54:21: ata9.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen
17:54:21: ata10.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out
     res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out
     res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:22: ata7.00: status: { DRDY }
17:54:22: ata7: hard resetting link
17:54:24: perf interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:32: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962ab8; hard reset failed (0x21)
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962bb0; hard reset failed (0x21)
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9629c0; hard reset failed (0x21)
17:54:57: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9628c8; hard reset failed (0x21)
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce9643d0
17:55:06: sas: ata8: end_device-0:1: reset failed (errno=-16)
17:55:06: sas: ata10: end_device-0:3: reset failed (errno=-16)
17:55:06: ata10: hard resetting link
17:55:06: ata8: hard resetting link
17:55:06: sas: ata9: end_device-0:2: reset failed (errno=-16)
17:55:06: ata9: hard resetting link
17:55:07: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250
17:55:07: sas: ata7: end_device-0:0: reset failed (errno=-16)
17:55:07: ata7: hard resetting link

...<More of the above.>...


17:56:22: sas: ata10: end_device-0:3: reset failed (errno=-16)
17:56:22: ata10: reset failed, giving up
17:56:22: sas: ata8: end_device-0:1: reset failed (errno=-16)
17:56:22: ata8: reset failed, giving up
17:56:22: ata8.00: disabled
17:56:22: ata8.00: device reported invalid CHS sector 0
17:56:22: ata8: EH complete
17:56:22: ata10.00: disabled
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10: EH complete
17:56:22: sas: ata9: end_device-0:2: reset failed (errno=-16)
17:56:22: ata9: reset failed, giving up
17:56:22: ata9.00: disabled
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9: EH complete
17:56:22: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250
17:56:22: sas: ata7: end_device-0:0: reset failed (errno=-16)
17:56:22: ata7: reset failed, giving up
17:56:22: ata7.00: disabled
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7: EH complete
17:56:22: sd 0:0:0:0: [sda]
17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
17:56:22: sd 0:0:0:0: [sda] CDB:
17:56:22: Write(10): 2a 00 08 70 08 01 00 00 02 00
17:56:22: end_request: I/O error, dev sda, sector 141559809
17:56:22: sd 0:0:0:0: [sda]
17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
17:56:22: XFS (sda1): metadata I/O error: block 0x8700001 ("xfs_buf_iodone_callbacks") error 5 numblks 1
17:56:22: XFS (sdc1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sdd1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sda1): metadata I/O error: block 0x8700002 ("xfs_buf_iodone_callbacks") error 5 numblks 1
17:56:22: XFS (sdc1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sdd1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: Buffer I/O error on device sdb1, logical block 35424025
17:56:22: lost page write due to I/O error on sdb1