CentOS Raid 1 drive fails on raid-check

1

I have a pair of HP DL320e servers configured identically with two WD Red 6TB drives in a software raid 1 array. The DL320e has an on board raid controller which is disabled in favour of linux software raid.

Both machines seem to work fine and the raid arrays look normal, except that every time raid-check runs (1am Sunday from the default weekly crontab, but also if I run raid-check manually) one drive goes off line. After this the "failed" drive's device files have been removed (e.g. /dev/sda2), but after a cold reboot they reappear and the "failed" drive can be added back into the array and seems to work normally.

This has been going on since the (brand new) machines and disks were installed a couple of months ago. According to smartctl none of the drives have any bad sectors swapped out so on the basis of a couple of posts elsewhere I have tried using hdparm to write over the sectors identified in /var/log/messages in order to force the drive to detect and swap out bad sectors with no effect.

I have also tried writing zeros over the whole of /dev/sdb2 and /dev/sdb3 using dd. This completed without causing any errors but did not cause any bad sectors to be swapped out, but does seem to indicate that the whole of the drive surface can be successfully written to.

I have run all the smart diagnostics using smartctl and all complete OK.

Since the machines were all installed from new and the failure has occurred on both systems and at least 3 of the 4 drives have "failed" (both drives on one machine have failed at different times) I am unwilling to believe that this is caused by bad hardware. The fact that a dd of /dev/zero to the whole of one failed drive completed proves that the drive is writable over the whole surface.

The drives are configured with 3 partitions, biosboot, /boot and root + /home.

The logs from both servers are more or less identical although they report different sector numbers, and the sector numbers reported from week to week on the same server also differ.

/proc/mdstat reports

sh-4.2# cat /proc/mdstat
Personalities : [raid1] 
md126 : active raid1 sda3[0] sdb3[1]
      5859876672 blocks super 1.2 [2/2] [UU]
      bitmap: 2/44 pages [8KB], 65536KB chunk

md127 : active raid1 sda2[2] sdb2[3]
      511936 blocks super 1.0 [2/2] [UU]

unused devices: <none>
sh-4.2# 

Time passes until 1am Sunday and then:

WARNING: Your hard drive is failing
Device: /dev/sda [SAT], unable to open device

sh-4.2# cat /proc/mdstat
Personalities : [raid1] 
md126 : active raid1 sda3[0](F) sdb3[1]
      5859876672 blocks super 1.2 [2/1] [_U]
      bitmap: 5/44 pages [20KB], 65536KB chunk

md127 : active raid1 sda2[2](F) sdb2[3]
      511936 blocks super 1.0 [2/1] [_U]

unused devices: <none>
sh-4.2#

/var/log/messages reports

Jun  7 01:00:01 1000 kernel: md: data-check of RAID array md126
Jun  7 01:00:01 1000 kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Jun  7 01:00:01 1000 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jun  7 01:00:01 1000 kernel: md: using 128k window, over a total of 5859876672k.
Jun  7 01:00:07 1000 kernel: md: delaying data-check of md127 until md126 has finished (they share one or more physical units)
Jun  7 01:01:01 1000 systemd: Starting Session 1544 of user root.
Jun  7 01:01:01 1000 systemd: Started Session 1544 of user root.
Jun  7 01:03:43 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
Jun  7 01:03:43 1000 kernel: ata1: SError: { CommWake }
Jun  7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun  7 01:03:43 1000 kernel: ata1.00: cmd 60/80:00:80:1b:70/00:00:03:00:00/40 tag 0 ncq 65536 in
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
Jun  7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun  7 01:03:43 1000 kernel: ata1.00: cmd 60/80:08:00:1c:70/00:00:03:00:00/40 tag 1 ncq 65536 in
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
Jun  7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun  7 01:03:43 1000 kernel: ata1.00: cmd 60/80:10:00:0d:70/00:00:03:00:00/40 tag 2 ncq 65536 in
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

repeated with increasing values of tag up to 30 followed by

Jun  7 01:07:10 1000 kernel: ata1.00: cmd 60/80:f0:00:cd:7f/00:00:06:00:00/40 tag 30 ncq 65536 in
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  7 01:07:10 1000 kernel: ata1.00: status: { DRDY }
Jun  7 01:07:10 1000 kernel: ata1: hard resetting link
Jun  7 01:07:11 1000 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun  7 01:07:11 1000 kernel: ata1.00: configured for UDMA/133
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:07:11 1000 kernel: ata1: EH complete
Jun  7 01:09:53 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
Jun  7 01:09:53 1000 kernel: ata1: SError: { CommWake }
Jun  7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun  7 01:09:53 1000 kernel: ata1.00: cmd 60/80:00:80:f6:dd/00:00:08:00:00/40 tag 0 ncq 65536 in
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
Jun  7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun  7 01:09:53 1000 kernel: ata1.00: cmd 60/80:08:00:f7:dd/00:00:08:00:00/40 tag 1 ncq 65536 in
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  7 01:09:53 1000 kernel: ata1.00: status: { DRDY }

More repeats up to tag 30 then

Jun  7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun  7 01:09:53 1000 kernel: ata1.00: cmd 60/80:f0:00:f6:dd/00:00:08:00:00/40 tag 30 ncq 65536 in
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun  7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
Jun  7 01:09:53 1000 kernel: ata1: hard resetting link
Jun  7 01:09:59 1000 kernel: ata1: link is slow to respond, please be patient (ready=0)
Jun  7 01:10:01 1000 systemd: Starting Session 1545 of user root.
Jun  7 01:10:01 1000 systemd: Started Session 1545 of user root.
Jun  7 01:10:03 1000 kernel: ata1: COMRESET failed (errno=-16)
Jun  7 01:10:03 1000 kernel: ata1: hard resetting link
Jun  7 01:10:04 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jun  7 01:10:09 1000 kernel: ata1: hard resetting link
Jun  7 01:10:09 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jun  7 01:10:09 1000 kernel: ata1: limiting SATA link speed to 1.5 Gbps
Jun  7 01:10:14 1000 kernel: ata1: hard resetting link
Jun  7 01:10:14 1000 kernel: ata1: SATA link down (SStatus 0 SControl 310)
Jun  7 01:10:14 1000 kernel: ata1.00: disabled
Jun  7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0

One more block of

Jun  7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun  7 01:10:14 1000 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun  7 01:10:14 1000 kernel: Sense Key : Aborted Command [current] [descriptor]
Jun  7 01:10:14 1000 kernel: Descriptor sense data with sense descriptors (in hex):
Jun  7 01:10:14 1000 kernel:        72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun  7 01:10:14 1000 kernel:        00 00 00 00
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun  7 01:10:14 1000 kernel: Add. Sense: No additional sense information
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] CDB:
Jun  7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd f6 80 00 00 00 80 00 00
Jun  7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148764288
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: rejecting I/O to offline device
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] killing request
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]

and finally

Jun  7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd fb 00 00 00 00 80 00 00
Jun  7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148765440
Jun  7 01:10:14 1000 kernel: ata1: EH complete
Jun  7 01:10:14 1000 kernel: md: super_written gets error=-5, uptodate=0
Jun  7 01:10:14 1000 kernel: md/raid1:md126: Disk failure on sda3, disabling device.
md/raid1:md126: Operation continuing on 1 devices.
Jun  7 01:10:14 1000 kernel: ata1.00: detaching (SCSI 0:0:0:0)
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] Stopping disk
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] START_STOP FAILED
Jun  7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun  7 01:10:14 1000 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun  7 01:10:14 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun  7 01:10:14 1000 kernel: md: md126: data-check interrupted.
Jun  7 01:10:14 1000 kernel: md: super_written gets error=-19, uptodate=0
Jun  7 01:10:14 1000 kernel: md/raid1:md127: Disk failure on sda2, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
Jun  7 01:10:15 1000 kernel: md: md127 still in use.
Jun  7 01:10:15 1000 kernel: md: md126 still in use.
Jun  7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun  7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun  7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun  7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun  7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun  7 01:20:01 1000 systemd: Created slice user-0.slice.
Jun  7 01:20:01 1000 systemd: Starting Session 1546 of user root.
Jun  7 01:20:01 1000 systemd: Started Session 1546 of user root.
Jun  7 01:30:01 1000 systemd: Created slice user-0.slice.
Jun  7 01:30:01 1000 systemd: Starting Session 1547 of user root.
Jun  7 01:30:01 1000 systemd: Started Session 1547 of user root.
Jun  7 01:36:58 1000 smartd[977]: Device: /dev/sda [SAT], open() failed: No such device
Jun  7 01:36:58 1000 smartd[977]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
Jun  7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root produced unexpected output (80 bytes) to STDOUT/STDERR:
Jun  7 01:36:58 1000 smartd[977]: /usr/libexec/smartmontools/smartdnotify: line 13: /dev/pts/0: Permission denied
Jun  7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful

If anyone can suggest what might be going wrong for me here I would be very grateful.

raid1
centos7
asked on Server Fault Jun 8, 2015 by Bruce Skingle

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0