First, here's the abridged version of my question. I have a blinking red light on a drive in a RAID array and although MegaCli doesn't report any disk failures or warnings, some MegaCli commands show 24 disks while others show only 23. I also see the following error recurring on a daily basis:
Event Description: Controller encountered a fatal error and was reset
Are these things related? Is there a problem here?
Now here's the longer version. I inherited responsibility for a server (let's call it my_server) which is being hosted at a data center and which I believe has an LSI MegaRAID SAS 9265-8i with a RAID 50/RAID 5+0 configuration. I received an email from the data center that a red light is blinking on one of the hard disks of this server. Unfortunately, I know next to nothing about RAID arrays, so I have to feel my way through the MegaRAID SAS Software User's Guide and various online tutorials.
I ssh'ed into the server to try to diagnose the problem. What follows is an example shell session that demonstrates my efforts and gives some relevant information about the system in question.
First I check some basic system information:
$ cat /etc/issue
CentOS release 6.4 (Final)
Kernel \r on an \m
$ uname -a
Linux my_server 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Next I verify the RAID array and MegaCli version:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -aALL | grep "Product Name"
Product Name : LSI MegaRAID SAS 9265-8i
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 | grep 'RAID Level'
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -v
MegaCLI SAS RAID Management Tool Ver 8.04.07 May 28, 2012
(c)Copyright 2011, LSI Corporation, All Rights Reserved.
Exit Code: 0x00
Now some summary information about the drives in the array:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -A8 "Device Present"
Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 27
Disks : 24
Critical Disks : 0
Failed Disks : 0
Here it looks like everything is fine. Then I check for S.M.A.R.T. alerts:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep 'S.M.A.R.T.'
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
[...]
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
There are no S.M.A.R.T. alerts, so after reading a couple of tutorials I run a few other commands:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -lall -a0 | grep Drives
Number Of Drives : 23
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL | grep -Pi 'SPAN|Span\ Ref|Number\ of'
Number of DISK GROUPS: 1
Number of Spans: 1
SPAN: 0
Span Reference: 0x00
Number of PDs: 23
Number of VDs: 1
Number of dedicated Hotspares: 0
Number Of Drives : 23
Span Depth : 1
Drive's postion: DiskGroup: 0, Span: 0, Arm: 0
Drive's postion: DiskGroup: 0, Span: 0, Arm: 1
Drive's postion: DiskGroup: 0, Span: 0, Arm: 2
Drive's postion: DiskGroup: 0, Span: 0, Arm: 3
[...]
Drive's postion: DiskGroup: 0, Span: 0, Arm: 20
Drive's postion: DiskGroup: 0, Span: 0, Arm: 21
Drive's postion: DiskGroup: 0, Span: 0, Arm: 22
Now I'm a little confused, because some commands (e.g. adpallinfo and pdlist) show 24 disks present and others (e.g. ldinfo and CfgDsply) only show 23.
Finally I generate an event logfile and look for signs of trouble:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpeventlog -getevents -f lsi-events.log -a0 -nolog
$ cat lsi-events.log | grep -P -i 'fail|error|warn'
[...]
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
$ cat lsi-events.log | grep -B6 -A3 -P -i 'fail|error|warn'
[...]
seqNum: 0x000f8644
Time: Sun Feb 26 07:32:16 2017
Code: 0x00000159
Class: 2
Locale: 0x20
Event Description: Controller encountered a fatal error and was reset
Event Data:
===========
None
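To check whether these resets really do recur daily, they could be tallied per calendar day. The following is only a rough sketch: the function name fatal_resets_per_day is made up for illustration, and it assumes every event in the dump has the seqNum / Time / Event Description layout shown above.

```shell
# Hypothetical helper: count "fatal error" controller resets per calendar day.
# Reads an event log dump (as written by -adpeventlog -getevents) on stdin.
fatal_resets_per_day() {
  awk '
    /^seqNum:/ { day = "unknown" }                  # new event: forget the previous date
    /^Time:/   { day = $3 " " $4 " " $6 }           # "Time: Sun Feb 26 07:32:16 2017" -> "Feb 26 2017"
    /^Event Description: Controller encountered a fatal error/ { n[day]++ }
    END { for (d in n) print n[d], d }              # one "count date" line per day, unordered
  '
}
```

It would be run as `fatal_resets_per_day < lsi-events.log`. Events that carry no wall-clock timestamp end up under "unknown".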
I also look for messages specifically related to slot 23:
$ cat lsi-events.log | grep -P -i 's23' | tail -30
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Global Hot Spare PD 1f(e0x21/s23) (global,rev) disabled
Event Description: State change on PD 1f(e0x21/s23) from HOT SPARE(2) to UNCONFIGURED_GOOD(0)
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: State change on PD 1f(e0x21/s23) from UNCONFIGURED_GOOD(0) to HOT SPARE(2)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
I contacted the data center and was informed that the blinking light was occurring on drive 10, so I looked at that drive:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv [33:10] -a0
Enclosure Device ID: 33
Slot Number: 10
Drive's postion: DiskGroup: 0, Span: 0, Arm: 10
Enclosure position: 1
Device Id: 18
WWN: 5000C500344D5940
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Emulated Drive: No
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: 0006
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500344d5941
SAS Address(1): 0x5000c500344d5942
Connected Port Number: 0(path0) 1(path1)
Inquiry Data: SEAGATE ST32000444SS 00069WM6369D
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :26C (78.80 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
Exit Code: 0x00
I also tried using smartctl:
$ sudo smartctl -a -d megaraid,18 /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.11.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: SEAGATE
Product: ST32000444SS
Revision: 0006
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Logical block size: 512 bytes
Logical Unit id: 0x5000c500344d5943
Serial number: 9WM6369D0000914458SC
Device type: disk
Transport protocol: SAS
Local Time is: Tue Feb 28 17:18:33 2017 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature: 26 C
Drive Trip Temperature: 68 C
Manufactured in week 21 of year 2011
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 41
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 41
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 3508224337
Blocks received from initiator = 38846232
Blocks read from cache and sent to initiator = 44013719
Number of read and write commands whose size <= segment size = 2649500
Number of read and write commands whose size > segment size = 4
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 45862.30
number of minutes until next internal SMART test = 46
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 22540834 0 0 22540834 22540834 230.346 0
write: 0 0 0 0 0 20.012 0
verify: 161330204 1 0 161330205 161330205 1896.577 0
Non-medium error count: 0
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged
Long (extended) Self Test duration: 18500 seconds [308.3 minutes]
The discrepancy you see between the logical drive view and the physical device view arises because the drive in slot 23 is configured as a global hot spare. It is not assigned to any logical drive, and it can step in as a spare for any LD that goes into a degraded state. So you have 24 physical drives: 23 assigned to LD 0 plus one global hot spare.
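One way to see that split directly is to tally the drives by firmware state. This is a minimal sketch rather than a vetted tool: the function name tally_pd_states is invented here, and it assumes each drive block in the -PDList output carries a "Firmware state: ..." line like the one in your -PDInfo output.

```shell
# Hypothetical helper: count physical drives per firmware state.
# Reads `MegaCli64 -PDList` output on stdin.
tally_pd_states() {
  awk -F': *' '
    /^Firmware state:/ { count[$2]++ }                       # $2 is everything after "Firmware state: "
    END { for (s in count) printf "%s: %d\n", s, count[s] }
  ' | sort                                                    # stable, alphabetical order
}
```

Used as `sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | tally_pd_states`, it should list 23 drives in the online state and one in whatever state the controller reports for a hot spare.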
Regarding the blinking red light on a drive, you should check with the DC which slot it is in and then look at that drive's state with MegaCli -PDInfo -PhysDrv [E:S] -a0, where E is the enclosure number and S is the slot number. A blinking red light is usually a sign of a PFA/SMART imminent-failure warning, although your mileage may vary.
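The fields worth reading first in that output are the error counters, the firmware state and the S.M.A.R.T. flag. A rough sketch for pulling them out follows; the function name pd_health and the rule for what counts as "bad" (any non-zero counter, or an alert value other than No) are assumptions of this example, not anything MegaCli defines.

```shell
# Hypothetical helper: print the health-related fields of one drive.
# Reads `MegaCli64 -PDInfo` output on stdin; exits non-zero if anything looks off.
pd_health() {
  awk -F' *: *' '
    $1 == "Media Error Count" ||
    $1 == "Other Error Count" ||
    $1 == "Predictive Failure Count" {
      print $1 "=" $2
      if ($2 + 0 != 0) bad = 1          # any non-zero counter is flagged
    }
    $1 == "Firmware state" { print $1 "=" $2 }
    $1 == "Drive has flagged a S.M.A.R.T alert" {
      print $1 "=" $2
      if ($2 != "No") bad = 1           # anything but "No" is flagged
    }
    END { exit bad + 0 }
  '
}
```

For example: `sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv [33:10] -a0 | pd_health || echo "drive needs a closer look"`.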
On a side note, using grep to examine the verbatim human-readable output of commands such as MegaCli is a habit that will eventually lead to trouble.
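To illustrate: in grep 'S.M.A.R.T.' every dot is a regex wildcard, and grep Drives matches any line that merely contains that word. A somewhat safer approach is to compare the whole field name literally. The sketch below does that; megacli_field is a made-up name, and the only assumption about the output is that each line looks like "name : value".

```shell
# Hypothetical helper: print the value(s) of one exactly-named field.
# $1 is the literal field name, e.g. "Number Of Drives"; input comes from stdin.
megacli_field() {
  awk -v name="$1" '
    {
      sep = index($0, ":")                    # split at the first colon only
      if (sep == 0) next                      # not a "name : value" line
      key = substr($0, 1, sep - 1)
      val = substr($0, sep + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)        # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key == name) print val              # exact, case-sensitive match
    }
  '
}
```

For instance, `sudo /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -lall -a0 | megacli_field "Number Of Drives"` prints only the values of that one field. It is still parsing text meant for humans, so it can break if a firmware or MegaCli update changes the wording.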