I have two front-end Exchange 2003 servers. Both are VMs, each running on a different physical Linux box with VMware Server in my DMZ.
A few days ago, all my Nagios alerts timed out for one of them, and the ping checks had a high error rate. On this front-end Exchange VM, I saw the following in the System section of the Event Viewer, which seems to indicate disk timeouts or problems during this time (the other sections don't go very far back because of spam notices; I'm going to have to fix that):
Event Type: Error
Event Source: vmscsi
Event Category: None
Event ID: 9
Date: 12/12/2009
Time: 9:25:19 AM
User: N/A
Computer: FOO
Description:
The device, \Device\Scsi\vmscsi1, did not respond within the timeout period.
On the Linux host, I don't see anything in /var/log/messages or /var/log/vmware (or anywhere else, really) that gives me any hints. In the sar log, I do see a higher IOWait (~22) at this time than I have seen anywhere else; normally it only spikes to around 11 when the backups run, which they were not at the time. Might this happen from a disk falling out of the array? Does anyone know how I can check that on a PowerEdge 2950 (using DSET?)?
On the other front-end VM, I got the following (I don't really know what this one means. Master browser?):
Event Type: Error
Event Source: MRxSmb
Event Category: None
Event ID: 8003
Date: 12/12/2009
Time: 9:33:16 AM
User: N/A
Computer: FOO
Description:
The master browser has received a server announcement from the computer FOO02 that believes that it is the master browser for the domain on transport NetBT_Tcpip_{..... The master browser is stopping or an election is being forced.
So besides the above questions, I am really trying to figure out what happened, since everything seems to have recovered on its own. Any ideas?
Update:
I found the MegaCli utility, which is new to me. Starting the day after this event, I see a lot of:
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 03(e1/s3), CDB: 28 00 0a 8c 60 5d 00 00 08 00, Sense: f0 00 03 0a 8c 60 5d 0a 00 00 00 00 11 00 00 00 00 0
Event Data:
===========
Device ID: 3
Enclosure Index: 1
Slot Number: 3
The above is from:
/opt/MegaRAID/MegaCli/MegaCli -AdpEventLog -GetEvents -f events.log -aALL && cat events.log
This doesn't sound good. Does anyone know what this specifically means?
The master browser event is normal and unrelated; it can be ignored.
The RAID log is cryptic, but since it lists slot 3, I would assume the controller doesn't like something about the drive in slot 3. There should be documentation for that event somewhere on Dell's or LSI's site.
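The CDB and sense bytes in that event can be decoded by hand using the standard SCSI layouts. Below is a minimal sketch in Python, assuming the CDB is a READ(10) command (opcode 0x28) and the sense data is in fixed format (response code 0x70/0x71). The function names and the small lookup tables are my own and only cover the values needed here, and the last sense byte in the log is truncated, so it is parsed as-is.

```python
# Sketch: decode the CDB and sense bytes from the MegaCli event above.
# Assumes READ(10) CDB layout and fixed-format sense data (SCSI SPC/SBC).

def parse_hex(text):
    """Turn a space-separated hex dump into a list of byte values."""
    return [int(tok, 16) for tok in text.split()]

# Partial tables; only a few well-known values are listed.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}

def decode_read10(cdb):
    """Return (lba, block_count) from a 10-byte READ(10) CDB."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: LBA
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: length
    return lba, blocks

def decode_fixed_sense(sense):
    """Pull the interesting fields out of fixed-format sense data."""
    if (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "valid": bool(sense[0] & 0x80),               # information field valid
        "sense_key": sense[2] & 0x0F,
        "information": int.from_bytes(bytes(sense[3:7]), "big"),
        "asc": sense[12],
        "ascq": sense[13],
    }

cdb = parse_hex("28 00 0a 8c 60 5d 00 00 08 00")
sense = parse_hex("f0 00 03 0a 8c 60 5d 0a 00 00 00 00 11 00 00 00 00 0")

lba, blocks = decode_read10(cdb)
fields = decode_fixed_sense(sense)
print("READ(10): LBA 0x%08x, %d blocks" % (lba, blocks))
print("Sense key:", SENSE_KEYS.get(fields["sense_key"], "unknown"))
print("ASC/ASCQ:", ASC_ASCQ.get((fields["asc"], fields["ascq"]), "not in table"))
print("Failing LBA reported: 0x%08x" % fields["information"])
```

Read this way, the event is a read of 8 blocks starting at LBA 0x0a8c605d on the drive in slot 3 that came back with sense key 3h (medium error) and ASC/ASCQ 11h/00h (unrecovered read error), with the information field pointing at the same LBA. That is consistent with a bad sector on that physical drive, though the vendor documentation is the authority on how the controller treats it.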
You can test the array by running a verification of it. I'm not sure whether you can do that from that utility in the OS, but it can be run from the RAID setup utility that is accessed on boot.
If you have a spare slot and drive available, you can put in a new drive, make it a global hot spare, pull the drive in slot 3, and let everything fail over to the spare. You can then test or replace the slot 3 drive without time pressure.