SBS 2011 Crashing, unknown cause

0

I have a SBS 2011 server that started crashing a few days ago. This issue occurred on a Sunday night at 11:55pm of a holiday long weekend. There would have been almost no one accessing the server at that time. I have not made any changes to the server in weeks. The last changes were regular updates from MS which did not cause any issues.

When it crashed, it got stuck in a boot loop where it would blue-screen at "Applying computer settings". No error code is shown when the system blue-screens, at least none that I can capture. During the reboot I connect through the Dell DRAC, as the server is located in a facility I do not have physical access to. I see only the very start of the error message, but either there is no further information or it is cut off from my console session.

I was able to get into safe mode and safe mode with networking with no issues.

I was able to boot back into Windows normally once, but I think I just got lucky, as the next 2 boots also blue-screened at "Applying computer settings".

I logged a ticket with MS and we have been working on the issue for 2 days with no success. I am reaching out here hoping someone has ideas.

I got back into Windows normally using Last Known Good Configuration, but after a couple more reboots the issue came back. Nothing unusual appears in the System or Application event logs before the system crashes, only informational events.

We discovered a strange issue where the network logon service was not starting (this had never happened before). MS determined that the hostname of the computer had somehow been changed in a couple of places in the registry. We disabled the Exchange services, as they were also failing because the network logon service could not start. Once we changed the registry settings back to the actual name of the server, the network logon service started normally again.

Thinking the issue was fixed, we began restarting the Exchange services, and the server crashed again when about half of them had started. We rebooted, got a couple more started, and then it crashed again.

MS then tried to disable 3rd party drivers and storage drivers (the ones that don't load in safe mode) but the server was unstable in that state. My MS engineer then quit for the night.

I had the data center run a full diagnostic on the hardware which came back clean.

I disabled all Exchange services again, and behold it has not crashed since.

So, any ideas?

I can't get the idea out of my head that it is related to RAM. This server is very undersized; it's running 8 GB RAM. Even with Exchange disabled 6.5 GB of RAM is used up just booting to the desktop. The server is a Dell PE2950 with 1 Quad-Core processor (2.33 GHz), and a 3 disk RAID 5 volume for the server. There is also a standalone drive which I use for local backups.

My thought was that as services were starting up and RAM was being given to processes, the system either hit a fault in a physical memory module or filled up the page file, and that this somehow caused the crash. Is this valid reasoning?

Another thought concerns the changed registry entry that was causing the network logon service to fail. The server name that appeared in the registry was generic, like WIN-67L5UNORI4I.

I scanned the security logs for failed logon attempts and I see similar PC names appearing from unexpected IP addresses (China, South Korea, Brazil, Germany).

Could someone have gained access and caused some damage that is making it crash?
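The failed-logon check described above can be sketched roughly as follows. This is only an illustration: it assumes the failed-logon events (event ID 4625 on this generation of Windows) have already been exported and reduced to (workstation name, source IP) pairs, and the function name, the regular expression, and the sample rows are all made up for the example. The pattern simply mirrors the shape of the name seen in the registry, WIN- followed by 11 letters or digits.

```python
import re
from collections import Counter

# Assumption: auto-generated names look like the one found in the registry,
# i.e. "WIN-" followed by 11 letters or digits (WIN-67L5UNORI4I).
DEFAULT_NAME = re.compile(r"^WIN-[A-Z0-9]{11}$")


def tally_default_names(failed_logons):
    """Count failed logons per (workstation, source IP) for generic WIN-* names.

    failed_logons is an iterable of (workstation_name, source_ip) pairs taken
    from an export of the Security log; how the export is produced is not
    covered here.
    """
    counts = Counter()
    for workstation, source_ip in failed_logons:
        name = workstation.upper()
        if DEFAULT_NAME.match(name):
            counts[(name, source_ip)] += 1
    return counts


# Hypothetical sample rows (documentation IP ranges, not real log data).
sample = [
    ("WIN-67L5UNORI4I", "203.0.113.7"),
    ("WIN-67L5UNORI4I", "203.0.113.7"),
    ("OFFICE-PC3", "192.0.2.10"),
]

for (name, ip), hits in tally_default_names(sample).most_common():
    print(name, ip, hits)
```

A tally like this only shows who was knocking on the door; it does not by itself show that any of those attempts succeeded.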

Disabling automatic restart enabled me to see the BSOD error screen. KERNEL_DATA_INPAGE_ERROR ... Technical Information: STOP: 0x0000007A (0xFFFFF6FC4000A9D0, 0xFFFFFFFFC000000E, 0x0000000137CDF860, 0xFFFFF8800153A758) ... *** Ntfs.sys - Address FFFFF8800153A758 at base FFFFF8800144C000, Datestamp 5167f5fc
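For anyone reading the STOP line, here is a minimal sketch of how it can be picked apart. It relies on Microsoft's documentation for bug check 0x7A, which describes parameter 2 as the error status (usually an I/O status code) and parameter 4 as the virtual address that could not be paged in. The function name and the dictionary are my own, and the hint texts are paraphrased from that documentation rather than quoted.

```python
import re

# Status codes that the bug check 0x7A documentation lists for parameter 2,
# with a paraphrase of the cause it suggests for each one.
STATUS_HINTS = {
    0xC000000E: "STATUS_NO_SUCH_DEVICE: disk went unavailable, bad drive, bad array, or bad controller",
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES: lack of nonpaged pool resources",
    0xC000009C: "STATUS_DEVICE_DATA_ERROR: bad blocks on the disk",
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED: cabling or termination fault, or controller cannot see the disk",
    0xC000016A: "STATUS_DISK_OPERATION_FAILED: bad blocks on the disk",
    0xC0000185: "STATUS_IO_DEVICE_ERROR: termination or cabling fault",
}


def decode_stop_7a(stop_line, module_base):
    """Decode a 'STOP: 0x0000007A (p1, p2, p3, p4' line from the blue screen."""
    values = [int(h, 16) for h in re.findall(r"0x([0-9A-Fa-f]+)", stop_line)]
    code, params = values[0], values[1:5]
    if code != 0x7A or len(params) != 4:
        raise ValueError("expected bug check 0x7A followed by four parameters")
    # The status is printed sign-extended to 64 bits; keep the low 32 bits.
    status = params[1] & 0xFFFFFFFF
    return {
        "status": status,
        "hint": STATUS_HINTS.get(status, "not in the documented list"),
        # Offset of the page that failed to load, relative to the module base.
        "module_offset": params[3] - module_base,
    }


info = decode_stop_7a(
    "STOP: 0x0000007A (0xFFFFF6FC4000A9D0, 0xFFFFFFFFC000000E, "
    "0x0000000137CDF860, 0xFFFFF8800153A758",
    module_base=0xFFFFF8800144C000,  # Ntfs.sys base address from the screen
)
print(hex(info["status"]), info["hint"])
print(hex(info["module_offset"]))
```

Read this way, the status is 0xC000000E (STATUS_NO_SUCH_DEVICE), and parameter 4 matches the Ntfs.sys address printed on the screen, so the kernel apparently failed to read a page of Ntfs.sys back in from disk. That would point toward the storage path (disk, array, or controller) rather than toward RAM, though it is not conclusive on its own.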

Note: this seems to have happened more than once while trying to start the Exchange RPC Client Access service. The service hangs on startup, but a process is created that keeps taking more and more RAM and then crashes the server.

Any advice you can give would be great.

Thanks!

exchange
windows-sbs
asked on Server Fault Oct 16, 2013 by mpethe • edited Oct 16, 2013 by Dave M

2 Answers

0

Blue screen / BSOD most common causes:

  • bad RAM (run a memory tester for a few hours to stress test all RAM)
  • other failing hardware (motherboard or another component)
  • corruption of a driver (least likely)

Use autoruns, and try to disable any drivers and services that you do not need.

However, in a situation like this, if a cleanup pass with autoruns doesn't nail it, the most frequent solution is to build a new server (new hardware, fresh install of the OS...).

0

Thank you, everyone, for contributing.

Even though the hardware diagnostics came back clean when the Data Center ran them, that was misleading.

We updated the firmware on all hardware devices. As soon as the server booted back up into Windows I noticed that one of the drives in the RAID array had failed. We swapped the drive and all of the issues disappeared.

I had to repair and remount the Exchange DB, but it's working fine now.

Some combination of the outdated firmware and a failing HD seems to be the culprit here.

When the server started with the Exchange services set to automatic, it tried to mount the DB. I suppose that meant accessing a portion of the failing HD, which caused the crash.

answered on Server Fault Oct 17, 2013 by mpethe

User contributions licensed under CC BY-SA 3.0