Exchange 2016 on Windows Server 2012 BSoD

1

I have an Exchange 2016 server that blue-screens (BSoD) roughly every 14 days. The server is virtual and runs in a clustered VMware environment with storage over iSCSI. None of our other Windows servers, including the one hosting the passive copy of Exchange, blue-screen. The passive Exchange copy is being backed up, and the backup truncates the transaction logs as it should on both the passive and active nodes.

  • I have tried installing the latest critical patches (none of the optional ones yet).
  • I have tried migrating the VM in question to a new host.

Here is the information the BSoD viewer gives me:

052716-21921-01.dmp 27.05.2016 10:22:16 CRITICAL_PROCESS_DIED   0x000000ef  ffffe000`de10d080   00000000`00000000   00000000`00000000   00000000`00000000   ntoskrnl.exe    ntoskrnl.exe+14e3a0 NT Kernel & System  Microsoft® Windows® Operating System    Microsoft Corporation   6.3.9600.18289 (winblue_ltsb.160328-1315)   x64 ntoskrnl.exe+14e3a0                 C:\Windows\Minidump\052716-21921-01.dmp 8   15  9600    138 150 27.05.2016 10:22:47 
051516-25765-01.dmp 15.05.2016 10:11:06 CRITICAL_PROCESS_DIED   0x000000ef  ffffe001`0ad80900   00000000`00000000   00000000`00000000   00000000`00000000   ntoskrnl.exe    ntoskrnl.exe+14e3a0 NT Kernel & System  Microsoft® Windows® Operating System    Microsoft Corporation   6.3.9600.18289 (winblue_ltsb.160328-1315)   x64 ntoskrnl.exe+14e3a0                 C:\Windows\Minidump\051516-25765-01.dmp 8   15  9600    138 150 15.05.2016 10:11:41 
042816-19328-01.dmp 28.04.2016 22:36:50 CRITICAL_PROCESS_DIED   0x000000ef  ffffe001`3da4f900   00000000`00000000   00000000`00000000   00000000`00000000   ntoskrnl.exe    ntoskrnl.exe+14e8a0 NT Kernel & System  Microsoft® Windows® Operating System    Microsoft Corporation   6.3.9600.18289 (winblue_ltsb.160328-1315)   x64 ntoskrnl.exe+14e8a0                 C:\Windows\Minidump\042816-19328-01.dmp 8   15  9600    294 472 28.04.2016 22:39:45 
041916-23859-01.dmp 19.04.2016 08:43:53 CRITICAL_PROCESS_DIED   0x000000ef  ffffe001`23101900   00000000`00000000   00000000`00000000   00000000`00000000   ntoskrnl.exe    ntoskrnl.exe+14e8a0 NT Kernel & System  Microsoft® Windows® Operating System    Microsoft Corporation   6.3.9600.18289 (winblue_ltsb.160328-1315)   x64 ntoskrnl.exe+14e8a0                 C:\Windows\Minidump\041916-23859-01.dmp 8   15  9600    294 472 19.04.2016 08:47:04 

I saw a post about the same problem on a different site, but nobody actually answered it and the post aged out.

Does anyone have any pointers on how to fix this? Would I have to install another Exchange server and migrate to it? That would be very unfortunate.

exchange
vmware-esxi
windows-server-2012-r2
exchange-2016
asked on Server Fault May 28, 2016 by xstnc

2 Answers

5

Your storage system is failing or too slow to keep up. If I/O stalls for too long, Exchange concludes that the storage is dead and kills wininit.exe to force a hard reset.

See https://technet.microsoft.com/en-us/library/ff625233.aspx and scroll to the end. It's the same for 2013 and 2016.

In some cases, the entire storage stack may be affected by the hang, making it impossible to write failure events to the crimson channel or any other area of the Windows Event Log. ESE also monitors the crimson channel by verifying that the event log can be written to. If writing to the event log fails for a long period of time, MSExchangeRepl intentionally causes a bugcheck of Windows by terminating wininit.exe. When the operating system I/O is hung, the system is obviously unable to write any ESE events to the event log.

I have experienced this firsthand when using Windows Server Backup to back up Exchange. When the backup begins, it runs a consistency check on all databases in parallel. This caused Exchange to BSoD after a few minutes when the storage dropped out.

The first solution is to disable the ATS heartbeat to the storage array: https://kb.vmware.com/kb/2113956

The text is too long to copy, but TL;DR: your storage array connection may be dropped under heavy I/O when the 8-second ATS heartbeat times out. That causes an I/O timeout in the VM, which in turn causes Exchange to BSoD.
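As a rough sketch of what the KB change looks like from PowerCLI: the host name below is a placeholder, the snippet assumes PowerCLI is installed and a `Connect-VIServer` session already exists, and the option name should be checked against the KB for your ESXi build before running anything. The setting is per host, so it would need repeating on every host that mounts the datastore.

```powershell
# Assumes VMware PowerCLI and an existing Connect-VIServer session.
# 'esx01.example.local' is a placeholder host name, not from the original post.
$vmhost = Get-VMHost -Name 'esx01.example.local'

# Show the current value first (1 means ATS heartbeat is in use).
Get-AdvancedSetting -Entity $vmhost -Name 'VMFS3.UseATSForHBOnVMFS5'

# Disable ATS heartbeat on VMFS5 datastores for this host.
Get-AdvancedSetting -Entity $vmhost -Name 'VMFS3.UseATSForHBOnVMFS5' |
    Set-AdvancedSetting -Value 0 -Confirm:$false
```

It is worth confirming with the storage vendor that disabling ATS heartbeat is appropriate for the array before applying this.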

The second solution is to add storage controllers to the VM and distribute the database disks between them. In my case, a single pvscsi controller would choke badly under 6 databases, but once the disks (including the OS disk etc.) were distributed over 4 pvscsi controllers, the issues disappeared. I don't have a reference for that, just personal experience on vSphere 5.5 U3.
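One way this could be scripted, as a hedged PowerCLI sketch rather than the exact procedure I used: the VM and disk names are placeholders, the VM is assumed to be powered off, and a VM supports at most four SCSI controllers. Each `New-ScsiController` call creates a new controller and attaches the given disk to it, so it suits giving a disk its own controller, not adding more disks to an existing one.

```powershell
# Assumes VMware PowerCLI, an existing Connect-VIServer session, and a powered-off VM.
# 'EXCH01' and 'Hard disk 3' are placeholder names, not from the original post.
$vm   = Get-VM -Name 'EXCH01'
$disk = Get-HardDisk -VM $vm -Name 'Hard disk 3'

# Create a new paravirtual SCSI controller and attach the disk to it.
New-ScsiController -HardDisk $disk -Type ParaVirtual

# Review the resulting controller layout.
Get-ScsiController -VM $vm | Select-Object Name, Type
```

Take a backup or snapshot of the VM configuration first, and check inside Windows afterwards that all volumes come online with their expected drive letters or mount points.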

answered on Server Fault May 28, 2016 by Don Zoomik • edited May 28, 2016 by Don Zoomik

3

You can issue a command to disable the ESE forced reboot. The cause is well explained in Don's answer.

I did this recently for a customer with a single server on ESX, as the I/O load was overwhelming Exchange. (It is still overwhelming it, as it takes ages simply to open a management console, for example, but at least it doesn't reboot.)

Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.712.24"

In that command you need to replace the -ApplyVersion value with your own Exchange version.

See here for Exchange versions: https://technet.microsoft.com/en-us/library/hh135098(v=exchg.150).aspx
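To find the version of the server you are working on, something like the following can be run in the Exchange Management Shell. This is a minimal sketch: the build reported by `AdminDisplayVersion` still has to be matched against the build list linked above to get the string for `-ApplyVersion`. Note that Exchange 2016 builds start with 15.1, so the 15.0 value in the command above is only an example.

```powershell
# Run in the Exchange Management Shell.
# Lists each Exchange server with its edition and installed build.
Get-ExchangeServer | Format-List Name, Edition, AdminDisplayVersion

# After adding the override, list the global overrides to confirm it was registered.
Get-GlobalMonitoringOverride
```

If the override is no longer wanted later, `Remove-GlobalMonitoringOverride` is the matching cmdlet for removing it.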

See here for further detail: http://www.tecfused.com/2014/11/exchange-2013-dag-bsod/

answered on Server Fault May 28, 2016 by yagmoth555

User contributions licensed under CC BY-SA 3.0