.NET Windows Service crashes while dispatching a Windows Message

I have a major problem with a .NET Windows Service. It runs on multiple servers with very different configurations. The service crashes on some of the servers but is stable on others. The instability was introduced recently, but so far we have not identified the conditions that trigger it. We have servers running Windows 2003 / Windows 2003 R2 / Windows 2008. Most of them are fully updated.

We tried building the service against different target framework versions (2.0 / 3.5 / 4.0), but it doesn't make a difference. Machines that have an unstable service are unstable with every version of the framework. I've tried repairing the .NET frameworks, but that doesn't make a difference either. As far as I can tell, the entire service and its dependencies are managed code.

I've also tried running the server code in a command-line version. This seems to run stably, and we use it as a work-around for now. However, the problem is not related to the user account. The service normally runs as "Local Service". I've tried running it under the local Administrator account, which is the account that I use to run the command-line version, but the service is still unstable.

So far, I've been able to create a reproducible situation on one of the servers:

  • Start the service on the server.
  • Log on as a domain user in a new RDP session on the same server.
  • In that session, start our client software, which accesses our service over TCP remoting.
  • Close the client and the session.
  • Open a new RDP session with the domain user on the server.
  • Instant crash of the service!

Note that the service crashes at the moment the domain-user logs onto the new RDP session. Our client-software has not been run in that session at that point. If I don't open the client and access the service with TCP remoting in the first session, the service won't crash during the second logon. If I open the sessions as local Administrator, the service does not crash either.

I've been able to attach a native debugger (OllyDbg) to the crashing service. It crashes with an Access Violation when trying to execute at address 0x4bc4cee9. That address is the same on all servers and configurations (I've seen that address every time in the event log). I have looked at the stack of the crashing thread. The thread seems to be created just before the crash. First it loads Ole32.dll. It runs some code from Ole32 and then I see these functions being called:

  • User32.SetTimer
  • User32.GetMessageW
  • User32.TranslateMessage
  • User32.DispatchMessageW

The crash is somewhere in DispatchMessageW. I can see the *MSG argument for DispatchMessageW on the stack. It looks like this is passed:

  • hWnd = 0x00090082
  • Message = 0x0000001e
  • wParam = 0x00000000
  • lParam = 0x00000000
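
The Message value in that MSG can be looked up against the window message constants from WinUser.h. The following is only a minimal sketch, using Python as a portable scratchpad; the table is a small hand-picked subset of message IDs, and describe_message is a hypothetical helper name, not part of any Windows tooling:

```python
# Small, hand-picked subset of window message IDs as defined in WinUser.h.
WINDOW_MESSAGES = {
    0x0011: "WM_QUERYENDSESSION",
    0x0012: "WM_QUIT",
    0x0016: "WM_ENDSESSION",
    0x001A: "WM_SETTINGCHANGE",
    0x001E: "WM_TIMECHANGE",
    0x0113: "WM_TIMER",
    0x0219: "WM_DEVICECHANGE",
    0x02B1: "WM_WTSSESSION_CHANGE",
}


def describe_message(message_id: int) -> str:
    """Return the message ID in hex together with its symbolic name, if known."""
    name = WINDOW_MESSAGES.get(message_id, "unknown (not in this subset)")
    return f"0x{message_id:08x} = {name}"


# The Message field seen on the stack of the crashing thread.
print(describe_message(0x0000001E))  # prints "0x0000001e = WM_TIMECHANGE"
```

WM_TIMECHANGE carries no data in wParam or lParam, which matches the two zero values above.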

I've tried Spy++, but it doesn't seem to detect any hWnds in the Windows service.

So, the service receives this message, tries to parse and dispatch it and every time ends up calling 0x4bc4cee9, which is unmapped memory, and crashes.

EDIT: As per Hans' suggestion I investigated the systemevents. I debugged the service. I added an extra service to my service-executable, so that I could start the helper-service, then attach a debugger, and then start the real service. This way I am able to debug even the OnStart of the service. I placed breakpoints on SetWindowsHookA, SetWindowsHookW, SetWindowsHookExA and SetWindowsHookExW, but none of them was hit!?

EDIT 2: I checked all my notes and found that I had jumped to the wrong conclusions because of a typo in my notes :-S Anyway, the address of the crash is 0x4bc4cee9. At some point during execution, msado15.dll is loaded there. When the client disconnects from the server, I see 2 managed exceptions in the debugger. Shortly after that I see a WM_TIMER message, which is handled by the dispatcher and calls CoFreeUnusedLibraries(). That results in msado15.dll being unloaded. I opened msado15.dll in a disassembler and loaded the symbols from Microsoft. The DLL is part of Microsoft Data Access Components (MDAC) 2.8 SP1. The version is 2.82.4795.0, indicating it is the latest version, released in January 2011. There are Advise() and Unadvise() functions for ADOConnection and ADORecordset. Advise() calls InitAsyncEvents(), which calls RegisterClassEx(). The WndProc that is passed to RegisterClassEx() is FireEventOnMainThread(), which is at 0x4bc4cee9! I can see the function there! What should happen is that when the objects are disposed, Unadvise(), DestroyAsyncEvents() and UnregisterClass() are called. But somehow that is not happening. The DLL gets unloaded before it can unregister the classes, which results in a crash on the next event. This may somehow relate to the 2 managed exceptions. I will investigate further.

Stacktrace: http://pastebin.com/dsSjMe4Y

Log: http://pastebin.com/qD2MXvHd

I would really appreciate some guidance in this matter. For example, which process could be sending this message? How is it possible that the service dispatches it so completely wrong? And how can I avoid this?

Thank you, Heathcliff

.net
windows-services
crash
asked on Stack Overflow Oct 21, 2011 by Heathcliff • edited Oct 26, 2011 by Heathcliff

1 Answer

I found the problem. It took me almost 8 days to pin it down and create a work-around!

All ADODB versions up to 6.0 have a serious bug! ADODB 2.8 is part of MDAC 2.8 (for XP and Win2003), ADODB 6.0 is part of Vista/Win2008 and ADODB 6.1 is part of Win7/Win2008R2. The core DLL is msado15.dll.

When a Connection or Recordset object is instantiated, the DLL registers a window class with RegisterClass(), with a WndProc called __FireEventOnMainThread(). After all COM objects are disposed again, the reference count drops to 0. When Ole32!CoFreeUnusedLibraries() is called, it calls DllCanUnloadNow() on all loaded COM DLLs. DllCanUnloadNow() checks the reference count and, when it is 0, returns S_OK (0), indicating that the DLL can be unloaded.

In ADODB 6.1 (only released for Win7 and Win2008R2) Microsoft implemented a fix in DllCanUnloadNow(): it checks for the AsyncEventsWnd and, if it still exists, the DLL is not unloaded. But the real bug is still there in the COM object disposal: the reference count is decreased, but for some reason UnregisterClass() is not called. When the DLL has been unloaded and a broadcast message is sent, the application runs into an Access Violation, because the WndProc is no longer in memory. Crash!

In the case of the service, an Ole32!CDllHost is instantiated (not sure where). This class starts a timer with the TimerProc STAHostTimerProc(), firing every 300 seconds. STAHostTimerProc() calls CoFreeUnusedLibraries(). There are many different broadcast messages. For example, when a new user session is started on a terminal server, WM_TIMECHANGE is broadcast.

So, on machines with Windows up to Vista/Win2008, an application will crash when it does the following:

  • creates an ADODB.Connection or ADODB.Recordset,
  • creates an Ole32!CDllHost,
  • disposes all COM objects,
  • waits for the timer to unload msado15.dll,
  • and then receives a broadcast message.
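
To make the sequence easier to follow, here is a toy model of it. It is only a sketch in Python and makes no real COM or Win32 calls: ToyAdoModule, co_free_unused_libraries, broadcast and run_scenario are hypothetical names standing in for msado15.dll, the timer-driven CoFreeUnusedLibraries() call and the message dispatch, and the guard_on_window_class flag is my approximation of the 6.1 check as described above.

```python
S_OK, S_FALSE = 0, 1  # HRESULT values that DllCanUnloadNow can return


class ToyAdoModule:
    """Toy stand-in for msado15.dll; nothing here touches real COM."""

    def __init__(self, guard_on_window_class: bool):
        self.loaded = True
        self.object_count = 0                 # live Connection/Recordset objects
        self.window_class_registered = False  # class whose WndProc lives in this module
        self.guard_on_window_class = guard_on_window_class  # models the 6.1 check

    def create_object(self):
        self.object_count += 1
        self.window_class_registered = True

    def release_object(self):
        # The bug as described: the count drops, but the window class
        # is never unregistered.
        self.object_count -= 1

    def dll_can_unload_now(self) -> int:
        if self.object_count > 0:
            return S_FALSE
        if self.guard_on_window_class and self.window_class_registered:
            return S_FALSE
        return S_OK


def co_free_unused_libraries(module: ToyAdoModule) -> None:
    """Models what the periodic STA host timer triggers."""
    if module.loaded and module.dll_can_unload_now() == S_OK:
        module.loaded = False


def broadcast(module: ToyAdoModule) -> str:
    """Models delivery of a broadcast message such as WM_TIMECHANGE."""
    if not module.window_class_registered:
        return "no window class, nothing dispatched"
    if not module.loaded:
        return "access violation: WndProc points into unloaded module"
    return "handled"


def run_scenario(guard_on_window_class: bool) -> str:
    module = ToyAdoModule(guard_on_window_class)
    module.create_object()            # service uses ADO
    module.release_object()           # all COM objects disposed
    co_free_unused_libraries(module)  # timer fires
    return broadcast(module)          # new RDP session logs on


print("up to 6.0:", run_scenario(guard_on_window_class=False))
print("6.1 style:", run_scenario(guard_on_window_class=True))
```

In this model the first scenario ends in the simulated access violation and the second is handled, because the guarded DllCanUnloadNow keeps the module loaded while the window class is still registered.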

It's terrible that Microsoft fixed this in ADODB 6.1 but did not release a fix for earlier versions. All older operating systems are affected.

As a work-around, we will keep the reference count of the ADO COM objects from ever reaching 0 by creating a static ADODB.Connection object.
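
The idea behind the work-around can be sketched with a toy reference-counting model. This is Python used purely for illustration, not the service's actual code: ToyComModule, ToyComObject and KEEP_ALIVE are hypothetical names, and free_if_unused stands in for CoFreeUnusedLibraries() consulting DllCanUnloadNow().

```python
class ToyComModule:
    """Toy stand-in for msado15.dll: it unloads once no objects are alive."""

    def __init__(self):
        self.loaded = True
        self.live_objects = 0

    def create(self) -> "ToyComObject":
        self.live_objects += 1
        return ToyComObject(self)

    def free_if_unused(self) -> None:
        # Models the periodic timer asking whether the DLL may be unloaded.
        if self.live_objects == 0:
            self.loaded = False


class ToyComObject:
    def __init__(self, module: ToyComModule):
        self._module = module
        self._released = False

    def release(self) -> None:
        if not self._released:
            self._released = True
            self._module.live_objects -= 1


MODULE = ToyComModule()

# The work-around: one object created at startup and deliberately never released.
KEEP_ALIVE = MODULE.create()

# Normal work: short-lived objects come and go, and the timer keeps firing.
for _ in range(3):
    worker = MODULE.create()
    worker.release()
    MODULE.free_if_unused()

print("module still loaded:", MODULE.loaded)  # prints "module still loaded: True"
```

The trade-off is that the DLL stays loaded for the lifetime of the process, which is exactly what keeps the stale WndProc address valid.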

answered on Stack Overflow Oct 31, 2011 by Heathcliff

User contributions licensed under CC BY-SA 3.0