We are experiencing intermittent catastrophic failures of the COM runtime in a large server application.
Here's what we have:
A server process running as a Windows service hosts numerous free-threaded COM components written in C++/ATL. Multiple client processes written in C++/MFC and .NET use these components via cross-procces COM calls (incl .NET interop) on the same machine. The OS is Windows Server 2008 Terminal Server (32-bit). The entire software suite was developed in-house, we have the source code for all components. A tracing toolkit writes out errors and exceptions generated during operation.
What is happening:
After some random period of smooth sailing (5 days to 3 weeks) the server's COM runtime appears to fall apart with any combination of these symptoms:
CoInitializeEx(COINIT_MULTITHREADED) calls fail with CO_E_INIT_TLS (0x80004006)
All in-process COM activity continues to run, CCI works for CLSCTX_INPROC_SERVER.
The only remedy is to restart the broken service.
Other (related) observations:
Tracing does introduce some partial synchronization and thus can hide multithreaded race condition effects. On the other hand, running on more cores with hyperthreading runs more threads in parallel and increases the failure rate.
Has anybody experienced similar behaviour or even actually come accross the RPC_E_INVALID_HEADER HRESULT? There is virtually no useful information to be found on that specific error and its potential causes. Are there ways to peek inside the COM Runtime to obtain more useful information about COM's private resource pool usage like memory, handles, synchronization primitives? Can a process' TLS slot status be monitored (CO_E_INIT_TLS)?
We are confident to have pinned down the cause of this defect to a resource leak in the .NET framework 4.0.
Installations of our server application running on .NET 4.0 (clr.dll: 4.0.30319.1) show the intermittent COM runtime breakdown and are easily fixed by updating the .NET framework to version 4.5.1 (clr.dll: 4.0.30319.18444)
Here's how we identified the cause:
Searches on the web turned up an entry in an MSDN forum: http://social.msdn.microsoft.com/Forums/pt-BR/f928f3cc-8a06-48be-9ed6-e3772bcc32e8/windows-7-x64-com-server-ole32dll-threads-are-not-cleaned-up-after-they-end-causing-com-client?forum=vcmfcatl
The OP there described receiving the HRESULT RPC_X_BAD_STUB_DATA (0x800706f7) from CoCreateInstanceEx(CLSCTX_LOCAL_SERVER) after running a COM server with an interop app for some length of time (a month or so). He tracked the issue down to a thread resource leak that was observable indirectly via an incrementing variable inside ole32.dll : EventPoolEntry::s_initState that causes CCI to fail once its value becomes 0xbfff...
An inspection of EventPoolEntry::s_initState in our faulty installations revealed that its value started out at approx. 0x8000 after a restart and then constantly gained between 100 and 200+ per hour with the app running under normal load. As soon as s_initState hit 0xbfff, the app failed with all the symptoms described in our original question. The OP in the MSDN forum suspected a COM thread-local resource leak as he observed asymmetrical calls to thread initialization and thread cleanup - 5 x init vs. 3 x cleanup.
By automatically tracing the value of s_initState over the course of several days we were able to demonstrate that updating the .NET framework to 4.5.1 from the original 4.0 completely eliminates the leak.
User contributions licensed under CC BY-SA 3.0