COM Runtime Breakdown in Multithreaded Server Application


We are experiencing intermittent catastrophic failures of the COM runtime in a large server application.

Here's what we have:

A server process running as a Windows service hosts numerous free-threaded COM components written in C++/ATL. Multiple client processes written in C++/MFC and .NET use these components via cross-process COM calls (including .NET interop) on the same machine. The OS is Windows Server 2008 Terminal Server (32-bit). The entire software suite was developed in-house, and we have the source code for all components. A tracing toolkit writes out errors and exceptions generated during operation.

What is happening:

After some random period of smooth sailing (5 days to 3 weeks) the server's COM runtime appears to fall apart with any combination of these symptoms:

  • RPC_E_INVALID_HEADER (0x80010111) - "OLE received a packet with an invalid header" returned to the caller on cross-process calls to server component methods
  • Calls to CoCreateInstance (CCI) fail for the CLSCTX_LOCAL_SERVER context
  • CoInitializeEx(COINIT_MULTITHREADED) calls fail with CO_E_INIT_TLS (0x80004006)

  • All in-process COM activity continues to run; CCI still works for CLSCTX_INPROC_SERVER.
  • The overall system remains responsive and SQL Server keeps working; there are no signs of problems outside of our service process.
  • System resources are OK: no memory leaks, no abnormal CPU usage, no thrashing.
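For readers who want to take the two HRESULTs above apart, here is a minimal portable sketch that splits a 32-bit HRESULT into its severity bit, facility and code, following the documented HRESULT bit layout. The names HresultParts and decode_hresult are made up for this illustration and are not part of any Windows header.

```cpp
#include <cstdint>

// Fields of a 32-bit HRESULT per the documented layout:
// bit 31 = severity (1 means failure), bits 16..26 = facility, bits 0..15 = code.
// HresultParts and decode_hresult are hypothetical names used only for this sketch.
struct HresultParts {
    bool failed;             // true when the severity bit is set
    std::uint16_t facility;  // e.g. 0 = FACILITY_NULL, 1 = FACILITY_RPC, 7 = FACILITY_WIN32
    std::uint16_t code;      // facility-specific status code
};

constexpr HresultParts decode_hresult(std::uint32_t hr) {
    return HresultParts{
        (hr & 0x80000000u) != 0,
        static_cast<std::uint16_t>((hr >> 16) & 0x7FFu),
        static_cast<std::uint16_t>(hr & 0xFFFFu)
    };
}
```

Applied to the values above, decode_hresult(0x80010111u) gives facility 1 (FACILITY_RPC) with code 0x111, and decode_hresult(0x80004006u) gives facility 0 (FACILITY_NULL) with code 0x4006. So the header error is reported by the RPC layer underneath COM, which fits the observation that only cross-process calls break.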

The only remedy is to restart the broken service.

Other (related) observations:

  • More CPU cores make things worse: a six-core Xeon box fails after roughly 5 days, while smaller boxes take 3 weeks or longer.
  • .NET interop might be involved: running a lot of calls across interop from .NET clients to unmanaged COM server components also shortens the time to failure.
  • Switching on the tracing code inside the server process extends the time until the next failure.

Tracing does introduce some partial synchronization and thus can hide multithreaded race condition effects. On the other hand, running on more cores with hyperthreading runs more threads in parallel and increases the failure rate.

Has anybody experienced similar behaviour or actually come across the RPC_E_INVALID_HEADER HRESULT? There is virtually no useful information to be found on that specific error and its potential causes. Are there ways to peek inside the COM runtime to obtain more useful information about COM's private resource usage, such as memory, handles, and synchronization primitives? Can the status of a process's TLS slots be monitored (CO_E_INIT_TLS)?

multithreading
debugging
com
com-interop
asked on Stack Overflow Jun 13, 2014 by comwerkstatt • edited Jun 17, 2014 by comwerkstatt

1 Answer


We are confident that we have pinned down the cause of this defect: a resource leak in the .NET Framework 4.0.

Installations of our server application running on .NET 4.0 (clr.dll: 4.0.30319.1) show the intermittent COM runtime breakdown and are easily fixed by updating the .NET Framework to version 4.5.1 (clr.dll: 4.0.30319.18444).

Here's how we identified the cause:

Searches on the web turned up an entry in an MSDN forum: http://social.msdn.microsoft.com/Forums/pt-BR/f928f3cc-8a06-48be-9ed6-e3772bcc32e8/windows-7-x64-com-server-ole32dll-threads-are-not-cleaned-up-after-they-end-causing-com-client?forum=vcmfcatl

The OP there described receiving the HRESULT RPC_X_BAD_STUB_DATA (0x800706f7) from CoCreateInstanceEx(CLSCTX_LOCAL_SERVER) after running a COM server with an interop app for some length of time (a month or so). He tracked the issue down to a thread resource leak that was observable indirectly via an incrementing variable inside ole32.dll, EventPoolEntry::s_initState, which causes CCI to fail once its value becomes 0xbfff...

An inspection of EventPoolEntry::s_initState in our faulty installations revealed that its value started out at approx. 0x8000 after a restart and then steadily gained between 100 and 200+ per hour with the app running under normal load. As soon as s_initState hit 0xbfff, the app failed with all the symptoms described in our original question. The OP in the MSDN forum suspected a COM thread-local resource leak because he observed unbalanced calls to thread initialization and thread cleanup: 5 x init vs. 3 x cleanup.
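As a rough plausibility check, the numbers above can be turned into a projected time to failure. The sketch below takes the figures at face value (start near 0x8000, failure at 0xbfff, 100 to 200 increments per hour) and assumes a constant rate, which real load will not provide. hours_until_limit is a hypothetical helper written for this illustration.

```cpp
#include <cstdint>

// Hours until a counter growing at ratePerHour climbs from start to limit.
// Returns -1.0 when no projection is possible (non-positive rate, or limit below start).
// Assumption: the growth rate is constant; hours_until_limit is a hypothetical helper.
double hours_until_limit(std::uint32_t start, std::uint32_t limit, double ratePerHour) {
    if (ratePerHour <= 0.0 || limit < start) {
        return -1.0;
    }
    return static_cast<double>(limit - start) / ratePerHour;
}
```

With these inputs the gap is 0xbfff - 0x8000 = 16383 increments. At 100 per hour that is about 164 hours (roughly 6.8 days), and at 200 per hour about 82 hours (roughly 3.4 days). That is the same order of magnitude as the shortest failure interval reported in the question; lighter load would stretch it accordingly.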

By automatically tracing the value of s_initState over the course of several days we were able to demonstrate that updating the .NET framework to 4.5.1 from the original 4.0 completely eliminates the leak.
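The before/after comparison boils down to a growth rate computed from logged samples. The sketch below shows only that arithmetic; how the value of s_initState is actually read out of the process (for example with a debugger and ole32.dll symbols) is outside its scope. Sample and growth_per_hour are hypothetical names for this illustration, not our actual tracing code.

```cpp
#include <cstdint>
#include <vector>

// One logged observation of the counter.
// Sample and growth_per_hour are hypothetical names used only for this sketch.
struct Sample {
    double hours;         // time since service start, in hours
    std::uint32_t value;  // counter value read at that time
};

// Average growth per hour between the first and the last sample.
// Returns 0.0 when there are fewer than two samples or no time has elapsed.
double growth_per_hour(const std::vector<Sample>& samples) {
    if (samples.size() < 2) {
        return 0.0;
    }
    const Sample& first = samples.front();
    const Sample& last = samples.back();
    const double elapsed = last.hours - first.hours;
    if (elapsed <= 0.0) {
        return 0.0;
    }
    return (static_cast<double>(last.value) - static_cast<double>(first.value)) / elapsed;
}
```

A log that goes from 0x8000 to 0x8000 + 3600 over 24 hours yields 150 per hour, which is in the leaking range described above. A log whose value stays put over the same period yields 0, which is what a leak-free run should look like.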

answered on Stack Overflow Jul 6, 2014 by comwerkstatt

User contributions licensed under CC BY-SA 3.0