How to determine root cause for Communication link failure TCP Provider: The specified network name is no longer available?

25

Here is my latest effort at revising this question. But this time, I am trying to follow the good counsel given by Oded in his article Getting good answers on StackOverflow.

I need to find out how I can determine the root cause for the following error:

Communication link failure

TCP Provider: The specified network name is no longer available

From time to time, I am seeing this error when running a set of SSIS packages. This error can occur when one to many packages are run from:

  1. A SQL Server Agent Job
  2. A batch file
  3. In debug mode from BIDS

The full error message I see is as follows:

SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available. Source: "Microsoft SQL Server Native Client 10.0" Hresult: 0x80004005 Description: "Communication link failure". An OLE DB record is available. Source: "Microsoft SQL Server Native Client 10.0" Hresult: 0x80004005 Description: "TCP Provider: The specified network name is no longer available. ".

SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available. Source: "Microsoft SQL Server Native Client 10.0" Hresult: 0x80004005 Description: "Protocol error in TDS stream". An OLE DB record is available. Source: "Microsoft SQL Server Native Client 10.0" Hresult: 0x80004005 Description: "Communication link failure". An OLE DB record is available. Source: "Microsoft SQL Server Native Client 10.0" Hresult: 0x80004005 Description: "TCP Provider: An existing connection was forcibly closed by the remote host."

This is an overview of how I have designed the ETL process:

  • Two servers
  • Both are virtual machines
  • The SSIS packages run on an application server
  • The SQL Server database lives on a database server

I use an OLE DB connection manager to connect from the SSIS package on the application server to the SQL Server database on the database server.

The packages run as a file system deployment on the application server and not as a database deployment on the database server.

The main reason for this is that the ETL is integrated with a set of tools no found on and drives not accessible to the database server. These tools include Apex Data Loader for Salesforce and pgAdmin III.

So far I cannot consistently reproduce this error. However, this is what I have observed:

  • Failure occurs more frequently during regular business hours
  • Failure occurs less frequently during off hours

For about a two hour period on a Friday morning I was able to successfully reproduce the error on a specific package.

The error occurred during a large data flow if a child package call that precedes the large data flow was enabled.

The error did not occur during the same large data flow if the child package call that precedes the large data flow was disabled.

The child package in question calls back to the database to retrieve a tiny amount of information for use in an email body and then sends the email.

It feels like maybe a resource limit is being exceeded?

Maybe a connection limit?

I am wondering what tools I should be using to try and determine the root cause of the error.

Technical details about the two servers involved are listed below:

SQL Server and Database Server info:
Microsoft SQL Server 2008 R2 (SP1) - 10.50.2500.0 (X64) Jun 17 2011 00:54:03 Copyright (c) Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1) (Hypervisor)

SSIS info:
Microsoft Visual Studio 2008 Version 9.0.30729.1 SP Microsoft .NET Framework Version 3.5 SP1

Application Server info:
OS name: Microsoft Windows Server 2008 R2 Standard Version: 6.1.7601 Service Pack 1 Build 7601

I have researched the error message online and found these, but would really like to get an expert's insight before proceeding:

Any help is appreciated.

Thanks

UPDATE:

Further testing shows that this is not "an SSIS thing" as the same error is seen at the same rate when using SQL Server Management Studio. The complexity of the query does not make the error more or less likely. In an attempt to resolve, we have tried one fix (below):

This was our first attempt. TCP Chimney is now disabled on the Application Server and the Database Server. Testing shows that the same error occurs at the same rate.

So where to go from here? Honestly I am not sure. One seemingly good option remains:

  • Application Server and Database Server SQL Server installations do not exactly match
  • Application Server = SQL Server 2008 (SP1) - 10.0.2531.0 (X64)
  • Database Server = SQL Server 2008 R2 (SP1) - 10.50.2500.0 (X64)

The plan is to upgrade the SQL Server installation on the Application server. Its kind of a hit and hope, but at this point this seems like the best option. Something in my brain tells me that this might be solved by fixing a hardware issue (by that I mean a repair or a replace) and that there might not be anything that hardware and software configuration can do about it.

However, I am still not sure how to go about determining a root cause. I am still left wondering what tools I should be using to diagnose the root cause.

sql-server-2008
networking
ssis
windows-server-2008
virtual-machine
asked on Stack Overflow May 21, 2013 by Jon Jaussi • edited Feb 16, 2019 by Hadi

4 Answers

1

Do you have AV software on Application Server side? If yes, try to disable AV - sometime AV blocks TCP/IP traffic. Issue with "The specified network name is no longer available" was solved by disabling AV here: https://community.spiceworks.com/topic/239423-the-specified-network-name-is-no-longer-available-while-writing-to-shared-dir

answered on Stack Overflow May 23, 2018 by Serge
1

The error message indicates that the connection was forcibly closed. You have also mentioned this occur when you run a lot of jobs. Chances are, the network firewall is to blame. You should reach out to the firewall administrator to research the logs in order to see if the firewall closed the connection. If this is the case then two potential solutions exist:

  1. Add an exception to whatever firewall rule is being triggered and causing the connection to be closed.
  2. Stop running so many jobs concurrently. You should consider running them in sequence. This also adheres to the idea of being a good network citizen.
answered on Stack Overflow Nov 28, 2018 by J Weezy
0
  1. First off did you tried to remove the large send offload setting on the nic ?
  2. Second point, can you run a wireshark to capture the packets if you can reproduce the error ?
  3. Third point, did you tried to change the vnic from the VM ? some model can cause issue. (If you use vmxnet3, try e1000, etc..)
  4. Last point, does you have a vswitch between them, they are on same host, a physical switch between, etc... A badly configured switch can drop traffic, if inside the host the same host and same vswitch it's the best test, as the traffic never leave the server.
answered on Stack Overflow Dec 2, 2014 by yagmoth555 • edited Jan 31, 2017 by Shakeer Mirza
-2

Try using ODBC instead of OLE DB for your connection to the database.

answered on Stack Overflow May 17, 2017 by Michael Sims

User contributions licensed under CC BY-SA 3.0