Kubernetes .NET Application SocketExceptionFactory+ExtendedSocketException

We recently started encountering issues rolling out our .NET Core app to k8s in Azure: the application couldn't resolve hostnames, such as the name of our Azure database.

The problem seems intermittent: our old pods are still running fine, and even when we bounce them, they come back up without issue.

The error below looks like a Hangfire problem, but what actually fails is the domain name resolution.

Hangfire.SqlServer.SqlServerObjectsInstaller       - An exception occurred while trying to perform the migration. Retrying...
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000001, 11): Resource temporarily unavailable
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
   at System.Data.SqlClient.SNI.SNITCPHandle.Connect(String serverName, Int32 port, TimeSpan timeout)
   at System.Data.SqlClient.SNI.SNITCPHandle..ctor(String serverName, Int32 port, Int64 timerExpire, Object callbackObject, Boolean parallel)
   at System.Data.ProviderBase.DbConnectionPool.CheckPoolBlockingPeriod(Exception e)
   at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](DbConnection dedicatedConnection, Func`2 func)
   at Hangfire.SqlServer.SqlServerStorage.UseConnection(DbConnection dedicatedConnection, Action`1 action)
   at Hangfire.SqlServer.SqlServerStorage.Initialize()
ClientConnectionId:00000000-0000-0000-0000-000000000000
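
The `System.Net.Dns.GetHostAddresses` frame in the trace can be checked independently of Hangfire by resolving the hostname from a shell inside an affected pod. Below is a minimal sketch, assuming the container image ships `bash` and `getent`; `check_dns` is a hypothetical helper name, and repeating the lookup is only meant to surface intermittent failures.

```shell
#!/usr/bin/env bash
# check_dns is a hypothetical helper (not part of any tool mentioned above).
# It resolves a hostname several times and reports how many lookups failed,
# which helps expose intermittent resolution problems.
check_dns() {
  local host="$1" attempts="${2:-5}" failures=0 i
  for ((i = 1; i <= attempts; i++)); do
    # getent uses the system resolver, the same path most runtimes rely on.
    if ! getent hosts "$host" > /dev/null; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures/$attempts lookups failed for $host"
  # Succeed only if every lookup resolved.
  [ "$failures" -eq 0 ]
}
```

To use it, open a shell in the pod (for example with `kubectl exec -it <pod name> -- bash`), paste the function, and run `check_dns <your database hostname>`. A non-zero failure count would point at name resolution rather than at SQL Server or Hangfire.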
.net
azure
kubernetes

1 Answer

It turned out that the issue was caused by our service principal credentials, which had lapsed; this happens annually. This article explains how to update your service principal.
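
Before resetting anything, it may be worth confirming that the credential really has lapsed. The sketch below is only an illustration under some assumptions: `days_until` is a hypothetical helper, it relies on GNU `date` (available on Linux and in Git Bash, not in the stock macOS `date`), and the end date has to be copied by hand from the CLI output.

```shell
#!/usr/bin/env bash
# days_until is a hypothetical helper: it prints the number of whole days
# between "now" and an ISO 8601 end date. A negative result means the date
# has already passed. The optional second argument overrides "now" (epoch
# seconds). Requires GNU date for the -d option.
days_until() {
  local end_date="$1"
  local now_epoch="${2:-$(date -u +%s)}"
  local end_epoch
  end_epoch=$(date -u -d "$end_date" +%s) || return 1
  echo $(( (end_epoch - now_epoch) / 86400 ))
}
```

The end date itself can be read from `az ad sp credential list --id <service principal ID> -o json`; the name of the end-date field in that output varies between Azure CLI versions, so check the JSON rather than relying on a fixed query. Passing that value to `days_until` and getting a negative number back is consistent with the expired-credential explanation above.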

TL;DR

Run the bash script below. On Windows, it runs in Git Bash as long as the Azure CLI is installed. It won't work in PowerShell.

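# Resource group and cluster name (replace the placeholders).
RESOURCE="<your resource group>"
NAME="<cluster name>"
# Look up the ID of the service principal the cluster uses.
SP_ID=$(az aks show --resource-group "$RESOURCE" --name "$NAME" --query servicePrincipalProfile.clientId -o tsv)
# Generate a new secret (newer Azure CLI versions take --id instead of --name here).
SP_SECRET=$(az ad sp credential reset --name "$SP_ID" --query password -o tsv)
# Update the cluster with the new credentials.
az aks update-credentials --resource-group "$RESOURCE" --name "$NAME" --reset-service-principal --service-principal "$SP_ID" --client-secret "$SP_SECRET"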

Note: The final command can take five minutes or more to run. I killed my process, but the update still completed successfully.

answered on Stack Overflow Dec 10, 2020 by André Hauptfleisch • edited Dec 10, 2020 by André Hauptfleisch

User contributions licensed under CC BY-SA 3.0