GKE pods have connection issues under high request load

I have a GKE v1.17.400 private cluster running behind a NAT gateway. On the cluster I have multiple applications that use Google services such as Stackdriver, Pub/Sub and Cloud SQL.

My applications run on .NET Core 2.2. They subscribe and publish to a Pub/Sub topic.

Under high load, I experience connectivity issues with Google Cloud services.

These issues produce many different log errors, such as:

Connect timeout with Cloud SQL:

MySql.Data.MySqlClient.MySqlException (0x80004005): Connect Timeout expired. ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at MySqlConnector.Core.ServerSession.ConnectAsync(ConnectionSettings cs, ILoadBalancer loadBalancer, IOBehavior ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\Core\ServerSession.cs:line 360
   at MySqlConnector.Core.ConnectionPool.GetSessionAsync(MySqlConnection connection, IOBehavior ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\Core\ConnectionPool.cs:line 112
   at MySqlConnector.Core.ConnectionPool.GetSessionAsync(MySqlConnection connection, IOBehavior ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\Core\ConnectionPool.cs:line 141
   at MySql.Data.MySqlClient.MySqlConnection.CreateSessionAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlConnection.cs:line 507
   at MySql.Data.MySqlClient.MySqlConnection.CreateSessionAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlConnection.cs:line 523
   at MySql.Data.MySqlClient.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlConnection.cs:line 232
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenDbConnectionAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
   at Pomelo.EntityFrameworkCore.MySql.Storage.Internal.MySqlRelationalConnection.BeginTransactionAsync(IsolationLevel isolationLevel, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransactionAsync(CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(DbContext _, ValueTuple`2 parameters, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IReadOnlyList`1 entriesToSave, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)

Pod health checks fail, and sometimes the pods get restarted:

fail: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]
      Health check Users-Database completed after 18010.9435ms with status Unhealthy and 'FAILED to access users table.'

Issues with Google Cloud Storage and Google Stackdriver:

Unable to log to provider GoogleStackdriverLogProvider, ex: Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Getting metadata from plugin failed with error: Exception occurred in metadata credentials plugin. System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Security.SslState.ThrowIfExceptional()
   at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
   at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
   at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Google.Apis.Http.ConfigurableMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Google.Apis.Auth.OAuth2.Requests.TokenRequestExtenstions.ExecuteAsync(TokenRequest request, HttpClient httpClient, String tokenServerUrl, CancellationToken taskCancellationToken, IClock clock, ILogger logger)
   at Google.Apis.Auth.OAuth2.ServiceAccountCredential.RequestAccessTokenAsync(CancellationToken taskCancellationToken)
   at Google.Apis.Auth.OAuth2.TokenRefreshManager.RefreshTokenAsync()
   at Google.Apis.Auth.OAuth2.TokenRefreshManager.GetAccessTokenForRequestAsync(CancellationToken cancellationToken)
   at Google.Apis.Auth.OAuth2.ServiceAccountCredential.GetAccessTokenForRequestAsync(String authUri, CancellationToken cancellationToken)
   at Google.Apis.Auth.OAuth2.ServiceCredential.GetAccessTokenWithHeadersForRequestAsync(String authUri, CancellationToken cancellationToken)
   at Grpc.Auth.GoogleAuthInterceptors.<>c__DisplayClass3_0.<<FromCredential>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Grpc.Core.Internal.NativeMetadataCredentialsPlugin.GetMetadataAsync(AuthInterceptorContext context, IntPtr callbackPtr, IntPtr userDataPtr)", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1611499756.899018095","description":"Getting metadata from plugin failed with error: Exception occurred in metadata credentials plugin. System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n   at 
System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Net.Security.SslState.ThrowIfExceptional()\n   at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)\n   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)\n   at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)\n   at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)\n   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, 
SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)\n   --- End of inner exception stack trace ---\n   at Google.Apis.Http.ConfigurableMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)\n   at Google.Apis.Auth.OAuth2.Requests.TokenRequestExtenstions.ExecuteAsync(TokenRequest request, HttpClient httpClient, String tokenServerUrl, CancellationToken taskCancellationToken, IClock clock, ILogger logger)\n   at Google.Apis.Auth.OAuth2.ServiceAccountCredential.RequestAccessTokenAsync(CancellationToken taskCancellationToken)\n   at Google.Apis.Auth.OAuth2.TokenRefreshManager.RefreshTokenAsync()\n   at Google.Apis.Auth.OAuth2.TokenRefreshManager.GetAccessTokenForRequestAsync(CancellationToken cancellationToken)\n   at Google.Apis.Auth.OAuth2.ServiceAccountCredential.GetAccessTokenForRequestAsync(String authUri, CancellationToken cancellationToken)\n   at Google.Apis.Auth.OAuth2.ServiceCredential.GetAccessTokenWithHeadersForRequestAsync(String authUri, CancellationToken cancellationToken)\n   at Grpc.Auth.GoogleAuthInterceptors.<>c__DisplayClass3_0.<<FromCredential>b__0>d.MoveNext()\n--- End of stack trace from previous location where exception was thrown ---\n   at Grpc.Core.Internal.NativeMetadataCredentialsPlugin.GetMetadataAsync(AuthInterceptorContext context, IntPtr callbackPtr, IntPtr userDataPtr)","file":"/var/local/git/grpc/src/core/lib/security/credentials/plugin/plugin_credentials.cc","file_line":93,"grpc_status":14}")
   at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass0_0`2.<<WithRetry>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at MyCode.Utils.LogService.Providers.GoogleStackdriverLogProvider.WriteAsync(IEnumerable`1 entries)
   at MyCode.Utils.LogService.Logger.WritePendingEntries()
875 Information Error while calling `UploadGZipObjectAsync`. Retry (00:00:08) taking place. Exception=System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream.
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Security.SslState.ThrowIfExceptional()
   at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
   at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
   at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Google.Cloud.Storage.V1.StorageClientImpl.UploadHelper.CheckFinalProgress()
   at Google.Cloud.Storage.V1.StorageClientImpl.UploadHelper.ExecuteAsync(CancellationToken cancellationToken)

It seems there are multiple related connectivity issues, all transient, that appear only under high load.

My Investigation

I looked at the dmesg output of pods that have connectivity issues and got this:

[Tue Jan 26 15:44:30 2021] systemd-journald[120]: Received request to flush runtime journal from PID 1
[Tue Jan 26 15:44:31 2021] EXT4-fs (sda1): resizing filesystem from 1533435 to 25126395 blocks
[Tue Jan 26 15:44:33 2021] EXT4-fs (sda1): resized filesystem to 25126395
[Tue Jan 26 15:44:40 2021] Bridge firewalling registered
[Tue Jan 26 15:44:40 2021] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[Tue Jan 26 15:44:43 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:43 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:46 2021] EXT4-fs (sda1): re-mounted. Opts: commit=30
[Tue Jan 26 15:44:49 2021] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[Tue Jan 26 15:44:50 2021] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
[Tue Jan 26 15:44:50 2021] IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
[Tue Jan 26 15:44:50 2021] IPVS: ipvs loaded.
[Tue Jan 26 15:44:50 2021] IPVS: [rr] scheduler registered.
[Tue Jan 26 15:44:50 2021] IPVS: [wrr] scheduler registered.
[Tue Jan 26 15:44:50 2021] IPVS: [sh] scheduler registered.
[Tue Jan 26 15:44:53 2021] IPv6: ADDRCONF(NETDEV_UP): veth1e78fb72: link is not ready
[Tue Jan 26 15:44:53 2021] IPv6: ADDRCONF(NETDEV_CHANGE): veth1e78fb72: link becomes ready
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth1e78fb72 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 1(veth1e78fb72) entered forwarding state
[Tue Jan 26 15:44:53 2021] device cbr0 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth51bc9563 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 2(veth51bc9563) entered forwarding state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered disabled state
[Tue Jan 26 15:44:53 2021] device veth902031c6 entered promiscuous mode
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered blocking state
[Tue Jan 26 15:44:53 2021] cbr0: port 3(veth902031c6) entered forwarding state
[Wed Jan 27 07:45:00 2021] python3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=609
[Wed Jan 27 07:45:00 2021] python3 cpuset=e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4 mems_allowed=0
[Wed Jan 27 07:45:00 2021] CPU: 1 PID: 353763 Comm: python3 Not tainted 4.19.150+ #1
[Wed Jan 27 07:45:00 2021] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[Wed Jan 27 07:45:00 2021] Call Trace:
[Wed Jan 27 07:45:00 2021]  dump_stack+0x61/0x96
[Wed Jan 27 07:45:00 2021]  dump_header+0x76/0x3a0
[Wed Jan 27 07:45:00 2021]  oom_kill_process+0xb1/0x280
[Wed Jan 27 07:45:00 2021]  out_of_memory+0x30a/0x4b0
[Wed Jan 27 07:45:00 2021]  try_charge+0x6b8/0x9c0
[Wed Jan 27 07:45:00 2021]  mem_cgroup_try_charge+0x1d7/0x220
[Wed Jan 27 07:45:00 2021]  mem_cgroup_try_charge_delay+0x1e/0x40
[Wed Jan 27 07:45:00 2021]  handle_mm_fault+0xeeb/0x1640
[Wed Jan 27 07:45:00 2021]  __do_page_fault+0x25f/0x480
[Wed Jan 27 07:45:00 2021]  ? page_fault+0x8/0x30
[Wed Jan 27 07:45:00 2021]  page_fault+0x1e/0x30
[Wed Jan 27 07:45:00 2021] RIP: 0033:0x7f8ab337dd16
[Wed Jan 27 07:45:00 2021] Code: 8e c0 01 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 c5 fd e7 07 <c5> fd e7 4f 20 c5 fd e7 57 40 c5 fd e7 5f 60 48 81 c7 80 00 00 00
[Wed Jan 27 07:45:00 2021] RSP: 002b:00007f88b2ff3568 EFLAGS: 00010202
[Wed Jan 27 07:45:00 2021] RAX: 00007f88976de058 RBX: 00000000000000e4 RCX: 00007f889c0af12c
[Wed Jan 27 07:45:00 2021] RDX: 000000000060c0ec RSI: 00007f88a05b8048 RDI: 00007f889baa2fe0
[Wed Jan 27 07:45:00 2021] RBP: 00007f889c1f3040 R08: fffffffffffffff8 R09: 0000000000000000
[Wed Jan 27 07:45:00 2021] R10: 00007f889c0af14c R11: 00007f88976de058 R12: 0000000000000000
[Wed Jan 27 07:45:00 2021] R13: 00007f88976de058 R14: 00000000000000a4 R15: 312f67726f2e3377
[Wed Jan 27 07:45:00 2021] Task in /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4 killed as a result of limit of /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3
[Wed Jan 27 07:45:00 2021] memory: usage 10485700kB, limit 10485760kB, failcnt 123
[Wed Jan 27 07:45:00 2021] memory+swap: usage 10485700kB, limit 9007199254740988kB, failcnt 0
[Wed Jan 27 07:45:00 2021] kmem: usage 80624kB, limit 9007199254740988kB, failcnt 0
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/db170fc2e6dfbc965920076d9c1c264724b871b1769612889e6a6687779d8072: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Memory cgroup stats for /kubepods/burstable/pode8528ae7-be9f-4b0e-b6e8-fc8b298b3ba3/e7bf5c5a2cb0af87a765c718705dfe197f0e462e955f64e380511e1a6101b6b4: cache:0KB rss:10404888KB rss_huge:1142784KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:10405016KB inactive_file:0KB active_file:0KB unevictable:0KB
[Wed Jan 27 07:45:00 2021] Tasks state (memory values in pages):
[Wed Jan 27 07:45:00 2021] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Jan 27 07:45:00 2021] [   2092]     0  2092      256        1    32768        0          -998 pause
[Wed Jan 27 07:45:00 2021] [   4016]     0  4016     1160      224    61440        0           609 startup.sh
[Wed Jan 27 07:45:00 2021] [   4049]     0  4049  2458626  1604375 14508032        0           609 python3
[Wed Jan 27 07:45:00 2021] [   4926]     0  4926  2525494  1667890 15065088        0           609 python3
[Wed Jan 27 07:45:00 2021] [   4927]     0  4927  2499074  1637148 14647296        0           609 python3
[Wed Jan 27 07:45:00 2021] [   4928]     0  4928  2503558  1637534 14741504        0           609 python3
[Wed Jan 27 07:45:00 2021] Memory cgroup out of memory: Kill process 4926 (python3) score 1246 or sacrifice child
[Wed Jan 27 07:45:00 2021] Killed process 4926 (python3) total-vm:10101976kB, anon-rss:6560464kB, file-rss:111096kB, shmem-rss:0kB
[Wed Jan 27 07:45:00 2021] oom_reaper: reaped process 4926 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Does this seem related?

I checked the NAT gateway to see whether I have enough ports for my nodes. I have 2 IP addresses and about 50 nodes. It runs with a minimum ports per VM of 128 (I tried increasing it to 1024, which didn't fix the problem), which should be enough for roughly 1,000 instances. So this does not seem to be the issue here.
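
The port arithmetic above can be checked with a small sketch. It assumes each NAT IP address offers 64,512 usable source ports (65,536 minus the first 1,024), which is the figure Cloud NAT documents; the class and method names are hypothetical and only for illustration.

```csharp
public static class NatPortMath
{
    // Assumption: 64,512 usable source ports per NAT IP address
    // (65,536 total minus the 1,024 well-known ports).
    public const int PortsPerNatIp = 64512;

    // Number of VMs that can each be given minPortsPerVm ports
    // from a pool of natIps addresses.
    public static int MaxVms(int natIps, int minPortsPerVm)
    {
        return natIps * PortsPerNatIp / minPortsPerVm;
    }
}
```

With 2 IP addresses, `NatPortMath.MaxVms(2, 128)` gives 1008 and `NatPortMath.MaxVms(2, 1024)` gives 126, so both settings cover about 50 nodes in terms of VM count.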

Questions

How can I solve this issue?

How can I investigate it more?

docker
.net-core
google-cloud-platform
google-kubernetes-engine
google-cloud-sql
asked on Stack Overflow Jan 27, 2021 by Montoya

1 Answer

We discovered that the issue occurred because we were using C#'s built-in ManualResetEvent from async code. It seems to have caused some sort of deadlock among the application's threads.

Using SemaphoreSlim instead fixed the issue.
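
As a rough illustration of the change (not our actual code), here is a minimal sketch. `ManualResetEvent.WaitOne()` blocks the calling thread, so calling it from async code can tie up thread-pool threads under load, whereas `SemaphoreSlim.WaitAsync()` can be awaited without blocking. The class name, the gate size of 1 and the `Task.Delay` placeholder are assumptions made for the example.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class AsyncGateSketch
{
    // Hypothetical gate: at most one caller in the protected section at a time.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private int _processed;

    // Stands in for a handler that previously called ManualResetEvent.WaitOne().
    public async Task HandleMessageAsync()
    {
        // Awaiting WaitAsync returns the thread to the pool while waiting.
        await _gate.WaitAsync();
        try
        {
            await Task.Delay(10); // placeholder for real async I/O
            _processed++;         // safe: only one caller holds the gate
        }
        finally
        {
            // Always release, even if the protected section throws.
            _gate.Release();
        }
    }

    // Runs the given number of handlers concurrently and returns how many completed.
    public async Task<int> RunAsync(int messages)
    {
        await Task.WhenAll(Enumerable.Range(0, messages).Select(_ => HandleMessageAsync()));
        return _processed;
    }
}
```

Calling `await new AsyncGateSketch().RunAsync(8)` runs eight handlers concurrently; none of them holds a thread while it waits for the gate.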

answered on Stack Overflow Feb 2, 2021 by Montoya

User contributions licensed under CC BY-SA 3.0