Ubuntu server: weird latency jumps on the LAN


We replaced our aging firewall with this server, running Ubuntu 16.04.

It does (almost) nothing other than running iptables with about 900 rules (filter & nat combined).

The aging server it replaced worked fine and there were no issues whatsoever.

Every once in a while (anywhere from once an hour to every 30 seconds), the latency between the new firewall and any other host on the LAN jumps from 0.1-0.2 ms to 10, 40, 100, or even 3000 ms for a few seconds (sometimes it even lasts minutes). I first noticed it as lag on an SSH connection to a host in the DMZ, where there shouldn't be any lag, and then confirmed it with simple continuous, high-rate (-i 0.1) ping tests to various hosts.
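For anyone who wants to log these spikes instead of watching the terminal, here is a minimal sketch that scans saved ping output for replies above a threshold. It assumes the usual Linux ping reply format (`icmp_seq=N ... time=X ms`). The function name, the 10 ms threshold, and the sample lines are made up for illustration and are not taken from the actual tests.

```python
import re

# Matches the sequence number and the "time=0.123 ms" field of a Linux ping reply.
RTT_PATTERN = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def find_latency_spikes(ping_output, threshold_ms=10.0):
    """Return (icmp_seq, rtt_ms) pairs whose round-trip time exceeds threshold_ms."""
    spikes = []
    for line in ping_output.splitlines():
        match = RTT_PATTERN.search(line)
        if match is None:
            continue  # header, summary, or timeout line
        seq, rtt = int(match.group(1)), float(match.group(2))
        if rtt > threshold_ms:
            spikes.append((seq, rtt))
    return spikes


# Hypothetical sample output, not real measurements.
sample = """\
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.152 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=41.3 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=0.118 ms
64 bytes from 10.0.0.5: icmp_seq=4 ttl=64 time=3012 ms
"""

print(find_latency_spikes(sample))
```

Feeding it the output of one ping per target host makes it easier to see whether the spikes line up across hosts.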

I tested this on both the 10 Gbps interface and one of the 1 Gbps interfaces. The server is nowhere near its network limits (~10 Kpps, 100-400 Mbps up and down combined). The CPU is 99% idle.

During one of the longer "outages" I connected to the firewall from the internet to debug it, and I noticed that none of the other interfaces had any latency issues.

To take the switch out of the equation, I moved the 1 Gbps interface to a different switch outside our stack and added another server to the new switch to test against. The problem still persists: I run a constant ping to multiple machines, and they all go up to 2-3 seconds every once in a while, including the one on the "immediate" switch.

dmesg shows nothing, ifconfig shows no errors, /proc/interrupts shows that all cores participate in handling the nic(s) (although I am pretty sure that for such a low throughput even 1 core would suffice...)
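As a rough way to double-check the interrupt distribution, the sketch below sums the per-CPU counts in /proc/interrupts text for every IRQ line that mentions a given interface. It assumes the standard layout (a header row of CPU names, then one row per IRQ with one count per CPU). The function name and the sample rows are hypothetical.

```python
def nic_interrupts_per_cpu(interrupts_text, nic_name):
    """Sum per-CPU interrupt counts over all IRQ lines that mention nic_name."""
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        if nic_name not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ number ("24:"), followed by one count per CPU.
        for index, count in enumerate(fields[1:1 + len(cpus)]):
            totals[index] += int(count)
    return dict(zip(cpus, totals))


# Hypothetical two-CPU sample, not taken from the real firewall.
sample = """\
           CPU0       CPU1
 24:        100        300   PCI-MSI 524288-edge      eno1-TxRx-0
 25:         50         50   PCI-MSI 524289-edge      eno1-TxRx-1
 26:          7          9   PCI-MSI 1048576-edge     eno2
"""

print(nic_interrupts_per_cpu(sample, "eno1"))
```

On the real machine the text would come from reading /proc/interrupts twice a few seconds apart and comparing the totals, since the raw counters only grow.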

Any suggestions or ideas how to debug such a scenario would be appreciated.

Thanks!

EDIT: Adding ethtool output:

Settings for eno1:

Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: on (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
               drv probe link
Link detected: yes

EDIT 2: Maybe it's irrelevant but I did see this in one of the (really long) outages:

%Cpu(s):  0.1 us,  3.3 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
KiB Mem : 16326972 total, 14633008 free,   296636 used,  1397328 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 15540780 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29163 root      20   0       0      0      0 S   8.0  0.0  14:08.45 kworker/4:0
31722 root      20   0       0      0      0 S   7.3  0.0   9:39.76 kworker/6:0
11677 root      20   0       0      0      0 S   5.6  0.0   0:04.65 kworker/3:1
  149 root      20   0       0      0      0 S   4.0  0.0  27:21.36 kworker/2:1
   46 root      20   0       0      0      0 S   0.3  0.0   0:06.93 ksoftirqd/6

That is unusually high kworker CPU usage (normally it's around 1%). Any ideas?

linux-networking
ubuntu-16.04
asked on Server Fault Jan 13, 2017 by GibsonLP • edited Jan 13, 2017 by GibsonLP

1 Answer


I have had a similar situation and this link helped us to solve our issues!

Essentially, you probably need to set the maximum TCP socket receive buffer size to between 2 and 4 MB, or maybe even smaller if that doesn't affect your service, since you are seeing so many large spikes.
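To see where a machine currently stands, here is a small sketch that reads the receive buffer limits and checks the maximum against that 2-4 MB range. It assumes the setting in question is the third field of net.ipv4.tcp_rmem (min, default, max, in bytes), which is my reading and may not be the exact knob the original fix used. The function names are made up for illustration.

```python
from pathlib import Path

# Assumption: the "max receive buffer" is the third field of net.ipv4.tcp_rmem.
TCP_RMEM_PATH = Path("/proc/sys/net/ipv4/tcp_rmem")
MIB = 1024 * 1024


def parse_tcp_rmem(text):
    """Split the 'min default max' triple (bytes) into a dict."""
    minimum, default, maximum = (int(value) for value in text.split())
    return {"min": minimum, "default": default, "max": maximum}


def max_within_range(text, low_mb=2, high_mb=4):
    """True if the maximum receive buffer lies between low_mb and high_mb MiB."""
    max_bytes = parse_tcp_rmem(text)["max"]
    return low_mb * MIB <= max_bytes <= high_mb * MIB


if TCP_RMEM_PATH.exists():
    current = TCP_RMEM_PATH.read_text()
    print(parse_tcp_rmem(current), "within 2-4 MiB:", max_within_range(current))
```

This only reads the value. Changing it is done with sysctl as root, and it is worth testing the effect on real traffic before making it permanent.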

To compare the issues:

  • Lots of healthy traffic with seemingly random, massive lag spikes that could last for a long period of time.
  • You've confirmed the issue is with your new firewall.
  • All of your data from testing is telling you there is no problem.
  • It is a very occasional, seemingly random delay between the packet being received by the OS and processed.

Hope this is helpful!

answered on Server Fault Jan 13, 2017 by MildCorma • edited Jan 13, 2017 by MildCorma

User contributions licensed under CC BY-SA 3.0