We sometimes see 90%+ packet loss on our server, but it does not always happen. Right now it works perfectly, but just half an hour ago it had exactly that problem.
Our service provider is telling us to boot into a recovery system to test whether this is really a hardware problem and not software on our side. However, I don't see anything that could cause packet loss on our side, especially since it is not consistent.
Is there anything we could check before doing another test in the recovery system?
We have a dedicated server at Hetzner.de, connected to 100 Mbit Ethernet. We have not tried to change anything on the hardware side, because our server provider wants us to check our software before continuing to check the hardware.
Here are the mtr reports I made. During the reports we had 3 bursts of packet loss; the rest of the time the server was reachable:
Client to server
HOST: mbp Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.0.1.1 0.0% 1000 0.4 0.2 0.2 3.4 0.2
2.|-- 10.0.1.1 0.3% 1000 27.5 29.7 5.9 237.3 34.6
3.|-- 10.170.172.121 0.4% 1000 17.2 41.9 7.2 334.1 44.2
4.|-- 216.113.123.158 1.4% 1000 44.4 58.6 10.6 299.6 49.2
5.|-- 216.113.123.194 1.1% 1000 36.6 72.9 19.4 330.7 48.1
6.|-- paix-nyc.init7.net 0.7% 1000 57.1 75.8 18.4 313.8 49.1
7.|-- r1lon1.core.init7.net 1.4% 1000 199.8 150.9 87.1 373.7 56.4
8.|-- r1fra1.core.init7.net 0.6% 1000 244.2 150.1 98.6 438.6 53.6
9.|-- gw-hetzner.init7.net 1.4% 1000 175.3 140.6 100.5 397.2 49.7
10.|-- hos-bb2.juniper2.rz16.het 39.0% 1000 120.0 136.7 103.5 362.6 44.3
11.|-- hos-tr4.ex3k13.rz16.hetzn 0.8% 1000 145.4 132.2 106.8 393.3 36.9
12.|-- static.98.43.9.5.clients. 39.8% 1000 116.0 131.5 106.1 371.8 34.4
Server to client
HOST: thetransitapp Loss% Snt Last Avg Best Wrst StDev
1. static.97.43.9.5.clients.you 29.0% 1000 7.2 7.4 0.9 24.9 1.9
2. hos-tr1.juniper1.rz16.hetzne 38.7% 1000 6.1 9.6 0.2 78.8 7.6
3. hos-bb2.juniper4.ffm.hetzner 36.2% 1000 11.8 11.4 5.8 29.0 1.5
4. r1fra1.core.init7.net 38.1% 1000 12.4 13.9 5.5 22.9 3.9
5. r1lon1.core.init7.net 36.3% 1000 23.5 26.5 17.6 37.6 4.4
6. r1nyc1.core.init7.net 35.5% 1000 92.3 93.8 86.1 103.0 3.7
7. paix-ny.ia-unyc-bb05.vtl.net 35.5% 1000 95.5 96.4 87.6 134.7 5.3
8. 216.113.123.169 36.3% 1000 101.5 102.0 94.4 124.9 3.6
9. 216.113.124.42 34.7% 1000 113.1 107.7 96.7 117.6 3.6
10. 216.113.123.157 37.5% 999 106.5 107.4 101.5 115.0 1.5
11. ??? 100.0 999 0.0 0.0 0.0 0.0 0.0
12. modemcable004.103-176-173.mc 36.7% 999 111.2 147.9 107.2 342.0 48.3
Here is the ethernet configuration
Settings for eth0:
Supported ports: [ TP MII ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Link partner advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000033 (51)
Link detected: yes
ifconfig of eth0:
eth0 Link encap:Ethernet HWaddr c8:60:00:bd:2f:9d
inet addr:5.9.43.98 Bcast:5.9.43.127 Mask:255.255.255.224
inet6 addr: fe80::ca60:ff:febd:2f9d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3521 errors:0 dropped:0 overruns:0 frame:0
TX packets:2117 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2882770 (2.7 MiB) TX bytes:910907 (889.5 KiB)
Interrupt:30 Base address:0x8000
In my opinion it's Hetzner's fault. I've been arguing with them for a very long time about a similar case.
We had those problems and kept reporting them to the hosting company. The answer was always the same, "Please attach mtr in both directions", even during the fault. So we wrote a script that logs an mtr report each time we see any packet loss between servers:
    #!/bin/sh
    # Log an mtr report whenever ping to the target shows packet loss.
    if [ -z "$1" ]; then
        echo "Give target host"
    else
        host="$1"
        while true; do
            loss=$(ping -c 10 "$host" | grep 'packet loss' | awk '{print $6}' | sed 's/%//g')
            if [ "${loss:-0}" -ge 1 ]; then
                date >> /root/scripts/loss_measure_mtr.log
                mtr -s 1500 -r -c 1000 -i 0.1 "$host" >> /root/scripts/loss_measure_mtr.log
            fi
        done
    fi
Then with this information they answered:
    At this time there was an incoming attack in the subnet. In this case it is possible that packet loss occurs at servers in the same subnet.
    Best Regards
    Michael Straetz
    Hetzner Online AG Support
    90431 Nürnberg / Germany
    Tel: +49 (911) 234 226 54
    Fax: +49 (911) 234 226 8 977
    http://www.hetzner.de
What exactly is happening? I don't know, but it looks almost the same:
Sun Aug 12 01:13:20 CEST 2012
HOST: app                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.                               94.1%  1000    0.2   0.2   0.1   0.4   0.1
  2. static.1.24.24.46.clients.you  0.0%  1000    3.0   1.9   0.7  19.4   1.5
  3. hos-tr4.juniper2.rz13.hetzne   9.4%  1000    0.6   1.9   0.4 133.2   8.0
  4. hos-bb2.juniper1.rz1.hetzner   5.4%  1000   38.6   7.1   3.0 112.9  11.5
  5. hos-tr1.ex3k3.rz1.hetzner.de  10.9%  1000    4.4   5.1   3.6  23.6   1.8
  6. static.88-128-24-108.clients  15.5%  1000    3.6   3.5   3.4   4.6   0.1
HOST: app                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.                               94.5%  1000    0.2   0.2   0.1   0.6   0.1
  2. static.1.24.24.46.clients.you  0.0%  1000    1.2   1.9   0.7  19.3   1.6
  3. hos-tr4.juniper2.rz13.hetzne   9.3%  1000    0.6   1.8   0.4 136.8   7.9
  4. hos-bb2.juniper1.rz1.hetzner   2.7%  1000    3.3   7.0   3.0 113.1  11.5
  5. hos-tr1.ex3k3.rz1.hetzner.de   8.5%  1000    7.0   5.1   3.6  26.8   2.0
  6. static.88-128-24-108.clients  12.8%  1000    3.6   3.5   3.3   4.5   0.1
I have tens of mtr reports like this.
In my opinion it's their infrastructure problem. Notice that the loss is occurring on their own nodes: hos-tr1.ex3k3.rz1.hetzner.de, hos-tr4.juniper2.rz13.hetzner.de and so on.
If they don't fix it, I'll probably migrate to Linode or Amazon.
This isn't an answer but it's too long for a comment, hence my posting it as an answer.
I don't completely agree with the assessments made in the existing answer and some of the comments on this question.
The problem with using any tool that relies on ICMP, like ping and traceroute (and mtr, if I'm understanding how it works), is that such a tool tests how each hop in the path responds to ICMP traffic sent TO that hop. That is not a true test of the quality of the path THROUGH each hop; it does not measure the transmission of "real" traffic THROUGH the path. Each hop may give your ICMP-based probes low priority, or drop them altogether, hence the variation in results from one hop to the next.
If you had a true problem at hop 10 (in your first trace), that problem would carry through, and be cumulative, at each successive hop. As your trace shows, hop 10 reports 39% packet loss but hop 11 reports almost none. If hop 10 were really dropping "real" traffic, the problem would manifest at hop 11 as well; in fact, hop 11 would probably show more loss (the cumulative effect of the loss at hop 10, the loss on the link between hops 10 and 11, and the loss at hop 11).
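By that logic, the only loss figure that reflects real end-to-end quality is the one at the final hop; a spike at an intermediate hop that vanishes at the next hop is almost certainly ICMP rate-limiting on that router. A minimal sketch of that check, using hop numbers and loss figures transcribed from the first trace above (the 5% threshold is an arbitrary choice for illustration):

```shell
#!/bin/sh
# Per-hop loss transcribed from the client-to-server trace above
# (hop number, loss %) -- just the three interesting hops.
hops='10 39.0
11 0.8
12 39.8'

# The last hop is the destination itself, so its loss figure is the
# only one that reflects real end-to-end loss.
final=$(printf '%s\n' "$hops" | awk 'END { print $2 }')
echo "end-to-end loss: ${final}%"

# An intermediate hop whose loss disappears at the next hop is
# deprioritising probes, not dropping transit traffic.
flagged=$(printf '%s\n' "$hops" | awk '
    NR > 1 && prev - $2 > 5 {
        print "hop " prevhop " (" prev "% loss) is likely rate-limiting probes"
    }
    { prevhop = $1; prev = $2 }')
echo "$flagged"
```

Run against the trace above, this flags hop 10 while still reporting the 39.8% end-to-end figure, which is the number that actually matters.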
What you should be doing is testing with a tool that sends real traffic from one end to the other, such as iperf.
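For example (a sketch, assuming iperf 2 is installed on both ends and using 5.9.43.98, the server address from the question): run `iperf -s -u` on the server, then `iperf -c 5.9.43.98 -u -b 50M` from the client. The UDP summary reports lost/total datagrams, which is a direct measure of real end-to-end loss. That summary line can be picked apart like this (the figures in the sample line are made up for illustration):

```shell
#!/bin/sh
# Parse the loss figures out of an iperf 2 UDP client report line.
# Sample line in the format iperf prints (numbers are hypothetical):
line='[  3]  0.0-10.0 sec   120 MBytes   101 Mbits/sec   0.045 ms  357/85470 (0.42%)'

# Extract the lost-datagram count and the loss percentage in parentheses.
lost=$(printf '%s\n' "$line" | sed -n 's|.* \([0-9][0-9]*\)/[0-9][0-9]* .*|\1|p')
pct=$(printf '%s\n' "$line" | sed -n 's|.*(\([0-9.][0-9.]*\)%).*|\1|p')
echo "lost datagrams: $lost, loss: ${pct}%"
```

Unlike the per-hop mtr figures, this number cannot be skewed by routers deprioritising probes, because every datagram counted is real traffic that traversed the whole path.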