Debugging unreliable IPv6 connection

On our VPS we are facing IPv6 connection issues and hope someone can help debug them.

Pings fail at first and succeed later:

2020-06-01 23:20:55 <user>@<host>:~# ping -6 google.com
PING google.com(ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e)) 56 data bytes
From <host>.com (<ip>) icmp_seq=1 Destination unreachable: Address unreachable
...
From <host>.com (<ip>) icmp_seq=6 Destination unreachable: Address unreachable
64 bytes from ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e): icmp_seq=7 ttl=54 time=14.0 ms
...
64 bytes from ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e): icmp_seq=13 ttl=54 time=12.1 ms
--- google.com ping statistics ---
13 packets transmitted, 7 received, +6 errors, 46% packet loss, time 12174ms
rtt min/avg/max/mdev = 12.151/12.683/14.069/0.767 ms

As can be seen, DNS resolution succeeds immediately, so that is not the problem. The first outgoing pings return an error; from the 7th on, they succeed. How long it takes before the first ping succeeds varies.

curl switches to IPv4 immediately:

2020-06-01 23:21:16 <user>@<host>:~# curl -vIL google.com
* Rebuilt URL to: google.com/
*   Trying 2a00:1450:400e:807::200e...
* TCP_NODELAY set
*   Trying 172.217.17.142...
* TCP_NODELAY set
* Connected to google.com (172.217.17.142) port 80 (#0)
...

wget tries a bit longer to connect; sometimes it succeeds, and sometimes it fails and falls back to IPv4 as well:

2020-06-02 00:49:11 <user>@<host>:~# wget --spider google.com
Spider mode enabled. Check if remote file exists.
--2020-06-02 00:51:01--  http://google.com/
Resolving google.com (google.com)... 2a00:1450:400e:807::200e, 172.217.17.142
Connecting to google.com (google.com)|2a00:1450:400e:807::200e|:80... failed: No route to host.
Connecting to google.com (google.com)|172.217.17.142|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/ [following]
Spider mode enabled. Check if remote file exists.
--2020-06-02 00:51:20--  http://www.google.com/
Resolving www.google.com (www.google.com)... 2a00:1450:400e:804::2004, 172.217.17.36
Connecting to www.google.com (www.google.com)|2a00:1450:400e:804::2004|:80... failed: No route to host.
Connecting to www.google.com (www.google.com)|172.217.17.36|:80... connected.
HTTP request sent, awaiting response... 200 OK

This happens regardless of host/IP, by the way. The default route is there, and the interface has a link-local address and a global IPv6 address assigned via DHCPv6:

2020-06-02 00:58:25 <user>@<host>:~# ip -6 r
::1 dev lo proto kernel metric 256 pref medium
::/64 dev eth0 proto kernel metric 256 expires 2590394sec pref medium
<ipv6> dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via <gateway> dev eth0 proto ra metric 1024 expires 194sec pref medium

2020-06-02 00:58:56 <user>@<host>:~# ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 <ipv6>/128 scope global
       valid_lft forever preferred_lft forever
    inet6 <LLA>/64 scope link
       valid_lft forever preferred_lft forever

IPv4 connections always succeed immediately.

rdisc6 output:

2020-06-02 13:10:36 <user>@<host>:~# rdisc6 eth0
Soliciting ff02::2 (ff02::2) on eth0...

Hop limit                 :    undefined (      0x00)
Stateful address conf.    :          Yes
Stateful other conf.      :           No
Mobile home agent         :           No
Router preference         :       medium
Neighbor discovery proxy  :           No
Router lifetime           :         1800 (0x00000708) seconds
Reachable time            :  unspecified (0x00000000)
Retransmit time           :  unspecified (0x00000000)
 Source link-layer address: <MAC>
 Prefix                   : ::/64
  On-link                 :          Yes
  Autonomous address conf.:           No
  Valid time              :      2592000 (0x00278d00) seconds
  Pref. time              :       604800 (0x00093a80) seconds
 from fe80::<ipv6>

traceroute6 (this sometimes fails with 30 empty lines):

2020-06-02 13:14:18 <user>@<host>:~# traceroute6 google.com
traceroute to google.com (2a00:1450:400e:807::200e) from <ipv6>::142, port 33434, from port 54573, 30 hops max, 60 bytes packets
 1  * * <ipv6>::1 (<ipv6>::1)  2055.792 ms
 2  * 2a06:7f80::1 (2a06:7f80::1)  2055.700 ms  1.262 ms
 3  ipv6.decix-dusseldorf.core1.dus1.he.net (2001:7f8:9e::1b1b:0:1)  2058.316 ms  2.655 ms  2.810 ms
 4  100ge5-2.core1.ams1.he.net (2001:470:0:371::1)  4.658 ms  3.804 ms  3.865 ms
 5  de-cix.fra.google.com (2001:7f8::3b41:0:1)  4.731 ms  12.465 ms  9.900 ms
 6  2001:4860:0:11e1::e (2001:4860:0:11e1::e)  14.691 ms  10.691 ms  10.654 ms
 7  2001:4860:0:1::1c7f (2001:4860:0:1::1c7f)  12.320 ms  11.433 ms  11.476 ms
 8  2001:4860::c:4000:d9a9 (2001:4860::c:4000:d9a9)  15.681 ms  16.138 ms  14.906 ms
 9  ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e)  15.327 ms  12.979 ms  12.162 ms

ip monitor / ip mon route shows that the default gateway is not reliably reachable, and that the default route is regularly deleted after expiring and not always recreated shortly afterwards. This is the output collected over a few hours:

fe80::<ipv6_1> dev eth0 lladdr <mac_1> PROBE
fe80::<ipv6_1> dev eth0 lladdr <mac_1> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_1> dev eth0 lladdr <mac_1> STALE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_4> dev eth0 lladdr <mac_4> PROBE
fe80::<ipv6_4> dev eth0 lladdr <mac_4> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_4> dev eth0 lladdr <mac_4> STALE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
Deleted default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 expires -4sec pref medium
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
Deleted default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 expires -11sec pref medium
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
Deleted default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 expires -3sec pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
<ipv4_1> dev eth0 lladdr <mac_1> PROBE
<ipv4_1> dev eth0 lladdr <mac_1> REACHABLE
<ipv4_1> dev eth0 lladdr <mac_1> STALE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
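
A long ip monitor log like the one above is easier to judge when the neighbour states are counted per address. The following is only a minimal sketch: the function name tally_neighbor_states and the regular expression are assumptions made for illustration, and the sample lines are copied from the log above. It counts any neighbour entry it finds (including IPv4 ARP entries) and skips route and prefix events.

```python
import re
from collections import Counter, defaultdict

# Neighbour-cache events in the log look like:
#   fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
#   fe80::<ipv6_2> dev eth0  router FAILED
NEIGH_RE = re.compile(
    r"^(?P<addr>\S+) dev (?P<dev>\S+)\s+(?:lladdr \S+\s+)?(?:router\s+)?"
    r"(?P<state>REACHABLE|STALE|DELAY|PROBE|FAILED|INCOMPLETE)$"
)


def tally_neighbor_states(lines):
    """Count neighbour states per address; route and prefix events are skipped."""
    counts = defaultdict(Counter)
    for line in lines:
        match = NEIGH_RE.match(line.strip())
        if match:
            counts[match.group("addr")][match.group("state")] += 1
    return counts


# A few lines taken from the ip monitor output above.
sample = """\
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
""".splitlines()

for addr, states in tally_neighbor_states(sample).items():
    print(addr, dict(states))
```

A high FAILED count for the router address compared with the other neighbours would point at the gateway specifically rather than at the link as a whole.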

Narrowing down the issue

The following shows that the router does not always send router advertisements often enough, so the default gateway entry expires after 1800 seconds. Note the timestamp of the last PS1 prompt, printed when tcpdump was interrupted:

2020-06-03 12:26:31 <user>@<host>:/var/log# tcpdump -n -i eth0 icmp6 and ip6[40] == 134
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:45:41.290680 IP6 fe80::XXX > ff02::1: ICMP6, router advertisement, length 56
14:11:10.133781 IP6 fe80::XXX > ff02::1: ICMP6, router advertisement, length 56
^C
2 packets captured
5 packets received by filter
0 packets dropped by kernel
2020-06-03 14:58:07 <user>@<host>:/var/log#

The first two RAs were close enough together to keep the default route (although the second arrived only about 4 minutes before expiry). The third RA took too long to arrive, so the default route was lost and IPv6 connections were no longer possible.
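
The timing argument can be checked with a few lines of arithmetic. This is a rough sketch, not a monitoring tool: the helper names parse_time and route_expiry_margins are made up for illustration, the timestamps are the ones from the tcpdump session above, and all of them are assumed to fall on the same day.

```python
from datetime import datetime, timedelta

ROUTER_LIFETIME = timedelta(seconds=1800)  # "Router lifetime" from the rdisc6 output


def parse_time(text):
    """Parse an HH:MM:SS[.ffffff] timestamp (same day assumed)."""
    fmt = "%H:%M:%S.%f" if "." in text else "%H:%M:%S"
    return datetime.strptime(text, fmt)


def route_expiry_margins(ra_times, observed_until, lifetime=ROUTER_LIFETIME):
    """Seconds left on the default route at each following RA and at the end of
    the capture; a negative value means the route expired before that point."""
    points = [parse_time(t) for t in ra_times] + [parse_time(observed_until)]
    return [
        (lifetime - (later - earlier)).total_seconds()
        for earlier, later in zip(points, points[1:])
    ]


margins = route_expiry_margins(
    ["13:45:41.290680", "14:11:10.133781"],  # RAs seen by tcpdump
    "14:58:07",  # prompt timestamp after interrupting tcpdump
)
for margin in margins:
    status = "route kept" if margin > 0 else "route expired"
    print(f"{margin:8.0f} s margin: {status}")
```

With these inputs the first margin is roughly 271 seconds (about four and a half minutes) and the second is negative, which matches the observation that the route was gone by the time tcpdump was stopped.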

Meanwhile I can see lots of neighbor solicitations from the router, so its ICMPv6 requests do arrive:

2020-06-03 14:56:03 <user>@<host>:/var/log# tcpdump icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:03:07.750318 IP6 fe80::XXX > ff02::YYY: ICMP6, neighbor solicitation, who has 2a06:ZZZ, length 32
15:03:08.356100 IP6 fe80::XXX > ff02::YYY: ICMP6, neighbor solicitation, who has 2a06:ZZZ, length 32

But currently no RAs arrive, not even when I try to solicit them:

2020-06-03 15:03:21 <user>@<host>:/var/log# rdisc6 eth0
Soliciting ff02::2 (ff02::2) on eth0...
Timed out.
Timed out.
Timed out.
No response.

This matches the ip monitor output above, where probing the router often simply fails. However, since I see the neighbor solicitations from the router, I guess it could answer me but for some reason does not, or it ignores my solicitations?

I am able to manually restore the default route permanently via:

ip -6 r add default dev eth0 via fe80::<ipv6>

While IPv6 connections are again possible with this, they usually still have a long delay or time out completely.

debian
routing
ping
ipv6
vps
asked on Super User Jun 2, 2020 by MichaIng • edited Jun 6, 2020 by MichaIng

1 Answer

Note 1: You're only using DHCPv6 to obtain an address; it is not used for the default route. The default route still comes from ICMPv6 "Router Advertisement" packets, the same mechanism SLAAC relies on.

Note 2: ip monitor shows several different kinds of events intermixed: addresses, routes, and neighbor cache entries. You can run ip mon route, ip mon neigh to see them separately.

I would guess that there is a problem in between your VPS and your nearest gateway, because:

  1. The neighbour entry for your default gateway (the IPv6 equivalent of an ARP cache entry) does not reliably reach REACHABLE state; it keeps going into FAILED state, meaning your host sent several Neighbor Solicitations (the equivalent of ARP queries) to renew the cache entry but didn't receive any response.

    Neighbor discovery, just like ARP for IPv4, is the absolute bare minimum for a functioning IPv6 network.

  2. Expiry for the default route ::/0 is reset according to "Router lifetime" every time a Router Advertisement is received. In your case, the advertised lifetime is 1800 seconds, so the router should repeat the advertisement at least every 900 seconds so that the default route never goes below half its lifetime.

    But as you can see from the ip -6 route output, your ::/0 route was only 194 seconds from expiry. This means either that the router's timers are misconfigured, or that its multicast RAs are just not reaching you for whatever reason. As a result, you keep losing the default route.
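
The check described in point 2 could be scripted roughly as follows. This is a sketch under assumptions: the function names default_route_remaining and is_refresh_overdue are hypothetical, the input is ip -6 route text in the format shown in the question, and the half-lifetime threshold is the rule of thumb from point 2 rather than a protocol requirement.

```python
import re

EXPIRES_RE = re.compile(r"\bexpires (-?\d+)sec\b")


def default_route_remaining(route_output):
    """Return the remaining lifetime in seconds of the first RA-learned default
    route, or None if no default route with an expiry is listed."""
    for line in route_output.splitlines():
        if line.startswith("default ") and " proto ra " in line:
            match = EXPIRES_RE.search(line)
            if match:
                return int(match.group(1))
    return None


def is_refresh_overdue(remaining, lifetime=1800):
    """True if the route has dropped below half of the advertised lifetime."""
    return remaining < lifetime / 2


# Lines copied from the ip -6 r output in the question.
route_output = """\
::/64 dev eth0 proto kernel metric 256 expires 2590394sec pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via <gateway> dev eth0 proto ra metric 1024 expires 194sec pref medium
"""

remaining = default_route_remaining(route_output)
print(remaining, is_refresh_overdue(remaining))
```

Run periodically against live ip -6 route output, a check like this would show how often the route gets close to expiry before an RA refreshes it.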

There's one thing common to both of the above issues: Neighbor Discovery and Router Advertisements both use ICMPv6 multicast, so check very carefully whether your firewall is imposing strict rate limits on incoming Router Advertisements or Neighbor Advertisements, or on multicast packets in general.

(You can use tcpdump to check whether you're receiving packets; e.g. if an RA shows up in tcpdump but fails to renew the default route, then it may be your firewall's problem.)

answered on Super User Jun 2, 2020 by user1686 • edited Jun 3, 2020 by user1686

User contributions licensed under CC BY-SA 3.0