UPDATE: The problem was faulty hardware on the switch. Thanks to all of you for the good debugging suggestions. Correct answer given to MattyB for suggesting using a different switch to see if the problem persisted.
Hello serverfault,
I am attempting to debug an issue on several nodes that are repeatedly detecting link loss for 1-2 minutes at a time, when there should be no link loss.
Servers:
- HP DL360 G5
- 1 on-board 2-port Broadcom NetXtreme II BCM5708 Gigabit Ethernet (rev 12) (using bnx2 driver)
- 1 4-port Intel 82571EB Gigabit Ethernet Controller (Copper) (rev 06) (using e1000e driver)
Facts:
- On all nodes, both Broadcom ports and one Intel port are connected to the same switch.
- UPDATE: Link loss is detected on ports on both NICs, Broadcom and Intel
- All ports run at 1 Gb/s, except the Intel ports on two of the nodes, which run at 100 Mb/s. All speeds were set by auto-negotiation.
- All nodes were recently upgraded from RHEL 5.0 to RHEL 5.3.
I am currently attempting to get access to the switch to force 1 Gb/s full-duplex links. Is there anything else that could be done to diagnose or fix this issue? What further information would be useful?
EDIT: I've run tcpdump on one of the affected interfaces, and all I can see are LLDP packets, and a single IGMP Group Membership Query. I have also set the switch to force all ports to 1000Mbps links, full duplex. Does this indicate that the problem is internal to the node, and not caused by any settings on the switch?
====== Log messages ======
Oct 29 11:30:36 db1 kernel: bnx2: eth1 NIC Copper Link is Down
Oct 29 11:30:37 db1 kernel: bnx2: eth0 NIC Copper Link is Down
Oct 29 11:30:39 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Oct 29 11:30:39 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 29 11:31:08 db1 kernel: bnx2: eth0 NIC Copper Link is Down
Oct 29 11:31:10 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Oct 29 12:56:41 db1 kernel: bnx2: eth1 NIC Copper Link is Down
Oct 29 12:56:41 db1 kernel: bnx2: eth0 NIC Copper Link is Down
Oct 29 12:58:34 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 29 12:58:34 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 29 12:59:02 db1 kernel: bnx2: eth1 NIC Copper Link is Down
Oct 29 12:59:03 db1 kernel: bnx2: eth0 NIC Copper Link is Down
Oct 29 12:59:05 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Oct 29 12:59:05 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 29 12:59:34 db1 kernel: bnx2: eth0 NIC Copper Link is Down
Oct 29 12:59:35 db1 kernel: bnx2: eth1 NIC Copper Link is Down
Oct 29 12:59:37 db1 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
====== ethtool output for all connected interfaces on one node ======
[root@db1 ~]# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: g
Link detected: yes
[root@db1 ~]# ethtool eth1
Settings for eth1:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: g
Link detected: yes
[root@db1 ~]# ethtool eth2
Settings for eth2:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbag
Wake-on: d
Current message level: 0x00000001 (1)
Link detected: yes
This is odd. Since you are seeing link loss on both NICs, I would rule out a NIC-specific firmware issue, a kernel driver issue, or faulty NIC hardware (the motherboard excepted), although the logs you have posted only show bnx2. Have you verified that other machines with the same hardware profile connected to this same switch are not exhibiting the same problem? You should try hard-coding both the NICs and the switch ports to 100 Mbit/full duplex and, silly as it sounds, check for faulty cabling. Finally, if resources permit, why not try hooking that machine up to a third-party switch (a Netgear or something equally innocuous)?
If multiple nodes are experiencing link loss simultaneously, I would go as far as to say that you may have a spanning tree error that is consistently causing your switch to fail and reconverge. Any more information about the topology would help diagnose the issue.
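One way to check whether the flaps really coincide is to turn each node's kernel log into a list of outages and compare them side by side. The following is a minimal sketch, assuming the bnx2 message format shown in the question. `link_outages` is a hypothetical helper name, other drivers word their messages differently, and the arithmetic assumes an outage does not span midnight.

```shell
#!/bin/sh
# link_outages: read syslog lines (files as arguments, or stdin) and print
# "interface down-time seconds-down" for each Down -> Up pair.
# Assumes lines like:
#   Oct 29 12:56:41 db1 kernel: bnx2: eth1 NIC Copper Link is Down
link_outages() {
    awk '
    /NIC Copper Link is (Down|Up)/ {
        split($3, t, ":")                      # $3 is HH:MM:SS
        now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
        ifc = $7                               # interface name, such as eth0
        if ($0 ~ /Link is Down/) {
            down_at[ifc] = now
            stamp[ifc] = $3
        } else if (ifc in down_at) {
            print ifc, stamp[ifc], now - down_at[ifc]
            delete down_at[ifc]
        }
    }' "$@"
}

# Two lines taken from the log in the question
link_outages <<'EOF'
Oct 29 12:56:41 db1 kernel: bnx2: eth1 NIC Copper Link is Down
Oct 29 12:58:34 db1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
EOF
# prints: eth1 12:56:41 113
```

Running it over the kernel log on each node and comparing the down-times shows whether the nodes drop together, which would point at the switch rather than at the hosts.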
ethtool -K ethX tso off
Try this on the Broadcom NICs. It disables TCP segmentation offload (TSO), one of the offload features that usually cause a lot of grief.
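For reference, a minimal sketch of applying that setting to several interfaces at once. `tso_off` is a hypothetical wrapper name; uppercase `-K` changes offload settings and lowercase `-k` lists them. The change needs root and does not survive a reboot or driver reload.

```shell
#!/bin/sh
# tso_off: disable TCP segmentation offload on every interface named
# as an argument; stops at the first failure.
tso_off() {
    for ifc in "$@"; do
        ethtool -K "$ifc" tso off || return 1
    done
}

# Example (as root), for the two Broadcom ports in the question:
#   tso_off eth0 eth1
#   ethtool -k eth0     # list the offload settings to confirm the change
```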
You can also try forcing the ports to full or half duplex instead of relying on auto-negotiation.
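On the Linux side, speed and duplex can be pinned with `ethtool -s`. A minimal sketch follows, where `force_link` and `restore_autoneg` are hypothetical helper names. The switch port has to be configured to match, otherwise a duplex mismatch is likely. Since 1000BASE-T relies on auto-negotiation, forcing is mainly useful for a 100 Mb/s test.

```shell
#!/bin/sh
# force_link IFACE SPEED DUPLEX: turn off auto-negotiation and pin the link.
# SPEED is in Mb/s (10 or 100), DUPLEX is "full" or "half".
force_link() {
    ethtool -s "$1" speed "$2" duplex "$3" autoneg off
}

# restore_autoneg IFACE: go back to auto-negotiation
restore_autoneg() {
    ethtool -s "$1" autoneg on
}

# Example (as root):
#   force_link eth2 100 full
#   restore_autoneg eth2
```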
Are you running the latest NIC and server firmware on your machines? I had a few similar issues when running outdated NIC firmware on HP DL380 and DL360 systems.
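The driver and firmware versions currently in use can be read with `ethtool -i`. Here is a small sketch that prints them per interface; `nic_versions` is a hypothetical helper name, and it assumes the usual `driver:`, `version:` and `firmware-version:` lines in the output.

```shell
#!/bin/sh
# nic_versions: for each interface, print "iface driver version firmware".
nic_versions() {
    for ifc in "$@"; do
        ethtool -i "$ifc" | awk -F': ' -v ifc="$ifc" '
            $1 == "driver"           { drv = $2 }
            $1 == "version"          { ver = $2 }
            $1 == "firmware-version" { fw  = $2 }
            END { print ifc, drv, ver, fw }'
    done
}

# Example: nic_versions eth0 eth1 eth2
```

Comparing the output across nodes is a quick way to spot a machine that missed a firmware update.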
What does dmesg look like for the Intel NIC?
Can you get access to the switch logs? What make/model of switch is it?