Troubleshooting NIC Logs by NETDEV WATCHDOG

1

I have a CentOS 6.6 server (Linux 2.6.32-504) running on a brand new Dell PowerEdge R320. Just the other day, I began noticing problems in /var/log/messages related to one of my NICs.

Note: this is a server cluster that requires eth0 to be used in a crossover network to a redundant backup server, so we use eth1 to talk to the customer's network and the world; so it would be challenging to switch NICs for troubleshooting purposes. I'll also note I'm not seeing any issues on eth0, though.

I'd like to troubleshoot this as much as possible ~ is it a hardware failure, Linux kernel doesn't like the Dell hardware, etc.? I am not a career Linux administrator, so I'm kind of green in this regard.

Google searches seem to turn up more questions for me, than answers (and only seem to apply to other distros; but point out possibilities such as kernel/firmware issues. I humbly beg for this community's expertise and ideas!

The one thing I noticed was that the interface was coming back up as 100Mb/half (when it's really on a 1Gb/full).

I'll provide as much as I can think of, that's hopefully relevant.

Here is the snippet of /var/log/messages where I noticed the issue... does this mean anything to anyone?

Mar 22 16:10:07 ind1un043 kernel: ------------[ cut here ]------------
Mar 22 16:10:07 ind1un043 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)
Mar 22 16:10:07 ind1un043 kernel: Hardware name: PowerEdge R320
Mar 22 16:10:07 ind1un043 kernel: NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
Mar 22 16:10:07 ind1un043 kernel: Modules linked in: nls_utf8 hfsplus(U) mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase dell_rbu coretemp 8021q garp stp llc wctc4xxp(U) dahdi_transcode(U) wcb4xxp(U) wctdm(U) wcfxo(U) wctdm24xxp(U) wcte11xp(U) wct1xxp(U) wcte12xp(U) dahdi_voicebus(U) wct4xxp(U) oct612x(U) dahdi(U) crc_ccitt ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput iTCO_wdt iTCO_vendor_support microcode dcdbas sb_edac edac_core ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif ahci wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Mar 22 16:10:07 ind1un043 kernel: Pid: 9, comm: ksoftirqd/1 Not tainted 2.6.32-504.el6.x86_64 #1
Mar 22 16:10:07 ind1un043 kernel: Call Trace:
Mar 22 16:10:07 ind1un043 kernel: <IRQ>  [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81074ee6>] ? warn_slowpath_fmt+0x46/0x50
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8147df7b>] ? dev_watchdog+0x26b/0x280
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81060d7c>] ? scheduler_tick+0xcc/0x260
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8147dd10>] ? dev_watchdog+0x0/0x280
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81087db7>] ? run_timer_softirq+0x197/0x340
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff810b0275>] ? tick_dev_program_event+0x65/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff810b034a>] ? tick_program_event+0x2a/0x30
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d765>] ? irq_exit+0x85/0x90
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81533c0a>] ? smp_apic_timer_interrupt+0x4a/0x60
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Mar 22 16:10:07 ind1un043 kernel: <EOI>  [<ffffffff8107d4c5>] ? ksoftirqd+0xd5/0x110
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d3f0>] ? ksoftirqd+0x0/0x110
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Mar 22 16:10:07 ind1un043 kernel: ---[ end trace c154004f7af06fb3 ]---
Mar 22 16:10:07 ind1un043 kernel: tg3 0000:02:00.1: eth1: transmit timed out, resetting
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000000: 0x165f14e4, 0x00100406, 0x02000000, 0x00800010
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000010: 0xd90d000c, 0x00000000, 0xd90e000c, 0x00000000
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000020: 0xd90f000c, 0x00000000, 0x00000000, 0x04f71028
... (TOO MANY CHARACTERS TO POST ON SERVER FAULT, SO STRIPPING) ...
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00007020: 0x00000000, 0x00000000, 0x00000406, 0x10004000
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00007030: 0x000e0000, 0x00004af8, 0x00170030, 0x00000000
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0: Host status block [00000001:00000063:(0000:0695:0000):(0000:01b8)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0: NAPI info [00000063:00000063:(01a5:01b8:01ff):0000:(0095:0000:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 1: Host status block [00000001:000000ac:(0000:0000:0000):(083a:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 1: NAPI info [000000ac:000000ac:(0000:0000:01ff):083a:(003a:003a:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 2: Host status block [00000001:0000004f:(089e:0000:0000):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 2: NAPI info [00000046:00000046:(0000:0000:01ff):0895:(0095:0095:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 3: Host status block [00000001:00000012:(0000:0000:0000):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 3: NAPI info [00000012:00000012:(0000:0000:01ff):0fbc:(07bc:07bc:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 4: Host status block [00000001:000000d4:(0000:0000:0742):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 4: NAPI info [000000d4:000000d4:(0000:0000:01ff):0742:(0742:0742:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: tg3_stop_block timed out, ofs=1400 enable_bit=2
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: tg3_stop_block timed out, ofs=c00 enable_bit=2
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is down
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled

But before that, here is the snippet of /var/log/messages where the NIC initializes seemingly normally during a boot...

Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 14:18:77:32:77:3f
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: dma_rwctrl[00000001] dma_mask[64-bit]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 14:18:77:32:77:40
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: dma_rwctrl[00000001] dma_mask[64-bit]

Here is the snippet of /var/log/messages where the NIC comes up seemingly normally, but not sure in what context (it's just after the init immediately above this snippet)...

Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Link is up at 1000 Mbps, full duplex
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Flow control is on for TX and on for RX
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: EEE is disabled
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

And, here is the snippet of /var/log/messages where the NIC seems to soon-after have other issues, renegotiating or something. This happens randomly several times over the next few days...

Mar 21 16:01:42 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled
Mar 21 16:01:44 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

Here is my lspci | grep -i net...

02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
linux
centos6
dell-poweredge
nic
broadcom
asked on Server Fault Mar 25, 2016 by csr19us • edited Mar 25, 2016 by csr19us

1 Answer

0

Not sure if this is exactly an "answer" or not, but it did solve my issue, as far as this particular case is concerned. This seems to be related to ACPI.

I edited my /etc/grub.conf file to add the following to the end of the kernel line:

noapic acpi=off pci=noacpi

I'm now no longer seeing any errors in the log.

Could it be that power saving features were taking the interface down and bringing it back up in a slower mode? Or could it be some bug with the hardware/firmware and this kernel? Not sure ~ but this at least fixes my issue (it's a production server, whose network we never want to be down).

answered on Server Fault Mar 29, 2016 by csr19us

User contributions licensed under CC BY-SA 3.0