Page MenuHomePhabricator

Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s
Closed, ResolvedPublic1 Estimated Story Points

Description

analytics1034's negotiated speed is not its optimal one:

elukey@analytics1034:~$ sudo ethtool eth0
Settings for eth0:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Supported pause frame use: No
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full
	                                     1000baseT/Full
	Link partner advertised pause frame use: Symmetric
	Link partner advertised auto-negotiation: Yes

	Speed: 100Mb/s      <=====================

	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: off
	Supports Wake-on: g
	Wake-on: d
	Current message level: 0x000000ff (255)
			       drv probe link timer ifdown ifup rx_err tx_err
	Link detected: yes

elukey@analytics1034:~$ sudo mii-tool eth0 --log
eth0: negotiated 1000baseT-FD flow-control, link ok

I tried to run mii-tool -r eth0 last week and it worked, negotiated speed went up to 1000Mb/s but then it feel again down.

A interesting dmesg entry popped up weeks ago before seeing the eth0 flapping periodically:

[Sun Jul 16 00:57:09 2017] tg3 0000:01:00.0 eth0: transmit timed out, resetting
[Sun Jul 16 00:57:10 2017] tg3 0000:01:00.0 eth0: 0x00000000: 0x165f14e4, 0x00100406, 0x02000000, 0x00800010
[...]
[Sun Jul 16 00:57:14 2017] tg3 0000:01:00.0 eth0: 3: NAPI info [000000f8:000000f8:(0000:0000:01ff):0e46:(0646:0646:0000:0000)]
[Sun Jul 16 00:57:14 2017] tg3 0000:01:00.0 eth0: 4: Host status block [00000001:00000072:(0000:0000:051b):(0000:0000)]
[Sun Jul 16 00:57:14 2017] tg3 0000:01:00.0 eth0: 4: NAPI info [00000048:00000048:(0000:0000:01ff):04f1:(04f1:04f1:0000:0000)]
[Sun Jul 16 00:57:14 2017] TCP: request_sock_TCP: Possible SYN flooding on port 5666. Sending cookies.  Check SNMP counters.
[Sun Jul 16 00:57:14 2017] tg3 0000:01:00.0 eth0: Link is down
[Sun Jul 16 00:57:17 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Sun Jul 16 00:57:17 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Sun Jul 16 00:57:17 2017] tg3 0000:01:00.0 eth0: EEE is disabled

And then a lot of:

[Thu Aug  3 15:55:14 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Thu Aug  3 15:55:14 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Thu Aug  3 15:55:14 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[..]
[Sat Aug  5 23:34:41 2017] tg3 0000:01:00.0 eth0: Link is down
[Sat Aug  5 23:35:03 2017] tg3 0000:01:00.0 eth0: Link is up at 100 Mbps, full duplex
[Sat Aug  5 23:35:03 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Sat Aug  5 23:35:03 2017] tg3 0000:01:00.0 eth0: EEE is disabled

The host seems out of warranty from a month ago, so if this is a eth0 hw issue it might be a problem.

@Cmjohnson any advice? Have you already seen this issue by any chance?

Event Timeline

Mentioned in SAL (#wikimedia-analytics) [2017-08-06T11:02:59Z] <elukey> stop yarn on analytics1034 to reload the tg3 driver - T172633

Tried this:

  • ifdown eth0
  • modprobe -r tg3
  • modprobe tg3
  • ifup eth0
[Mon Aug  7 07:28:13 2017] pps_core: LinuxPPS API ver. 1 registered
[Mon Aug  7 07:28:13 2017] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[Mon Aug  7 07:28:13 2017] PTP clock support registered
[Mon Aug  7 07:28:13 2017] tg3.c:v3.137 (May 11, 2014)
[Mon Aug  7 07:28:13 2017] tg3 0000:01:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address c8:1f:66:f1:24:71
[Mon Aug  7 07:28:13 2017] tg3 0000:01:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[Mon Aug  7 07:28:13 2017] tg3 0000:01:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[Mon Aug  7 07:28:13 2017] tg3 0000:01:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[Mon Aug  7 07:28:14 2017] tg3 0000:01:00.1 eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address c8:1f:66:f1:24:72
[Mon Aug  7 07:28:14 2017] tg3 0000:01:00.1 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[Mon Aug  7 07:28:14 2017] tg3 0000:01:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[Mon Aug  7 07:28:14 2017] tg3 0000:01:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.0 eth2: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address c8:1f:66:f1:24:73
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.0 eth2: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.0 eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.0 eth2: dma_rwctrl[00000001] dma_mask[64-bit]
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.1 eth3: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address c8:1f:66:f1:24:74
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.1 eth3: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.1 eth3: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[Mon Aug  7 07:28:14 2017] tg3 0000:02:00.1 eth3: dma_rwctrl[00000001] dma_mask[64-bit]
[Mon Aug  7 07:28:14 2017] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[Mon Aug  7 07:28:17 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug  7 07:28:17 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug  7 07:28:17 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug  7 07:28:17 2017] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
elukey@analytics1034:~$ sudo ethtool eth0
Settings for eth0:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Supported pause frame use: No
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full
	                                     1000baseT/Full
	Link partner advertised pause frame use: Symmetric
	Link partner advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: off
	Supports Wake-on: g
	Wake-on: d
	Current message level: 0x000000ff (255)
			       drv probe link timer ifdown ifup rx_err tx_err
	Link detected: yes

While waiting for Chris' opinion I'll keep an eye on the eth0's status. The host is downtimed and not serving any Hadoop traffic.

Nuria set the point value for this task to 1.

@elukey: first and easiest thing is to swap the ethernet cable. Since you're not servicing traffic, I will just swap it out.

Looking good:

[Tue Aug  8 13:45:05 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Tue Aug  8 13:45:05 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Tue Aug  8 13:45:05 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Tue Aug  8 15:39:04 2017] tg3 0000:01:00.0 eth0: Link is down
[Tue Aug  8 15:39:26 2017] tg3 0000:01:00.0 eth0: Link is up at 100 Mbps, full duplex
[Tue Aug  8 15:39:26 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Tue Aug  8 15:39:26 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Wed Aug  9 06:30:36 2017] Process accounting resumed
[Thu Aug 10 06:30:42 2017] Process accounting resumed
[Fri Aug 11 06:30:48 2017] Process accounting resumed

So for the moment three days without any link flap, I'll let the weekend pass and repool the host on Monday if everything looks good.

I got fooled by dmesg, eth0 was operating at 100 Mbps and it was not flapping. After re-negotiation I can see this:

[Mon Aug 14 08:18:28 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 08:18:31 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 08:18:31 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 08:18:31 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 08:52:03 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 08:52:07 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 08:52:07 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 08:52:07 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 08:53:00 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 08:53:03 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 08:53:03 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 08:53:03 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 08:53:04 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 08:53:07 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 08:53:07 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 08:53:07 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 08:53:08 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 08:53:11 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 08:53:11 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 08:53:11 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 08:54:08 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 08:54:12 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 08:54:12 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 08:54:12 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 08:56:55 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 08:56:58 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 08:56:58 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 08:56:58 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 09:00:20 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 09:00:24 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 09:00:24 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 09:00:24 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 09:13:47 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 09:13:50 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 09:13:50 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 09:13:50 2017] tg3 0000:01:00.0 eth0: EEE is disabled
[Mon Aug 14 09:13:51 2017] tg3 0000:01:00.0 eth0: Link is down
[Mon Aug 14 09:13:58 2017] tg3 0000:01:00.0 eth0: Link is up at 1000 Mbps, full duplex
[Mon Aug 14 09:13:58 2017] tg3 0000:01:00.0 eth0: Flow control is on for TX and on for RX
[Mon Aug 14 09:13:58 2017] tg3 0000:01:00.0 eth0: EEE is disabled
elukey@analytics1034:~$ date
Mon Aug 14 09:10:25 UTC 2017

Stopped again analytics1034, @Cmjohnson I think that the cable swap didn't work :(

The cable seemed loose and would disconnect at the server eth0 port when touched. Replaced the cable and moved it around to confirm a good connection. I think this should resolve the issue.

Mentioned in SAL (#wikimedia-analytics) [2017-08-14T16:40:55Z] <elukey> analytics1034 back in service after swapping the eth cable - T172633