cp1052 ethernet link down 2016-10-22 14:11
Closed, Resolved · Public

Description

This is a prod cache_text machine. I've depooled it explicitly from service for now.

14:11 < icinga-wm> PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100%

The RAC console still works and I can log in, so the host is still up. There is no ethernet link, and dmesg shows the following:

[Sat Oct 22 14:11:16 2016] bnx2x 0000:01:00.0 eth0: NIC Link is Down

While looking at dmesg, I also noticed temperature threshold events from ~4 days ago, which reminds me that we still have T125205 outstanding to monitor for thermal issues...

Event Timeline

Restricted Application added subscribers: Southparkfan, Aklapper.

The interface on asw-c-eqiad shows as down as well:

bblack@asw-c-eqiad> show interfaces xe-8/0/7 
Physical interface: xe-8/0/7, Enabled, Physical link is Down
  Interface index: 926, SNMP ifIndex: 795
  Description: cp1052
  Link-level type: Ethernet, MTU: 1514, Speed: 10Gbps, Duplex: Full-Duplex, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled,
  Source filtering: Disabled, Flow control: Enabled
  Device flags   : Present Running Down
  Interface flags: Hardware-Down SNMP-Traps Internal: 0x0
  Link flags     : None
  CoS queues     : 8 supported, 8 maximum usable queues
  Current address: 50:c5:8d:a8:2c:8a, Hardware address: 50:c5:8d:a8:2c:8a
  Last flapped   : 2016-10-22 14:10:39 UTC (00:27:55 ago)
  Input rate     : 0 bps (0 pps)
  Output rate    : 0 bps (0 pps)
  Active alarms  : LINK
  Active defects : LINK
  Interface transmit statistics: Disabled

  Logical interface xe-8/0/7.0 (Index 293) (SNMP ifIndex 0)
    Flags: Device-Down SNMP-Traps Encapsulation: ENET2
    Input packets : 10782202 
    Output packets: 95
    Protocol eth-switch
      Flags: None
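
For anyone who wants to pull the link state and flap time out of this kind of output programmatically, here is a minimal sketch. It only handles the exact line shapes quoted above (in particular an HH:MM:SS "ago" value), and the function name parse_link_state and the returned keys are made up for illustration. Adding the "ago" value to the flap time gives the approximate time the command was run.

```python
import re
from datetime import datetime, timedelta

# Excerpt of the `show interfaces` output quoted above.
SHOW_INTERFACES = """\
Physical interface: xe-8/0/7, Enabled, Physical link is Down
  Description: cp1052
  Last flapped   : 2016-10-22 14:10:39 UTC (00:27:55 ago)
"""


def parse_link_state(text):
    """Extract interface name, link state, last flap time and implied query time.

    Hypothetical helper: assumes the "ago" field is in HH:MM:SS form,
    as in the excerpt above. Other forms are not handled.
    """
    head = re.search(
        r"Physical interface: (\S+), (\w+), Physical link is (\w+)", text
    )
    flap = re.search(
        r"Last flapped\s*: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC "
        r"\((\d{2}):(\d{2}):(\d{2}) ago\)",
        text,
    )
    if head is None or flap is None:
        raise ValueError("unrecognised show interfaces output")

    last_flapped = datetime.strptime(flap.group(1), "%Y-%m-%d %H:%M:%S")
    age = timedelta(
        hours=int(flap.group(2)),
        minutes=int(flap.group(3)),
        seconds=int(flap.group(4)),
    )
    return {
        "interface": head.group(1),
        "admin": head.group(2),
        "link": head.group(3),
        "last_flapped": last_flapped,       # UTC, naive datetime
        "queried_at": last_flapped + age,   # when the command was run
    }


if __name__ == "__main__":
    for key, value in parse_link_state(SHOW_INTERFACES).items():
        print(f"{key}: {value}")
```

The switch-side flap time can then be compared with the host-side "NIC Link is Down" timestamp from dmesg, bearing in mind that the two clocks are not guaranteed to agree to the second.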

I tried resetting the interface on the switch side with test interface xe-8/0/7 restart-auto-negotiation, and also committed a disable followed by a re-enable of the interface. No change.

Tried resetting it from host-side software, and that seems to have worked!

In response to ifconfig eth0 down, dmesg had new output:

[Sat Oct 22 14:46:17 2016] failed to kill vid 0081/0 for device eth0

Then in response to ifconfig eth0 up, dmesg gave:

[Sat Oct 22 14:46:36 2016] bnx2x 0000:01:00.0 eth0: using MSI-X  IRQs: sp 37  fp[0] 39 ... fp[15] 54
[Sat Oct 22 14:46:37 2016] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from FINISAR CORP.    part number FTLX1471D3BCL   
[Sat Oct 22 14:46:37 2016] bnx2x 0000:01:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Sat Oct 22 14:46:37 2016] 8021q: adding VLAN 0 to HW filter on device eth0

The ifconfig commands messed up the host's network state (e.g. they removed the default gateway), and the software might be in a messy state in general from being offline, so I've rebooted the host now.

BBlack claimed this task.

Seems ok post-reboot, repooled.

BBlack added a subscriber: Joe.

It failed again:

15:32 < icinga-wm> PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100%

@Joe depooled again ~15:37. Leaving it depooled this time, since it's clearly not a one-off thing.

BBlack moved this task from Varnish v4 to General on the Traffic board.

[Fri Feb 10 15:01:57 2017] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from FINISAR CORP.    part number FTLX1471D3BCL
[Fri Feb 10 15:01:58 2017] bnx2x 0000:01:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit

I've now brought the system back up (ifconfig eth0 down ; ifconfig eth0 up) and did a bit of a puppet agent dance.

Let's keep an eye on the machine without repooling for now, as it's not exactly reliable.

That didn't last long:

[Sat Feb 11 03:37:44 2017] bnx2x 0000:01:00.0 eth0: NIC Link is Down
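
To put numbers on how long each link state lasted, the bnx2x link messages quoted in this task can be paired up. This is a rough sketch only: the function names are invented for illustration, the input is just the four "NIC Link is" lines pasted in this task rather than a full log, and dmesg's human-readable timestamps are approximate. Consecutive events with the same state are skipped, since they indicate a gap in the quoted log (e.g. a reboot) rather than a real interval.

```python
import re
from datetime import datetime

# The "NIC Link is ..." lines quoted in this task, in order.
DMESG = """\
[Sat Oct 22 14:11:16 2016] bnx2x 0000:01:00.0 eth0: NIC Link is Down
[Sat Oct 22 14:46:37 2016] bnx2x 0000:01:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Fri Feb 10 15:01:58 2017] bnx2x 0000:01:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Sat Feb 11 03:37:44 2017] bnx2x 0000:01:00.0 eth0: NIC Link is Down
"""

LINK_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \S+ \S+ (?P<iface>\w+): NIC Link is (?P<state>Up|Down)"
)


def link_events(text):
    """Return (timestamp, interface, state) for every link up/down line."""
    events = []
    for line in text.splitlines():
        m = LINK_RE.match(line)
        if m:
            when = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
            events.append((when, m["iface"], m["state"]))
    return events


def state_durations(events):
    """Report how long a state lasted, for each real Up<->Down transition."""
    durations = []
    for (when, _, state), (next_when, _, next_state) in zip(events, events[1:]):
        if state != next_state:  # same state twice means the log has a gap
            durations.append((state, next_when - when))
    return durations


if __name__ == "__main__":
    for state, length in state_durations(link_events(DMESG)):
        print(f"link was {state} for {length}")
```

On the lines above this reports the first outage as a bit over 35 minutes, and the February link-up period as roughly twelve and a half hours before the link dropped again.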

Also worth mentioning, /var/log/mcelog contains multiple instances of the following error:

STATUS 8c000043000800c0 MCGSTATUS 0
MCGCAP 1000c14 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 8 
MISC 122100100010068c ADDR 160cbf6000 
TIME 1482798331 Tue Dec 27 00:25:31 2016
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
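
As a cross-check of mcelog's interpretation, the STATUS value can be decoded by hand. The sketch below uses the IA32_MCi_STATUS bit layout from the Intel SDM as I understand it (flag bits 63..57, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0) and the memory-controller error code pattern 000F 0000 1MMM CCCC. The function and dictionary names are invented for illustration, and the corrected error count field is only meaningful when MCG_CAP bit 10 is set, which it is in the MCGCAP value above.

```python
STATUS = 0x8C000043000800C0  # STATUS value from the mcelog record above

# Flag bits of IA32_MCi_STATUS (bit positions per the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}

# MMM field of a memory-controller error code (000F 0000 1MMM CCCC).
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_status(status):
    """Split an MCi_STATUS value into the fields discussed above."""
    mca_code = status & 0xFFFF
    decoded = {
        "flags": {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()},
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": mca_code,                            # bits 15:0
    }
    # Memory-controller codes: bits 15:13 and 11:8 clear, bit 7 set
    # (bit 12 is the F flag and is ignored here).
    if mca_code & 0xEF80 == 0x0080:
        decoded["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        decoded["channel"] = mca_code & 0xF
    return decoded


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(f"{key}: {value}")
```

For this record the decode agrees with mcelog: VAL, MISCV and ADDRV are set, UC and PCC are clear (a corrected error), and the MCA code 0x00c0 is a memory scrubbing error on channel 0.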

@Cmjohnson ideas on how to proceed? In its current state the system is just wasting power.

@faidon, I will swap out the SFP+; that is the most typical culprit. Do we need to schedule downtime, or can I do it at any time?

I believe @ema has depooled it, so any time should be OK.

@Cmjohnson yes the system is indeed depooled. Please go ahead whenever it is convenient for you. Thanks!

@ema the SFP+ has been replaced and I see a link light now. Let me know if that fixes the problem.

Mentioned in SAL (#wikimedia-operations) [2017-02-27T11:04:45Z] <ema> rebooting cp1052 into kernel 4.4.2-3+wmf8 T148891