eeden ethernet outage
Closed, Resolved (Public)

Description

Today eeden.wikimedia.org aka ns2.wikimedia.org was offline from the network for ~14 minutes. The icinga alert boundaries in IRC were 16:05 -> 16:19 UTC.

I was able to log in manually on the console as root and confirm no apparent local issues on the machine. It was un-pingable even from other hosts on the same network and could not ping its default gateway. tcpdump confirmed normal ns2 DNS requests were flowing into it on the ethernet port, and that it was responding on the ethernet port as well. However, on its switch (csw2-esams), the interface statistics were showing a normal-ish outbound pps (elsewhere->eeden), but zero pps inbound (eeden->elsewhere). There were no link state changes anywhere near this time window for this port.

Can we figure out why this happened from switch logs?

Event Timeline

BBlack created this task. Sep 22 2016, 5:31 PM
Restricted Application added a project: Operations. Sep 22 2016, 5:31 PM
Restricted Application added subscribers: Southparkfan, Aklapper.

Nothing appears abnormal in the logs of csw2, asw, or cr2. Which other hosts on the same network did you try from? I'm interested to find out whether they are connected to csw2 or asw, and if it was the former, on which stack member.

If this happens again and I'm not around, it'd be helpful to check:

  • connectivity from hosts on the same stack member, from hosts on a different member of the same switch stack, and from hosts on a different switch stack entirely (csw2 vs. asw).
  • whether IPv6 works
  • what ARP says on both the host (arp -n) and the router (show arp no-resolve), i.e. whether broadcasts work and unicasts don't, or vice versa.
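For the host-side ARP check, a rough sketch of how the arp -n output could be scanned for unresolved neighbours. It assumes the usual net-tools column layout, where an unresolved entry shows "(incomplete)" in place of a MAC address. The sample table is invented (documentation addresses and made-up MACs), and the function names are only illustrative.

```python
def parse_arp_table(text):
    """Map each IP in `arp -n` style output to its MAC, or None if unresolved."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the column header.
        if not fields or fields[0] == "Address":
            continue
        if "(incomplete)" in fields:
            entries[fields[0]] = None
        elif len(fields) >= 3:
            # Resolved rows look like: Address HWtype HWaddress Flags ... Iface
            entries[fields[0]] = fields[2].lower()
    return entries


def unresolved(text):
    """Return the IPs whose ARP entry is still incomplete."""
    return sorted(ip for ip, mac in parse_arp_table(text).items() if mac is None)


# Invented sample, not captured from eeden.
SAMPLE = """\
Address                  HWtype  HWaddress           Flags Mask            Iface
192.0.2.1                        (incomplete)                              eth0
192.0.2.2                ether   02:00:00:00:00:02   C                     eth0
192.0.2.3                ether   02:00:00:00:00:03   C                     eth0
"""

if __name__ == "__main__":
    print(unresolved(SAMPLE))
```

In practice the text would come from running arp -n on the affected host; the router side (show arp no-resolve) has a different layout and is not covered here.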
grin added a subscriber: grin. Sep 23 2016, 5:58 AM

(testing lurking on phabricator made me see this ;-))
My 2 cents: since the default gateway was not pingable, I'd check (apart from ARP) the IRQs on the machine. I assume you've checked that there was nothing in syslog about stuck ethernet rings or a stuck device. If it were on v6 the gateway might play tricks, but that usually doesn't happen with static v4 configs.
As a side note, this also happens with cabling problems when only one wire is faulty (no link loss, but loss of one direction), usually when someone's fiddling around. The switch can hardly say anything useful; the counters on the machine's eth interface would be much more helpful.
sorry for chiming in. :-)
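
As a rough illustration of the host-side counter idea, the sketch below samples the Linux per-interface counters under /sys/class/net/<iface>/statistics and turns two snapshots into per-second rates, which could then be compared with what the switch reports for the same port. The interface name, sampling interval and helper names are assumptions for the example, not anything taken from eeden.

```python
import time
from pathlib import Path

COUNTERS = ("rx_packets", "tx_packets", "rx_errors", "tx_errors")


def read_counters(iface, root="/sys/class/net"):
    """Read the kernel's cumulative counters for one interface (Linux sysfs)."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def rates(before, after, seconds):
    """Convert two cumulative snapshots into per-second rates."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def sample(iface="eth0", seconds=5):
    """Take two snapshots `seconds` apart and return the rates.

    "eth0" is a placeholder interface name.
    """
    first = read_counters(iface)
    time.sleep(seconds)
    return rates(first, read_counters(iface), seconds)
```

If the host shows a healthy tx_packets rate while the switch port shows zero inbound pps, the packets are being lost somewhere between the NIC and the switch's counters.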

No reason to be sorry — thanks for the input!

BBlack moved this task from Triage to DNS Infra on the Traffic board. Sep 30 2016, 2:07 PM
faidon closed this task as Resolved. Oct 6 2016, 1:58 PM
faidon claimed this task.

Two weeks have passed and this hasn't reoccurred. I'm going to resolve this for now — we can reopen if it happens again or if we have more information about it.

BBlack reopened this task as Open. Oct 13 2016, 4:04 PM

Down again! Assuming for the moment it's ethernet again...

Noting that there are no TX or RX errors on the interface of either the host or the switch.

This happened twice yesterday, unfortunately during the GlobalSign event. I investigated it both times, but both times the downtime was brief, which limited my troubleshooting time.

FTR, this is what I found:

  • IPv6 worked throughout. In fact, I ssh'ed into the machine normally.
  • arp -n on the box showed "(incomplete)" for the gateway (.1). tcpdump confirmed who-has requests being sent out but no replies being received.
  • Pings to .2/.3 worked fine, but these of course are to different MAC addresses (.1 is a VRRP MAC)
  • On cr2-esams, show arp no-resolve listed eeden's MAC address just fine. The issue was only present in the opposite direction.
  • Logs on switches/routers didn't show anything.
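
On the VRRP MAC point above: VRRP virtual router MACs follow a fixed pattern (00:00:5e:00:01:{VRID} for IPv4 and 00:00:5e:00:02:{VRID} for IPv6, per the VRRP RFCs), so a small helper can tell whether an address seen in an ARP table or a tcpdump is a virtual one. This is only a sketch; the function name is made up, and the MACs in the usage lines are illustrative rather than taken from esams.

```python
# VRRP virtual MAC prefixes; the last octet is the virtual router ID (VRID).
VRRP_PREFIXES = {"00:00:5e:00:01": "IPv4", "00:00:5e:00:02": "IPv6"}


def vrrp_info(mac):
    """Return (address family, VRID) for a VRRP virtual MAC, else None.

    Expects six zero-padded hex octets separated by ':' or '-'.
    """
    octets = mac.lower().replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    family = VRRP_PREFIXES.get(":".join(octets[:5]))
    if family is None:
        return None
    return family, int(octets[5], 16)


if __name__ == "__main__":
    print(vrrp_info("00:00:5E:00:01:0A"))  # illustrative virtual MAC, VRID 10
    print(vrrp_info("02:00:00:00:00:02"))  # illustrative non-VRRP MAC
```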

If I had to bet, I'd bet on a csw2-esams bug, but it's hard to tell :(

grin added a comment. Oct 14 2016, 1:18 PM

Around the time the link went away, was there any VRRP change?

(Either .1 didn't get/accept the ARP request, or hasn't answered it, or answered it on a different interface, I'd say without looking into the architecture.)

elukey added a subscriber: elukey. Oct 14 2016, 5:04 PM

Happened again today (UTC timezone):

18:32  <icinga-wm> PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
18:33  <icinga-wm> PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[..]
18:41  <icinga-wm> RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms
18:41  <icinga-wm> RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 84.01 ms
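
To get outage lengths out of IRC alert lines like the ones above, something along these lines could work. It is a sketch that assumes every line starts with an HH:MM timestamp, that all events fall on the same day, and that the nick is icinga-wm; the regex and function name are illustrative.

```python
import re
from datetime import datetime

# Matches e.g. "18:32  <icinga-wm> PROBLEM - Host eeden is DOWN: ..."
LINE = re.compile(
    r"^(\d{2}:\d{2})\s+<icinga-wm>\s+(PROBLEM|RECOVERY) - Host (\S+) is (DOWN|UP)"
)


def outages(log):
    """Return (host, minutes down) for each PROBLEM/RECOVERY pair in the log."""
    down_since = {}
    result = []
    for line in log.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # e.g. the "[..]" elision line
        stamp, kind, host, _state = match.groups()
        when = datetime.strptime(stamp, "%H:%M")
        if kind == "PROBLEM":
            # Keep the first PROBLEM time if the alert repeats.
            down_since.setdefault(host, when)
        elif host in down_since:
            delta = when - down_since.pop(host)
            result.append((host, int(delta.total_seconds() // 60)))
    return result


LOG = """\
18:32  <icinga-wm> PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
18:33  <icinga-wm> PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[..]
18:41  <icinga-wm> RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms
18:41  <icinga-wm> RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 84.01 ms
"""

if __name__ == "__main__":
    for host, minutes in outages(LOG):
        print(f"{host}: down for about {minutes} min")
```

Since the timestamps only have minute resolution, the figures are good to within a minute or so at best.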
ayounsi moved this task from Backlog to Troubleshooting on the netops board. Jun 27 2017, 2:49 PM

This hasn't happened in a long time. Should we just resolve?

faidon closed this task as Resolved. Aug 29 2017, 11:29 AM