asw-a7-codfw is down
Closed, Resolved (Public)

Description

asw-a7-codfw went down on Jan 6th at 08:37 UTC. The switch's console is unresponsive, so this may be a hardware fault. These servers are down:

xe-7/0/0                   mc2004
xe-7/0/1                   mc2005
xe-7/0/2                   mc2006
xe-7/0/3                   cp2004
xe-7/0/4                   ms-fe2002
xe-7/0/5                   cp2005
xe-7/0/6                   cp2006
xe-7/0/7                   ms-be-2017
xe-7/0/45                  lvs2006-eth1
xe-7/0/46                  lvs2005-eth1
xe-7/0/47                  lvs2004-eth1
et-7/0/52                  << cr2-codfw:et-0/0/0 {#10706} [40Gbps DF]
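To see at a glance how many hosts of each class sit behind this switch, the port list above can be grouped by hostname prefix. This is only a rough sketch: the `group_hosts` helper and its prefix rule (leading letters and hyphens, with an `-ethN` suffix stripped) are assumptions made for illustration, not an existing tool.

```python
import re
from collections import defaultdict

# Server-facing ports copied from the list above (uplink line left out).
PORTS = """\
xe-7/0/0                   mc2004
xe-7/0/1                   mc2005
xe-7/0/2                   mc2006
xe-7/0/3                   cp2004
xe-7/0/4                   ms-fe2002
xe-7/0/5                   cp2005
xe-7/0/6                   cp2006
xe-7/0/7                   ms-be-2017
xe-7/0/45                  lvs2006-eth1
xe-7/0/46                  lvs2005-eth1
xe-7/0/47                  lvs2004-eth1
"""

def group_hosts(text):
    """Group hosts by class prefix (hypothetical helper, illustration only)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines or anything that is not "port host"
        _port, desc = parts
        host = desc.split("-eth")[0]  # drop an interface suffix such as -eth1
        prefix = re.match(r"[a-z]+(?:-[a-z]+)*", host).group(0)
        groups[prefix].append(host)
    return dict(groups)

if __name__ == "__main__":
    for prefix, hosts in group_hosts(PORTS).items():
        print(f"{prefix}: {len(hosts)} ({', '.join(hosts)})")
```

With the list above this gives three hosts each for mc, cp and lvs, and one each for ms-fe and ms-be.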

Fallout:

  • Since 3 cp* hosts are down, @ema/@Joe depooled cp* (3188f9dd078fa8c2d21eeeebb136657cc459926c, 96b9b88d94d3d044972de9580c9cfa5eaab11949). I assume this was due to the lack of redundancy and the miss rates, since otherwise the clusters (probably?) work.
  • On the lvs hosts, acamar (a row A server) is first in resolv.conf, and lvs2006's pybal wasn't happy with the resulting delays and marked everything as down. lvs2006 is the backup, so there were no user-facing issues, but this is still something to follow up on. I've commented acamar out on lvs2006 as a manual hack; it needs further investigation (T154759). puppetmaster2001 is on row A as well, so puppet is currently unreachable from lvs2004/5/6 (minor).
  • Due to cr2-codfw<->asw-a-codfw going down (and probably because cr2 is the VRRP master) there was a very small flap. This caused issues for ElasticSearch (now recovering) and alert spam.
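As a rough illustration of why an unreachable first entry in resolv.conf hurts, the sketch below models the first pass of a stub resolver that tries nameservers in listed order and waits out a timeout on each dead one. It assumes the glibc default of 5 seconds per server (`options timeout` not set); the actual settings on the lvs hosts and the way pybal performs lookups are not stated in this task, so treat the numbers as assumptions.

```python
def lookup_delay(servers_up, timeout=5.0):
    """Seconds lost on dead nameservers before a live one is tried.

    servers_up: reachability of each nameserver, in resolv.conf order.
    timeout:    per-server wait in seconds (assumed glibc default of 5).
    Returns None if no listed server is reachable. Simplified model:
    only the first pass over the list is considered, retries are ignored.
    """
    delay = 0.0
    for up in servers_up:
        if up:
            return delay
        delay += timeout
    return None

if __name__ == "__main__":
    # First server (acamar in this incident) down, second one up.
    print(lookup_delay([False, True]))
    # Workaround on lvs2006: the dead entry is commented out.
    print(lookup_delay([True]))
```

Under these assumptions every uncached lookup pays the full timeout while the dead server stays first in the list, which is consistent with commenting it out as a stopgap.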
faidon created this task. Jan 6 2017, 9:41 AM
Restricted Application added subscribers: Jay8g, TerraCodes, Aklapper. Jan 6 2017, 9:41 AM
ema added a comment. Jan 6 2017, 10:46 AM

The impact on varnish errors has been minimal. In codfw we've had two hiccups, one at 8:30 and another, smaller one at 8:40.

In ulsfo we had a small 503 spike at 8:51.

We depooled codfw in DNS at about 8:41, while the traffic route change from ulsfo -> codfw to ulsfo -> eqiad happened ~ 8:50.

mark added a subscriber: mark. Jan 6 2017, 12:14 PM
Avcipsids set Security to Software security bug. Jan 6 2017, 12:37 PM
Avcipsids added a project: Security.
Avcipsids changed the visibility from "Public (No Login Required)" to "Custom Policy".
Avcipsids added a subscriber: Avcipsids.
Restricted Application removed a subscriber: TerraCodes. Jan 6 2017, 12:37 PM
Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)".
Restricted Application added a project: Security. Jan 6 2017, 2:12 PM
mark added a comment. Jan 6 2017, 3:40 PM

The switch has been brought back up with a hard power cycle.

We don't have a real indication yet of why it crashed. There's nothing concrete in the logs (local or otherwise), other than that the entire stack has been generally unhappy/desynced for several days, with messages like:

Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_OPEN_TIMEOUT: Timeout connecting to peer 'lacp'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_WRITE_LOSTCONN: Lost connection to peer 'app-engine-management-service'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_SELECT_FAILED: select failed for peer app-engine-management-service: Bad file descriptor
Jan  4 22:05:12  asw-a-codfw /kernel: et-7/0/52: get tlv ppfeid 0et-2/0/51: get tlv ppfeid 0et-2/0/52: get tlv ppfeid 0et-7/0/51: get tlv ppfeid 0et-7/0/52: get tlv ppfeid 0
Jan  5 14:42:55  asw-a-codfw /kernel: ge-8/0/6: get tlv ppfeid 0
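When going through several days of such messages, a quick tally per daemon and message tag can help spot when the stack started misbehaving. The following is a minimal sketch, assuming the line layout seen in the excerpt above (timestamp, host, daemon with optional PID, optional upper-case tag); the `count_events` helper and the regex are illustrative assumptions, not part of any Junos tooling.

```python
import re
from collections import Counter

# The five lines quoted above, verbatim.
LOG = """\
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_OPEN_TIMEOUT: Timeout connecting to peer 'lacp'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_WRITE_LOSTCONN: Lost connection to peer 'app-engine-management-service'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_SELECT_FAILED: select failed for peer app-engine-management-service: Bad file descriptor
Jan  4 22:05:12  asw-a-codfw /kernel: et-7/0/52: get tlv ppfeid 0et-2/0/51: get tlv ppfeid 0et-2/0/52: get tlv ppfeid 0et-7/0/51: get tlv ppfeid 0et-7/0/52: get tlv ppfeid 0
Jan  5 14:42:55  asw-a-codfw /kernel: ge-8/0/6: get tlv ppfeid 0
"""

# month day time host daemon[pid]: OPTIONAL_TAG: message
LINE = re.compile(
    r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+"
    r"(?P<daemon>[^\s\[:]+)(?:\[\d+\])?:\s+"
    r"(?:(?P<tag>[A-Z][A-Z0-9_]+):)?"
)

def count_events(log):
    """Count log lines per (daemon, tag); untagged lines share one bucket."""
    counts = Counter()
    for line in log.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # ignore lines that do not follow the assumed layout
        counts[(m.group("daemon"), m.group("tag") or "(untagged)")] += 1
    return counts

if __name__ == "__main__":
    for (daemon, tag), n in sorted(count_events(LOG).items()):
        print(f"{daemon:10} {tag:20} {n}")
```

On the excerpt this yields one line each for the three mgd tags and two untagged /kernel lines; on a full log the same tally could be bucketed per day.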
mark lowered the priority of this task from Unbreak Now! to Normal. Jan 6 2017, 3:40 PM

Mentioned in SAL (#wikimedia-operations) [2017-01-06T16:05:19Z] <ema> wiping codfw caches T154758

Change 332526 had a related patch set uploaded (by Ema):
Revert "Route around codfw, network issues there"

https://gerrit.wikimedia.org/r/332526

Change 332526 merged by Ema:
Revert "Route around codfw, network issues there"

https://gerrit.wikimedia.org/r/332526

faidon closed this task as Resolved. Jan 25 2017, 11:06 AM
faidon claimed this task.

Nothing more to do here.