
asw-a7-codfw is down
Closed, ResolvedPublic

Description

asw-a7-codfw went down on Jan 6th at 08:37 UTC. The switch's console is unresponsive, so this may be a hardware fault. These servers are down:

xe-7/0/0                   mc2004
xe-7/0/1                   mc2005
xe-7/0/2                   mc2006
xe-7/0/3                   cp2004
xe-7/0/4                   ms-fe2002
xe-7/0/5                   cp2005
xe-7/0/6                   cp2006
xe-7/0/7                   ms-be-2017
xe-7/0/45                  lvs2006-eth1
xe-7/0/46                  lvs2005-eth1
xe-7/0/47                  lvs2004-eth1
et-7/0/52                  << cr2-codfw:et-0/0/0 {#10706} [40Gbps DF]
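
To make it easier to see which clusters lost hosts, the listing above can be grouped by hostname prefix. This is only a rough sketch: the function name `group_by_cluster` and the prefix rule (everything before the first digit) are assumptions made for illustration, and the uplink line is skipped because it does not name a server.

```python
import re
from collections import defaultdict

# Port-to-host listing copied from the description (uplink to cr2-codfw omitted).
PORTS = """\
xe-7/0/0                   mc2004
xe-7/0/1                   mc2005
xe-7/0/2                   mc2006
xe-7/0/3                   cp2004
xe-7/0/4                   ms-fe2002
xe-7/0/5                   cp2005
xe-7/0/6                   cp2006
xe-7/0/7                   ms-be-2017
xe-7/0/45                  lvs2006-eth1
xe-7/0/46                  lvs2005-eth1
xe-7/0/47                  lvs2004-eth1
"""

def group_by_cluster(listing):
    """Map a cluster prefix (e.g. 'cp') to the hosts behind the dead switch."""
    clusters = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        host = parts[1].split("-eth")[0]  # lvs2006-eth1 -> lvs2006
        # Assumed rule: the cluster prefix is the alphabetic part before the digits.
        prefix = re.match(r"[a-z]+(?:-[a-z]+)*", host).group(0)
        clusters[prefix].append(host)
    return dict(clusters)

clusters = group_by_cluster(PORTS)
for prefix, hosts in clusters.items():
    print(f"{prefix}: {len(hosts)} down ({', '.join(hosts)})")
```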

Fallout:

  • Since 3 cp* hosts are down, @ema/@Joe depooled cp* (3188f9dd078fa8c2d21eeeebb136657cc459926c, 96b9b88d94d3d044972de9580c9cfa5eaab11949). I assume this was done because of the reduced redundancy and the effect on miss rates, since the clusters otherwise (probably?) still work.
  • On the LVS hosts, acamar (a row A server) is listed first in resolv.conf; lvs2006's pybal wasn't happy with the resulting delays and marked everything as down. lvs2006 is the backup, so there were no user-facing issues, but this is still something to follow up on. I've commented acamar out on lvs2006 as a manual hack; this needs further investigation (T154759). puppetmaster2001 is on row A as well, so puppet is currently unreachable from lvs2004/5/6 (minor).
  • Due to the cr2-codfw<->asw-a-codfw link going down (and probably because cr2 is the VRRP master), there was a very brief flap. This caused issues for ElasticSearch (now recovering) and alert spam.
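
The resolv.conf item above comes down to resolver ordering: with an unreachable server listed first, every lookup waits out a timeout before the next server is tried. A minimal sketch of that effect follows, assuming the documented resolv.conf defaults (`timeout:5`, `attempts:2`) and a deliberately simplified model in which each unreachable server costs one full timeout per pass. The function `lookup_delay` and the name `second-resolver` are hypothetical; the real stub resolver's timing differs in its details.

```python
def lookup_delay(nameservers, unreachable, timeout=5, attempts=2):
    """Seconds spent waiting before a reachable nameserver gets queried.

    Simplified model (an assumption, not the exact glibc algorithm):
    servers are tried in listed order, each unreachable one costs a full
    timeout, and the whole list is retried up to `attempts` times.
    Returns None if no listed server is reachable.
    """
    waited = 0
    for _ in range(attempts):
        for ns in nameservers:
            if ns not in unreachable:
                return waited
            waited += timeout
    return None

down = {"acamar"}  # row A resolver, unreachable during the outage

# acamar listed first: every lookup stalls before falling through.
print(lookup_delay(["acamar", "second-resolver"], down))

# acamar commented out (the manual hack on lvs2006): no stall.
print(lookup_delay(["second-resolver"], down))
```

Under this model, a health check whose own timeout is shorter than the resolver stall would fail even though the backend is fine, which is consistent with pybal marking everything as down.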

Event Timeline

The impact in terms of varnish errors has been minimal. In codfw we've had two hiccups, one at 8:30 and another, smaller one at 8:40.

2017-01-06-codfw-503s.png (873×1 px, 89 KB)

In ulsfo we had a small 503 spike at 8:51.

2017-01-06-ulsfo-503.png (873×1 px, 129 KB)

We depooled codfw in DNS at about 8:41, while the traffic route change from ulsfo -> codfw to ulsfo -> eqiad happened at about 8:50.

Avcipsids added a project: acl*security.
Avcipsids changed the visibility from "Public (No Login Required)" to "Custom Policy".
Avcipsids subscribed.
Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)".

The switch has been brought back up with a hard power cycle.

We don't have a real indication yet of why it crashed. There's nothing concrete in the logs (local or otherwise), other than that the entire stack has been generally unhappy/desynced for several days, with messages like:

Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_OPEN_TIMEOUT: Timeout connecting to peer 'lacp'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_WRITE_LOSTCONN: Lost connection to peer 'app-engine-management-service'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_SELECT_FAILED: select failed for peer app-engine-management-service: Bad file descriptor
Jan  4 22:05:12  asw-a-codfw /kernel: et-7/0/52: get tlv ppfeid 0et-2/0/51: get tlv ppfeid 0et-2/0/52: get tlv ppfeid 0et-7/0/51: get tlv ppfeid 0et-7/0/52: get tlv ppfeid 0
Jan  5 14:42:55  asw-a-codfw /kernel: ge-8/0/6: get tlv ppfeid 0
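
When going through several days of such messages, it can help to tally them by process and event tag. A small sketch, assuming the line layout seen in the excerpt above (timestamp, hostname, process with optional PID, then an upper-case tag); the regular expression is an illustration fitted to these sample lines, not a general Junos syslog parser, and kernel lines without an upper-case tag are intentionally not counted.

```python
import re
from collections import Counter

# Sample lines copied from the excerpt above.
LOG = """\
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_OPEN_TIMEOUT: Timeout connecting to peer 'lacp'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_WRITE_LOSTCONN: Lost connection to peer 'app-engine-management-service'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_SELECT_FAILED: select failed for peer app-engine-management-service: Bad file descriptor
Jan  5 14:42:55  asw-a-codfw /kernel: ge-8/0/6: get tlv ppfeid 0
"""

# month, day, time, host, process[pid], TAG
LINE = re.compile(
    r"^\w{3}\s+\d+ [\d:]{8}\s+\S+ "
    r"(?P<proc>[^\s\[:]+)(?:\[\d+\])?: (?P<tag>[A-Z][A-Z0-9_]+): ",
    re.MULTILINE,
)

def count_tags(text):
    """Count (process, tag) pairs for lines that carry an upper-case event tag."""
    return Counter((m["proc"], m["tag"]) for m in LINE.finditer(text))

counts = count_tags(LOG)
for (proc, tag), n in counts.most_common():
    print(f"{n:4d}  {proc}  {tag}")
```
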

mark lowered the priority of this task from Unbreak Now! to Medium (Jan 6 2017, 3:40 PM).

Change 332526 had a related patch set uploaded (by Ema):
Revert "Route around codfw, network issues there"

https://gerrit.wikimedia.org/r/332526

Change 332526 merged by Ema:
Revert "Route around codfw, network issues there"

https://gerrit.wikimedia.org/r/332526

faidon claimed this task.

Nothing more to do here.