asw-a7-codfw is down
Closed, ResolvedPublic

Assigned To

Authored By

	faidon
	Jan 6 2017, 9:41 AM

Description

asw-a7-codfw went down on Jan 6th at 08:37 UTC. The switch's console is unresponsive; possibly a hardware fault? These servers are down:

xe-7/0/0                   mc2004
xe-7/0/1                   mc2005
xe-7/0/2                   mc2006
xe-7/0/3                   cp2004
xe-7/0/4                   ms-fe2002
xe-7/0/5                   cp2005
xe-7/0/6                   cp2006
xe-7/0/7                   ms-be-2017
xe-7/0/45                  lvs2006-eth1
xe-7/0/46                  lvs2005-eth1
xe-7/0/47                  lvs2004-eth1
et-7/0/52                  << cr2-codfw:et-0/0/0 {#10706} [40Gbps DF]
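
As a rough cross-check of the fallout, the port list above can be parsed and grouped by host role. This is only a sketch: the function names `parse_ports` and `group_by_role` are hypothetical, and the "role" is simply guessed from the alphabetic prefix of each port description, which is an assumption about the naming scheme rather than anything authoritative.

```python
import re
from collections import defaultdict

# Port list copied from the task description.
PORT_LIST = """\
xe-7/0/0                   mc2004
xe-7/0/1                   mc2005
xe-7/0/2                   mc2006
xe-7/0/3                   cp2004
xe-7/0/4                   ms-fe2002
xe-7/0/5                   cp2005
xe-7/0/6                   cp2006
xe-7/0/7                   ms-be-2017
xe-7/0/45                  lvs2006-eth1
xe-7/0/46                  lvs2005-eth1
xe-7/0/47                  lvs2004-eth1
et-7/0/52                  << cr2-codfw:et-0/0/0 {#10706} [40Gbps DF]
"""


def parse_ports(text):
    """Return {interface: description} for each 'interface  description' line."""
    ports = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            ports[parts[0]] = parts[1].strip()
    return ports


def group_by_role(ports):
    """Group descriptions by leading alphabetic prefix (assumed to be the role)."""
    groups = defaultdict(list)
    for description in ports.values():
        match = re.match(r"[a-z]+(?:-[a-z]+)*", description)
        role = match.group(0) if match else "other"  # e.g. the uplink line
        groups[role].append(description)
    return dict(groups)


if __name__ == "__main__":
    for role, hosts in sorted(group_by_role(parse_ports(PORT_LIST)).items()):
        print(f"{role}: {len(hosts)} ({', '.join(hosts)})")
```

Grouping this way makes it easy to see at a glance which clusters lost several members at once, such as the three cp* hosts mentioned under Fallout.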

Fallout:

Since 3 cp* are down, @ema/@Joe depooled cp* (3188f9dd078fa8c2d21eeeebb136657cc459926c, 96b9b88d94d3d044972de9580c9cfa5eaab11949). I assume this was due to the lack of redundancy and to miss rates, since otherwise the clusters (probably?) work.
On the lvs hosts, acamar (a row A server) is listed first in resolv.conf; lvs2006's pybal wasn't happy with the resulting delays and marked everything as down. lvs2006 is the backup, so there were no user-facing issues, but this is still something to follow up on. I've commented acamar out on lvs2006 as a manual hack; this needs further investigation (T154759). puppetmaster2001 is on row A as well, so puppet is currently unreachable from lvs2004/5/6 (minor).
Because the cr2-codfw<->asw-a-codfw link went down (and probably because cr2 is the VRRP master), there was a very brief flap. This caused issues for ElasticSearch (now recovering) and alert spam.
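
To illustrate why an unreachable first entry in resolv.conf can translate into noticeable delays, here is a deliberately simplified sketch. It assumes a stub resolver that tries nameservers in the listed order, waits a fixed `timeout` on each unresponsive one, and walks the list up to `attempts` times; the defaults of 5 seconds and 2 attempts are the ones documented in resolv.conf(5). The function name is hypothetical, and this does not model PyBal's own DNS handling, which is what T154759 is meant to investigate.

```python
def worst_case_lookup_delay(nameservers_up, timeout=5.0, attempts=2):
    """Estimate seconds spent waiting before a lookup is answered or gives up.

    nameservers_up: booleans in resolv.conf order, True if the server responds.
    Simplified model: responsive servers answer instantly, unresponsive ones
    cost exactly `timeout` seconds each time they are tried.
    """
    waited = 0.0
    for _ in range(attempts):
        for responds in nameservers_up:
            if responds:
                return waited
            waited += timeout
    return waited  # every server failed on every attempt


if __name__ == "__main__":
    # First nameserver unreachable (as with acamar), second one healthy.
    print(worst_case_lookup_delay([False, True]))
    # Same servers, but with the unreachable one moved out of first place.
    print(worst_case_lookup_delay([True, False]))
```

Under these assumptions every lookup pays the full timeout while the dead server stays first in the list, which is consistent with commenting it out being an effective stopgap.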

Details

	Subject	Repo	Branch	Lines +/-
	Revert "Route around codfw, network issues there"	operations/puppet	production	+4 -4


Related Objects

Mentioned In: T154759: Pybal not happy with DNS delays
Mentioned Here: T154759: Pybal not happy with DNS delays
rODNS96b9b88d94d3: Temporarily depool codfw
rOPUP3188f9dd078f: Route around codfw, network issues there

Event Timeline

faidon created this task. Jan 6 2017, 9:41 AM

Restricted Application added subscribers: Jay8g, TerraCodes, Aklapper. Jan 6 2017, 9:41 AM

faidon mentioned this in T154759: Pybal not happy with DNS delays. Jan 6 2017, 9:48 AM

faidon updated the task description.

The impact on varnish errors has been minimal. In codfw we've had two hiccups, one at 8:30 and another, smaller one at 8:40.

2017-01-06-codfw-503s.png (873×1 px, 89 KB)

In ulsfo we had a small 503 spike at 8:51.

2017-01-06-ulsfo-503.png (873×1 px, 129 KB)

We depooled codfw in DNS at about 8:41, while the traffic route change from ulsfo -> codfw to ulsfo -> eqiad happened at about 8:50.

mark subscribed. Jan 6 2017, 12:14 PM

• Avcipsids set Security to Software security bug. Jan 6 2017, 12:37 PM

• Avcipsids added a project: acl*security.

• Avcipsids changed the visibility from "Public (No Login Required)" to "Custom Policy".

• Avcipsids subscribed.

Restricted Application removed a subscriber: TerraCodes. Jan 6 2017, 12:37 PM

• Avcipsids awarded a token. Jan 6 2017, 12:46 PM

• Avcipsids unsubscribed. Jan 6 2017, 12:59 PM

Reedy removed a project: acl*security. Jan 6 2017, 2:12 PM

Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)".

Restricted Application added a project: acl*security. Jan 6 2017, 2:12 PM

The switch has been brought back up with a hard power cycle.

We don't have a real indication yet of why it crashed. There's nothing concrete in the logs (local or otherwise), other than the entire stack having been generally unhappy/desynced for several days, with messages like:

Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_OPEN_TIMEOUT: Timeout connecting to peer 'lacp'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_WRITE_LOSTCONN: Lost connection to peer 'app-engine-management-service'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_SELECT_FAILED: select failed for peer app-engine-management-service: Bad file descriptor
Jan  4 22:05:12  asw-a-codfw /kernel: et-7/0/52: get tlv ppfeid 0et-2/0/51: get tlv ppfeid 0et-2/0/52: get tlv ppfeid 0et-7/0/51: get tlv ppfeid 0et-7/0/52: get tlv ppfeid 0
Jan  5 14:42:55  asw-a-codfw /kernel: ge-8/0/6: get tlv ppfeid 0
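
When looking for a pattern in messages like these, it can help to tally them by day and process, and to list which interfaces the kernel lines mention. The following is a minimal sketch assuming the syslog layout seen in the excerpt above; the regular expressions and the helper names (`parse_lines`, `count_by_day_and_process`, `interfaces_mentioned`) are illustrative assumptions, not an official Junos log parser.

```python
import re
from collections import Counter

# Log excerpt copied from the comment above.
LOG = """\
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_OPEN_TIMEOUT: Timeout connecting to peer 'lacp'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_WRITE_LOSTCONN: Lost connection to peer 'app-engine-management-service'
Jan  4 22:01:15  asw-a-codfw mgd[57604]: UI_SELECT_FAILED: select failed for peer app-engine-management-service: Bad file descriptor
Jan  4 22:05:12  asw-a-codfw /kernel: et-7/0/52: get tlv ppfeid 0et-2/0/51: get tlv ppfeid 0et-2/0/52: get tlv ppfeid 0et-7/0/51: get tlv ppfeid 0et-7/0/52: get tlv ppfeid 0
Jan  5 14:42:55  asw-a-codfw /kernel: ge-8/0/6: get tlv ppfeid 0
"""

LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<process>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<message>.*)$"
)
# Interface names such as et-7/0/52 or ge-8/0/6.
INTERFACE_RE = re.compile(r"[a-z]{2}-\d+/\d+/\d+")


def parse_lines(text):
    """Return one dict per line that matches the assumed syslog layout."""
    return [m.groupdict() for m in map(LINE_RE.match, text.splitlines()) if m]


def count_by_day_and_process(text):
    """Count messages keyed by ('Mon D', process)."""
    return Counter(
        (f"{e['month']} {e['day']}", e["process"]) for e in parse_lines(text)
    )


def interfaces_mentioned(text):
    """Count interface names appearing in kernel messages only."""
    counts = Counter()
    for entry in parse_lines(text):
        if entry["process"] == "/kernel":
            counts.update(INTERFACE_RE.findall(entry["message"]))
    return counts


if __name__ == "__main__":
    print(count_by_day_and_process(LOG))
    print(interfaces_mentioned(LOG))
```

Run against a longer log window, a tally like this could show whether the mgd and kernel messages cluster around particular members or uplinks of the stack.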

mark lowered the priority of this task from Unbreak Now! to Medium. Jan 6 2017, 3:40 PM

Mentioned in SAL (#wikimedia-operations) [2017-01-06T16:05:19Z] <ema> wiping codfw caches T154758