Switch on rack C7 in codfw is down
Closed, ResolvedPublic

Assigned To

Authored By

	elukey
	Nov 15 2020, 8:09 AM

Description

Hi everybody,

elukey@asw-c-codfw> show chassis fpc detail 7
Slot 7 information:
  State                               Online
  Temperature                      38 degrees C / 100 degrees F
  Total CPU DRAM                 1953 MB
  Total SRAM                        0 MB
  Total SDRAM                       0 MB
  Start time                          2020-11-15 05:41:06 UTC
  Uptime                              2 hours, 18 minutes, 31 seconds    <=====================

This caused all the nodes in rack C7 codfw to alert (host down) multiple times, starting at around 2020-11-15 05:26 UTC.

It then went down for good at around 2020-11-15 09:27 UTC.
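As a quick cross-check of the timestamps above, the start time and uptime reported by the switch can be added together to see when the `show chassis fpc detail` output was captured. This is only a sketch: the `parse_uptime` helper is a hypothetical name introduced here, and it assumes the uptime string always uses the "N hours, N minutes, N seconds" wording seen in the output.

```python
import re
from datetime import datetime, timedelta, timezone

# Map the singular unit names found in the CLI output to timedelta keywords.
UNIT_TO_KWARG = {"day": "days", "hour": "hours", "minute": "minutes", "second": "seconds"}


def parse_uptime(text: str) -> timedelta:
    """Parse an uptime such as '2 hours, 18 minutes, 31 seconds' (hypothetical helper)."""
    kwargs = {}
    for value, unit in re.findall(r"(\d+)\s+(day|hour|minute|second)s?", text):
        kwargs[UNIT_TO_KWARG[unit]] = int(value)
    return timedelta(**kwargs)


# Values copied from the "Start time" and "Uptime" lines of the output above.
start = datetime(2020, 11, 15, 5, 41, 6, tzinfo=timezone.utc)
uptime = parse_uptime("2 hours, 18 minutes, 31 seconds")

# The moment the command was run is the start time plus the uptime.
snapshot = start + uptime
print(snapshot.isoformat())  # 2020-11-15T07:59:37+00:00
```

The reported start time (05:41:06 UTC) is roughly 15 minutes after the first host-down alerts (around 05:26 UTC), which fits the member still being in the middle of its reboot when the first alerts fired.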

Related Objects

Status	Assigned	Task
Resolved	ayounsi	T267865 Switch on rack C7 in codfw is down
Open	None	T267867 purged is not resilient to kafka main nodes going down
Resolved	Gehel	T267903 Unallocated shards on elasticsearch codfw cluster after switch failure

Event Timeline

elukey created this task. Nov 15 2020, 8:09 AM

Restricted Application added a project: SRE. Nov 15 2020, 8:09 AM

Restricted Application added a subscriber: Aklapper.

Peachey88 mentioned this in T267864: Network flap on cloudbackup2002. Nov 15 2020, 8:25 AM

elukey merged a task: T267864: Network flap on cloudbackup2002. Nov 15 2020, 8:29 AM

elukey added subscribers: Bstorm, Peachey88.

Went down again, but this time there was no recovery.

elukey renamed this task from Switch on rack C7 in codfw got rebooted to Switch on rack C7 in codfw is down. Nov 15 2020, 10:15 AM

elukey triaged this task as High priority.

elukey updated the task description. (Show Details)

Current impact:

purged on some cp2/cp4 nodes got stuck while connecting to kafka-main2003, a manual restart was needed.
the kafka-main cluster is currently in reduced capacity (2 nodes instead of 3)
some ms-be backend nodes are down but it should be fine (in theory)
ES codfw master nodes were hit, but the cluster is recovering (Gehel checked)
lvs2007's leg in row C is impaired
cloudbackup2002 down (@Bstorm to verify the impact)

Currently it seems that we are not really terribly on fire (and it is good since a rack down shouldn't be that dramatic).

Down around Nov 15 09:28:34 UTC.

Console is unresponsive.

Opened JTAC case 2020-1115-0083.

If it's stable I'd prefer to leave it as is, as it's a spine.

If it's not then we will have to take it out of the VC.

We also have spare QFX5100s, so on Monday we can swap the dead one.

Netbox device and list of connected servers: https://netbox.wikimedia.org/dcim/devices/1892/

Mentioned in SAL (#wikimedia-operations) [2020-11-15T11:12:35Z] <vgutierrez> depooling lvs2007, lvs2010 taking over text traffic on codfw - T267865

switching over to lvs2010 as it will allow us to recover cp2035, only losing cp2037 on text and cp2038 on upload VS losing cp2035 and cp2037 on text with lvs2007
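The trade-off described above can be laid out as a small comparison. The host sets are taken from the comment; the dictionary layout and the `worst_case_loss` helper are illustrative names introduced here, and reading the choice as "keep the worst-hit service as small as possible" is an interpretation, not something the comment states.

```python
# Cache hosts reported as lost per service, for each choice of active LVS.
# Only hosts named in the comment are listed; nothing else is assumed.
lost_hosts = {
    "lvs2007": {"text": {"cp2035", "cp2037"}},
    "lvs2010": {"text": {"cp2037"}, "upload": {"cp2038"}},
}


def worst_case_loss(option: str) -> int:
    """Largest number of hosts any single service loses under this option."""
    return max(len(hosts) for hosts in lost_hosts[option].values())


# Hosts that come back on text when moving from lvs2007 to lvs2010.
recovered_on_text = lost_hosts["lvs2007"]["text"] - lost_hosts["lvs2010"]["text"]

print(sorted(recovered_on_text))   # ['cp2035']
print(worst_case_loss("lvs2007"))  # 2
print(worst_case_loss("lvs2010"))  # 1
```

Under this reading, lvs2010 still leaves two hosts unavailable in total, but no single service loses more than one.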

Mentioned in SAL (#wikimedia-cloud) [2020-11-15T11:21:26Z] <arturo> icinga downtime cloudbackup2002 for 48h (T267865)

CDanis subscribed. Nov 15 2020, 1:51 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:03:38Z] <cdanis> T267867 T267865 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged'

Stashbot mentioned this in T267867: purged is not resilient to kafka main nodes going down. Nov 15 2020, 10:03 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:10:34Z] <cdanis> restart some purgeds in ulsfo as well T267865 T267867

RhinosF1 subscribed. Nov 15 2020, 10:23 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-16T00:09:08Z] <elukey> sudo cumin 'cp2028* or cp2036* or cp2039* or cp4022* or cp4025* or cp4028* or cp4031*' 'systemctl restart purged' -b 3 - T267865
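For reference, a command line like the one logged above can be assembled from a list of host prefixes instead of being typed by hand. This is a minimal sketch: `build_cumin_command` is a hypothetical helper, it only builds and prints the string (it does not run cumin), and the only options used are the `-b` batch size and the `or`-joined glob query that already appear in the log entries.

```python
import shlex


def build_cumin_command(host_prefixes, remote_command, batch_size):
    """Return a shell-quoted cumin invocation (hypothetical helper, nothing is executed)."""
    # Each prefix becomes a glob, joined with "or" as in the logged command.
    query = " or ".join(f"{prefix}*" for prefix in host_prefixes)
    argv = ["sudo", "cumin", "-b", str(batch_size), query, remote_command]
    # shlex.join (Python 3.8+) quotes arguments containing spaces or globs.
    return shlex.join(argv)


# Host prefixes copied from the SAL entry above.
stuck_hosts = ["cp2028", "cp2036", "cp2039", "cp4022", "cp4025", "cp4028", "cp4031"]
command = build_cumin_command(stuck_hosts, "systemctl restart purged", batch_size=3)
print(command)
```

Printing the command before running it leaves room to review the host selection, which matters when restarting a service in batches across cache hosts.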

Mentioned in SAL (#wikimedia-operations) [2020-11-16T00:19:39Z] <elukey> run 'systemctl mask kafka' and 'systemctl mask kafka-mirror-main-eqiad_to_main-codfw@0' on kafka-main2003 (for the brief moment when it was up) to avoid purged issues - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T00:55:47Z] <shdubsh> re-applied mask to kafka and kafka-mirror-main-eqiad_to_main-codfw@0 on kafka-main2003 and disabled puppet to prevent restart - T267865

colewhite subscribed. Nov 16 2020, 12:56 AM

Just added two days of downtime to all the hosts in the rack, hopefully it will be less spammy.

As a follow-up to this task, I think we should prioritize T225005; having only 3 kafka-main brokers is scary in these situations :(

Mentioned in SAL (#wikimedia-operations) [2020-11-16T08:08:37Z] <XioNoX> asw-c-codfw> request system power-off member 7 - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T08:32:39Z] <XioNoX> asw-c-codfw> request system power-off member 7 - T267865

Gehel mentioned this in T267903: Unallocated shards on elasticsearch codfw cluster after switch failure. Nov 16 2020, 9:41 AM

Icinga downtime for 1 day, 0:00:00 set by dcaro@cumin1001 on 1 host(s) and their services with reason: The switch it depends on is down

cloudbackup2002.codfw.wmnet

Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board. Nov 16 2020, 2:17 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-16T16:16:43Z] <XioNoX> update c7 serial in row C VC config - T267865

Spare switch configured.
Relevant doc: https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-adding-device.html

Old (failed): https://netbox.wikimedia.org/dcim/devices/1892/
New (spare): https://netbox.wikimedia.org/dcim/devices/235/
That's where T259166 would be useful.

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:21:56Z] <vgutierrez> repooling cp2037 and cp2038 - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:24:20Z] <vgutierrez> switching back from lvs2010 to lvs2007 - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:36:14Z] <volans> moved interfaces in Netbox from old to new switch - T267865

I've run this piece of code to migrate the interfaces from the old to the new device in a Netbox nbshell.

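# Run inside a Netbox nbshell session, which is expected to preload the
# models used here (User, Device), so no model imports are needed.
import uuid

OLD_DEVICE_ID = 1892  # failed switch
NEW_DEVICE_ID = 235   # spare switch

# Share one request_id so all changelog entries are grouped together.
request_id = uuid.uuid4()
user = User.objects.get(username='volans')
device = Device.objects.get(id=OLD_DEVICE_ID)

for iface in device.interfaces.all():
    # Re-parent the interface onto the new device.
    iface.device_id = NEW_DEVICE_ID
    # Write the changelog entry explicitly, attributed to the user above.
    log = iface.to_objectchange('update')
    log.request_id = request_id
    log.user = user
    log.save()
    iface.save()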

Volans mentioned this in T259166: Move device attributes. Nov 16 2020, 5:39 PM

Thanks, I think we're all done here.

RMA in T267950.

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:48:45Z] <elukey> enable and run puppet on kafka-main2003 (it will start kafka services) - T267865

elukey mentioned this in T268121: Kafka mirror maker codfw -> eqiad in warning state for low consumer throughput. Nov 18 2020, 9:57 AM

Gehel closed subtask T267903: Unallocated shards on elasticsearch codfw cluster after switch failure as Resolved. Nov 23 2020, 1:15 PM

ayounsi mentioned this in T271701: Restore asw-c7-codfw cables. Jan 11 2021, 10:43 AM
