
Switch on rack C7 in codfw is down
Closed, Resolved · Public

Description

Hi everybody,

elukey@asw-c-codfw> show chassis fpc detail 7
Slot 7 information:
  State                               Online
  Temperature                      38 degrees C / 100 degrees F
  Total CPU DRAM                 1953 MB
  Total SRAM                        0 MB
  Total SDRAM                       0 MB
  Start time                          2020-11-15 05:41:06 UTC
  Uptime                              2 hours, 18 minutes, 31 seconds    <=====================

This caused all the nodes in rack C7 codfw to alert (host down) multiple times, starting at around 2020-11-15 05:26 UTC.

It then went down for good at around 2020-11-15 09:27 UTC.
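As a rough cross-check of the timeline, the moment the `show chassis fpc detail 7` output above was captured can be derived by adding the reported uptime to the reported start time. This is only a sketch: `parse_uptime` is a hypothetical helper, and the assumption that a longer uptime would add a leading "N days" field is mine, not something shown in the output.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_uptime(text: str) -> timedelta:
    """Parse an uptime such as '2 hours, 18 minutes, 31 seconds'.

    Hypothetical helper; the optional 'days' unit is an assumption.
    """
    parts = {
        unit: int(value)
        for value, unit in re.findall(r"(\d+) (day|hour|minute|second)s?", text)
    }
    return timedelta(
        days=parts.get("day", 0),
        hours=parts.get("hour", 0),
        minutes=parts.get("minute", 0),
        seconds=parts.get("second", 0),
    )


# Values copied from the FPC 7 output in the description.
start_time = datetime.strptime(
    "2020-11-15 05:41:06 UTC", "%Y-%m-%d %H:%M:%S UTC"
).replace(tzinfo=timezone.utc)
uptime = parse_uptime("2 hours, 18 minutes, 31 seconds")

# The command was run at start time + uptime.
captured_at = start_time + uptime
print(captured_at.isoformat())  # 2020-11-15T07:59:37+00:00
```

So the output was taken at about 07:59 UTC, roughly an hour and a half before the switch went down for good.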

Event Timeline

elukey created this task. Sun, Nov 15, 8:09 AM
Restricted Application added a project: Operations. Sun, Nov 15, 8:09 AM
Restricted Application added a subscriber: Aklapper.

Went down again, but this time with no recovery.

elukey renamed this task from "Switch on rack C7 in codfw got rebooted" to "Switch on rack C7 in codfw is down". Sun, Nov 15, 10:15 AM
elukey triaged this task as High priority.
elukey updated the task description.
elukey added a comment. Edited · Sun, Nov 15, 10:18 AM

Current impact:

  • purged on some cp2/cp4 nodes got stuck while connecting to kafka-main2003; a manual restart was needed
  • the kafka-main cluster is currently running at reduced capacity (2 nodes instead of 3)
  • some ms-be backend nodes are down, but it should be fine (in theory)
  • ES codfw master nodes were hit, but the cluster is recovering (Gehel checked)
  • lvs2007's leg in row C is impaired
  • cloudbackup2002 down (@Bstorm to verify the impact)

Currently it seems that we are not really terribly on fire (which is good, since a rack going down shouldn't be that dramatic).

ayounsi added a subscriber: ayounsi. Edited · Sun, Nov 15, 10:41 AM

Down around Nov 15 09:28:34 UTC.

Console is unresponsive.

Opened JTAC case 2020-1115-0083.

If it's stable I'd prefer to leave it as is, as it's a spine.

If it's not then we will have to take it out of the VC.

ayounsi added a subscriber: Papaul.

We also have spare QFX5100s, so on Monday we can swap out the dead one.

Netbox device and list of connected servers: https://netbox.wikimedia.org/dcim/devices/1892/

Mentioned in SAL (#wikimedia-operations) [2020-11-15T11:12:35Z] <vgutierrez> depooling lvs2007, lvs2010 taking over text traffic on codfw - T267865

Switching over to lvs2010, as it will allow us to recover cp2035: we only lose cp2037 on text and cp2038 on upload, versus losing cp2035 and cp2037 on text with lvs2007.

Mentioned in SAL (#wikimedia-cloud) [2020-11-15T11:21:26Z] <arturo> icinga downtime cloudbackup2002 for 48h (T267865)

CDanis added a subscriber: CDanis. Sun, Nov 15, 1:51 PM

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:03:38Z] <cdanis> T267867 T267865 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged'

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:10:34Z] <cdanis> restart some purgeds in ulsfo as well T267865 T267867

Mentioned in SAL (#wikimedia-operations) [2020-11-16T00:09:08Z] <elukey> sudo cumin 'cp2028* or cp2036* or cp2039* or cp4022* or cp4025* or cp4028* or cp4031*' 'systemctl restart purged' -b 3 - T267865
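The host selection in the entry above is a list of glob patterns joined with `or`. A minimal sketch of how such a command line could be assembled from a list of host prefixes, so the query does not have to be typed by hand; `build_cumin_command` is a hypothetical helper, not part of cumin, and it only uses the `-b` batch-size flag and the query syntax already seen in the logged commands.

```python
import shlex


def build_cumin_command(host_prefixes, remote_command, batch_size):
    """Return a shell-quoted cumin command line (hypothetical helper)."""
    # One glob per host prefix, joined with 'or' as in the logged command.
    query = " or ".join(f"{prefix}*" for prefix in host_prefixes)
    argv = ["sudo", "cumin", "-b", str(batch_size), query, remote_command]
    # shlex.join quotes each argument so the query stays a single shell word.
    return shlex.join(argv)


hosts = ["cp2028", "cp2036", "cp2039", "cp4022", "cp4025", "cp4028", "cp4031"]
command = build_cumin_command(hosts, "systemctl restart purged", batch_size=3)
print(command)
```

The helper only builds the string; running it is left to the operator, which keeps a review step before anything touches production hosts.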

Mentioned in SAL (#wikimedia-operations) [2020-11-16T00:19:39Z] <elukey> run 'systemctl mask kafka' and 'systemctl mask kafka-mirror-main-eqiad_to_main-codfw@0' on kafka-main2003 (for the brief moment when it was up) to avoid purged issues - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T00:55:47Z] <shdubsh> re-applied mask to kafka and kafka-mirror-main-eqiad_to_main-codfw@0 on kafka-main2003 and disabled puppet to prevent restart - T267865

Just added two days of downtime to all the hosts in the rack; hopefully it will be less spammy.

As a follow-up to this task, I think that we should prioritize T225005; having only 3 kafka-main brokers is scary in these situations :(

Mentioned in SAL (#wikimedia-operations) [2020-11-16T08:08:37Z] <XioNoX> asw-c-codfw> request system power-off member 7 - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T08:32:39Z] <XioNoX> asw-c-codfw> request system power-off member 7 - T267865

Icinga downtime for 1 day, 0:00:00 set by dcaro@cumin1001 on 1 host(s) and their services with reason: The switch it depends on is down

cloudbackup2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-11-16T16:16:43Z] <XioNoX> update c7 serial in row C VC config - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:21:56Z] <vgutierrez> repooling cp2037 and cp2038 - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:24:20Z] <vgutierrez> switching back from lvs2010 to lvs2007 - T267865

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:36:14Z] <volans> moved interfaces in Netbox from old to new switch - T267865

Volans added a subscriber: Volans. Mon, Nov 16, 5:38 PM

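I've run this piece of code in a Netbox nbshell to migrate the interfaces from the old device to the new one.

import uuid

# nbshell preloads the Netbox models (User, Device), so only uuid is imported.
# A shared request_id groups all the change-log entries into one change.
request_id = uuid.uuid4()
user = User.objects.get(username='volans')
device = Device.objects.get(id=1892)  # old switch

for iface in device.interfaces.all():
    iface.device_id = 235  # new switch
    # Record the move in the Netbox change log, attributed to the user above.
    log = iface.to_objectchange('update')
    log.request_id = request_id
    log.user = user
    log.save()
    iface.save()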
ayounsi closed this task as Resolved. Mon, Nov 16, 5:40 PM
ayounsi claimed this task.

Thanks, I think we're all done here.

RMA in T267950.

Mentioned in SAL (#wikimedia-operations) [2020-11-16T17:48:45Z] <elukey> enable and run puppet on kafka-main2003 (it will start kafka services) - T267865