
asw-a2-codfw unresponsive
Closed, Resolved · Public · Request

Description

asw-a2-codfw went down on 2021-07-16 and did not return when we had remote hands powercycle it.

This switch will need to be swapped out with one of the spare codfw switches.

Once swapped, a support ticket should be opened with JTAC to get the current asw-a2-codfw repaired.

initial task description

asw-a2-codfw went unresponsive today and cannot be reached over either serial or SSH.

As codfw is currently live, leaving this switch broken over the weekend is not ideal, but having CyrusOne remote hands swap it for a spare is a big ask.

The stopgap measure seems to be to ask them to reboot the single switch by removing its power.


Follows-up https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-16_asw-a2-codfw_network

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH removed a project: Infrastructure-Foundations.

Please note I'm not putting this request into CyrusOne until after Arzhel confirms they are ready for this step.

The first log of an issue was this message, sent from the master switch in the virtual-chassis:

Jul 16, 2021 @ 13:15:55.000 %-SNMP_TRAP_LINK_DOWN: ifIndex 927, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.0

Just recording for timestamp.
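For anyone correlating this trap with other logs later, here is a minimal Python sketch that pulls the fields out of a line shaped like the one above. The regular expression is inferred from that single sample, and the function name `parse_link_down` and the returned field names are hypothetical, not part of any existing tooling. The log line does not state a timezone, so the timestamp is returned as a naive `datetime`.

```python
import re
from datetime import datetime

# Pattern inferred from the single sample line in this task (an assumption,
# not a documented log format).
TRAP_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2}, \d{4} @ \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"%-SNMP_TRAP_LINK_DOWN: ifIndex (?P<if_index>\d+), "
    r"ifAdminStatus (?P<admin>\w+)\(\d+\), "
    r"ifOperStatus (?P<oper>\w+)\(\d+\), "
    r"ifName (?P<if_name>\S+)$"
)


def parse_link_down(line):
    """Return a dict of trap fields, or None if the line does not match."""
    m = TRAP_RE.match(line.strip())
    if m is None:
        return None
    return {
        # Naive datetime: the source line carries no timezone information.
        "timestamp": datetime.strptime(m["ts"], "%b %d, %Y @ %H:%M:%S.%f"),
        "if_index": int(m["if_index"]),
        "admin_status": m["admin"],
        "oper_status": m["oper"],
        "if_name": m["if_name"],
    }


if __name__ == "__main__":
    sample = (
        "Jul 16, 2021 @ 13:15:55.000 %-SNMP_TRAP_LINK_DOWN: ifIndex 927, "
        "ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.0"
    )
    print(parse_link_down(sample))
```

Lines that do not match the pattern return `None` rather than raising, so the function can be mapped over a mixed log without extra filtering.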

@RobH please proceed as discussed and ask remote hands to power cycle this device. We have set all port configuration for members of that switch to disable on the VC master, so a sudden reboot should not affect anything. Hopefully it shows life on the console so we can assess its health and make a call on whether to bring them back up. Of course, it may be completely dead and require replacement.

Mentioned in SAL (#wikimedia-operations) [2021-07-16T14:40:50Z] <topranks> Running homer to disable et-0/0/0 on cr1-codfw, which connects to currently dead device asw-a2-codfw T286787

I've opened support ticket rCICF2022508204be with CyrusOne to have them use remote hands to powercycle this.

The switch is a Juniper QFX5100-48S-6Q, labeled asw-a2-codfw, located in U26 (rear facing) with serial TA3713500006. The switch in question is located in rack A2, which is the first row of racks when entering our cage. A2 should be labeled.

Can you snap a photo or just describe which LEDs are illuminated both before and after powercycle?

This switch has two power cables. Both need to be unseated from the back of the chassis at the same time, so that all power is removed for 10-15 seconds, before both are plugged back in.

Once this has been completed, please let us know!

Remote hands has completed the powercycle of the switch (by removing all power cables). Both before and after power removal, all LEDs were illuminated, which is not ideal.

@cmooney confirmed there was no change in serial output and the switch remains unresponsive.

camera_image_20210716100844372.jpg (2×4 px, 2 MB)

RobH added a subscriber: wiki_willy.

I'll attempt to summarize the IRC discussion.

@ayounsi, @cmooney, and I discussed how it is likely safer to let a single switch sit broken over the weekend than to have remote hands attempt a swap, which may break the entire row/stack and drastically impact codfw's ability to serve the site. A swap by remote hands is more likely to break something than another switch or this row's uplink is to fail in the meantime.

I was overly cautious and pinged both @wiki_willy and @paravoid to ensure they were on board with this plan. All are in agreement to wait until Papaul can return and swap this switch out with one of the spares; after that we can get this switch repaired under JTAC.

Change 704992 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] smokeping: don't poll authdns2001

https://gerrit.wikimedia.org/r/704992

Change 704992 merged by Filippo Giunchedi:

[operations/puppet@production] smokeping: don't poll authdns2001

https://gerrit.wikimedia.org/r/704992

Change 705348 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/dns@master] admin_state: Depool codfw text

https://gerrit.wikimedia.org/r/705348

Change 705348 merged by Vgutierrez:

[operations/dns@master] admin_state: Depool codfw text

https://gerrit.wikimedia.org/r/705348

Mentioned in SAL (#wikimedia-operations) [2021-07-19T16:40:04Z] <XioNoX> update asw-a2-codfw serial number - T286787

Mentioned in SAL (#wikimedia-operations) [2021-07-19T17:04:07Z] <XioNoX> enable cr1-codfw / et-0/0/0 - T286787

Mentioned in SAL (#wikimedia-operations) [2021-07-19T17:10:26Z] <XioNoX> enable asw-a2-codfw access ports - T286787

Switch is back up online and Netbox has been updated.

Case number 2021-0719-0629 created with Juniper.

Dear Juniper Networks Customer,

A Return to Factory (RTF) RMA has been created. Details of which are provided below.

* RMA DETAILS *

RMA Number: R200361905
Defective Line Item Number: 100
Defective Serial Number:
Defective Product ID: QFX5100-48S-AFI
Defective Product Description: QFX5100, 48x10G+6x40G, 2 AC, BF
Service Level: RETURN TO FACTORY REPAIR OR REPLACE

Please note the following actions that you need to carefully follow:

a) Since this is an RTF RMA, you need to return the defective part to Juniper. Upon receipt, the defective part will be repaired (or replaced) and shipped back to you.

b) The defective part needs to be packaged using standard export packaging and shipped to the appropriate Juniper receiving location based in your country. To view a list of these locations kindly refer to Global RMA Returns Locations. Follow the labeling instructions provided on this page to ensure proper receipt by Juniper and subsequent handling. Please also, refer to the region specific return instruction documents available on the same web page.

c) The costs associated with defective return shipment including any local taxes are the sole responsibility of the customer. Juniper Networks will NOT assist with return shipping costs under the RTF service level agreement.

d) Return only the stated defective part. Juniper Networks will NOT be held responsible for the return of any ADDITIONAL items shipped to us in error against an RMA.

e) If this RMA is for a DEFECTIVE CHASSIS please REMOVE ALL Components inside the chassis and ONLY return the CHASSIS. DO NOT return mounting brackets, filters or cables along with the chassis unit. DO NOT RETURN chassis power supply units, any cards, fan trays of the chassis unless separate defective line items have already been created for it.

f) If this RMA is for a DEFECTIVE CARD, ONLY return the card and no other components connected to it. DO NOT RETURN optic transceivers within interface cards nor any cables.

g) You can update the RMA via the Case Manager portal and inform us of the carrier and tracking number by which you have sent the defective part.

h) If Juniper does not receive the defective part within 45 days from the date of this notification, the RMA will be closed.

* DELIVERY OF REPLACEMENT PART BY JUNIPER *
  • Upon receipt of the defective part at the dedicated Repair Center, the defective part will be repaired or replaced and shipped to you from the repair center. Further shipping details will follow, pending air transit and customs clearance procedures.
  • Our local logistics vendor may contact you directly for further information and/or any permits required in order to complete the delivery. The delivery of your replacement may be delayed if you are unable to provide this information.

Switch shipped out today; tracking information below.

Tracking Number:
1ZA19A021295420730

Dear Juniper Networks Customer,

Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juniper has received the following defective part at our return center location.

Your replacement part associated with RMA R200361905 Item # 100 has been successfully shipped. Details of which are provided below.

Replacement Serial Number:
Replacement Line Item Number: 110
Replacement Product ID: QFX5100-48S-AFI
Replacement Product Description: QFX5100, 48x10G+6x40G, 2 AC, BF
Sent by Carrier: UPS
Tracking Number: 1Z7AF3881256309562
Tracking URL: https://wwwapps.ups.com/WebTracking/track?track=yes&trackNums=1Z7AF3881256309562

Received the replacement switch. Racked it in C1 U43, set the mgmt password to match the server mgmt password, and updated Netbox with the new serial number.

Krinkle updated the task description.