Multiple host down alerts from rack C2
Closed, ResolvedPublic

Description

Hi Papaul!

I see multiple nodes from rack C2 reported down by Icinga. Did anything happen to it? Maybe PSU-related?

If you are a service owner of any of the servers listed below, please check the box if you are able to depool the server on April 27th before 10:30am CT, the time set to replace the faulty switch. This will take approximately 1 hour or less. Thanks

The servers below are just the ones in rack C2; see https://netbox.wikimedia.org/dcim/devices/?rack_id=60

  • ms-be -- ok to lose connectivity for a while, no depool @fgiunchedi
  • moss -- not in service yet @fgiunchedi
  • kafka logging -- not in service yet (T279342) @fgiunchedi
  • elastic -- Search team will monitor during switch replacement; no need to depool/ban from es cluster before replacement
  • dns - @BBlack - can do, Traffic needs to manual-depool before outage
  • cp - @BBlack - can do, Traffic needs to manual-depool before outage
  • lvs - @BBlack - can do, Traffic needs to manual-depool before outage
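
For cross-checking the 10:30am CT window against the UTC timestamps used in the SAL entries on this task, here is a minimal sketch of the conversion. It assumes "CT" means US Central Daylight Time (UTC-5), which is in effect in late April, and that the year is 2021 as in the rest of this task.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CT" on April 27th is Central Daylight Time, a fixed UTC-5 offset.
CDT = timezone(timedelta(hours=-5), "CDT")

# Deadline for depooling, as stated in the task description.
window_start = datetime(2021, 4, 27, 10, 30, tzinfo=CDT)

# Same instant expressed in UTC, the timezone used by SAL log entries.
window_start_utc = window_start.astimezone(timezone.utc)

print(window_start_utc.isoformat())  # 2021-04-27T15:30:00+00:00
```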

https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-06_partial_rack_codfw

Event Timeline

elukey triaged this task as High priority.Apr 6 2021, 5:16 PM
elukey created this task.

Papaul moved the affected hosts to new switch ports, connectivity restored.

Followed up with Juniper to replace the switch or the failed parts.

Case number 2021-0406-0609 created.

David Valverde

12:39 AM (9 hours ago)

to me, support

Good night Papaul

Hope you’re doing well

I would like to inform that this RMA was already processed by the Logistics Team, let me provide you the corresponding information about it.

RMA-ID: R200345406

Serial number:

Product: QFX5100-48S-AFI

This is an RTF RMA, you need to return the defective part to Juniper. Upon receipt, the defective part will be repaired (or replaced) and shipped back to you.

The Logistics Team will handle the RTF RMA case directly, separately from this Technical Case, they will notify you once the defective equipment has been received and then they will provide you the information for the new switch that will be sent to you and the Tracking number as well, thank you for your patience.

Good morning Papaul

Hope you’re doing well

Thank you for your response, let me answer the additional question that you have today.

What do we do with the license we currently have associated with that serial number?

Answer:

Once the unit has been replaced and you have the New unit, just use the following KB to transfer the license that is currently associated with the defective unit.

Transfer license keys to an RMA replacement device

https://kb.juniper.net/InfoCenter/index?page=content&id=KB13500&actp=METADATA

Ok, because of this RTF RMA we're going to replace the switch with a spare.
@Papaul Let's chat on IRC to figure out what time would work best for you, then we can notify service owners and draft up a plan.

This comment was removed by Papaul.
Volans subscribed.

@ayounsi do we already have a plan for how to manage the swap in Netbox? Should we discuss it?

Because it's part of a VC, the easiest approach is to swap the serial number (and other attributes like the procurement task) on the existing Netbox device.

Mentioned in SAL (#wikimedia-operations) [2021-04-27T14:33:28Z] <bblack> dns2001 - depooling for T279457 (disable puppet + stop bird)

Mentioned in SAL (#wikimedia-operations) [2021-04-27T14:36:39Z] <bblack> cp203[56] - depool all etcd services via confctl - T279457

Mentioned in SAL (#wikimedia-operations) [2021-04-27T14:47:20Z] <bblack> lvs2009 - disable puppet + stop pybal (internal services will move to lvs2010, please avoid LVS service definition changes for now!) - T279457

Traffic stuff (lvs/cp/dns) is depooled, downtimed, and ready for the network fixups.

Note to our future selves: we forgot to consider the cross-row LVS connections in this downtime: lvs2008 and lvs2010 do not live in row C at all, but had cross-row connections via C2 to reach all the rest of the service hosts in row C!
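
One way to catch this kind of dependency ahead of a future switch downtime is to look at where each link terminates rather than where each host is racked. The sketch below is illustrative only: the `Link` record, the `hosts_affected_by_switch` helper, and the sample data (including which rack each host sits in) are hypothetical and are not taken from Netbox or any real export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    host: str         # server name
    host_rack: str    # rack where the server physically lives
    switch_rack: str  # rack of the switch this link terminates on


def hosts_affected_by_switch(links, switch_rack):
    """Split hosts with a link to the given switch into local and remote.

    "Local" hosts sit in the same rack as the switch; "remote" hosts sit
    elsewhere but still have a link to it (the easy ones to forget).
    """
    on_switch = [link for link in links if link.switch_rack == switch_rack]
    local = sorted({l.host for l in on_switch if l.host_rack == switch_rack})
    remote = sorted({l.host for l in on_switch if l.host_rack != switch_rack})
    return local, remote


# Hypothetical sample data, loosely modelled on the hosts named in this task.
LINKS = [
    Link("lvs2009", "C2", "C2"),
    Link("cp2035", "C2", "C2"),
    Link("lvs2008", "outside-row-C", "C2"),
    Link("lvs2010", "outside-row-C", "C2"),
    Link("lvs2010", "outside-row-C", "outside-row-C"),
]

local, remote = hosts_affected_by_switch(LINKS, "C2")
print("in rack:", local)
print("cross-row:", remote)
```

Filtering on the switch end of each link, instead of on the rack of the host, is what surfaces hosts like lvs2008 and lvs2010 in the "cross-row" list.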

Mentioned in SAL (#wikimedia-operations) [2021-04-27T18:11:04Z] <bblack> dns2001 - restarting bird to repool, then re-enabling puppet - T279457

Mentioned in SAL (#wikimedia-operations) [2021-04-27T18:20:32Z] <bblack> cp203[56] - repooling in etcd - T279457

Mentioned in SAL (#wikimedia-operations) [2021-04-27T18:32:19Z] <bblack> lvs2009 - restart pybal + re-run puppet agent - T279457

Traffic lvs/cp/dns are all repooled, un-downtimed, and green.

Waiting until the other C2 hosts are fully reconfigured (network ports) before re-pooling codfw at the public traffic level.

Change 683041 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] Revert "Depool codfw traffic"

https://gerrit.wikimedia.org/r/683041

Change 683041 merged by BBlack:

[operations/dns@master] Revert "Depool codfw traffic"

https://gerrit.wikimedia.org/r/683041

Mentioned in SAL (#wikimedia-operations) [2021-04-27T20:32:46Z] <bblack> re-pooling codfw public traffic - T279457

Switch replaced, onsite work complete, and Netbox updated. Will be shipping the faulty switch tomorrow.

Papaul lowered the priority of this task from High to Low.Apr 27 2021, 10:22 PM

The faulty switch was delivered to Juniper today.

Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juniper has received the following defective part at our return center location.

RMA Number: R200345406
Defective Line Item Number: 100
Defective Serial Number: 
Defective Product ID: QFX5100-48S-AFI
Defective Product Description: QFX5100, 48x10G+6x40G, 2 AC, BF
Service Level: RETURN TO FACTORY REPAIR OR REPLACE

Received the new switch. Resolving this task.