
eqiad row D switch fabric recabling
Closed, Resolved · Public

Description

Similar to what we did with all the other rows (except eqiad-C so far).

The current way the row D switches are connected to each other doesn't follow Juniper's recommended cabling, and could bring instability to the row (e.g. as seen in T252797#6137927).
It is also a prerequisite for adding a 3rd 10G switch (T196487).

On the other rows, the recabling went smoothly (no user-facing impact), except for one row where it caused a few seconds of downtime.

The plan (and what we did previously) is to pre-cable the missing VC links (with the VC ports disabled), then enable/disable ports in the proper order, and finish by removing the obsolete cables.
If there is any sign of instability, "rip off the bandaid" and turn all the ports on/off at once so we're only left with the good design, while keeping the commands handy to roll back at any time.
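The per-link pattern behind this plan can be sketched as follows (member/port numbers are placeholders, not the actual row D values; lines starting with # are annotations, not CLI input):

```
# enable the new, pre-cabled VC link first
request virtual-chassis vc-port set pic-slot 1 member <new-member> port <new-port>

# verify the new VCP shows as Up before touching the old link
show virtual-chassis vc-port

# then disable the old link it replaces
request virtual-chassis vc-port delete pic-slot 1 member <old-member> port <old-port>
```

Keeping the matching set/delete pair for each link in a scratchpad makes the rollback a copy/paste away.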

This will happen during the DC failover; we need to:

  1. Schedule a 1h window with the service owners and DCops - Thursday 24th, 1pm UTC
  2. Have each service owner decide if they can/want to depool their service out of eqiad row D (all servers impacted)
  3. Decide if we want to use that opportunity to do T196487 (which means a ~30min hard down of rack D4) - yes, next day
  4. Decide if we want to use that opportunity to do T247881 (which means a ~30min hard down of rack D1) - TBD

A few days before the window:
[DCops] Remove unused cables:

FPC1:1/0 (DAC 3m)
FPC8:1/0 (DAC 3m)

[Netops] Disable unused VC ports so they don't risk coming online when the new cables are connected:

request virtual-chassis vc-port delete pic-slot 1 member 3 port 3

request virtual-chassis vc-port delete pic-slot 1 member 4 port 2
request virtual-chassis vc-port delete pic-slot 1 member 4 port 3

request virtual-chassis vc-port delete pic-slot 1 member 5 port 2
request virtual-chassis vc-port delete pic-slot 1 member 5 port 3

request virtual-chassis vc-port delete pic-slot 1 member 6 port 3

request virtual-chassis vc-port delete pic-slot 1 member 8 port 0
request virtual-chassis vc-port delete pic-slot 1 member 8 port 3
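As a hedged sanity check (exact column layout varies by Junos release), the deletes can be confirmed by listing the remaining VC ports and the overall VC health before any cabling happens:

```
# the deleted ports should no longer appear as VCPs
show virtual-chassis vc-port

# the virtual chassis itself should be unchanged: all members present
show virtual-chassis status
```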

[DCops] Pre-cable:

FPC1:1/0 - FPC7:0/51 2xQSFP+-40G-SR4+MMF FIBER (MTP/MPO) 15m?
FPC3:1/3 - FPC7:0/52 7M DAC
FPC4:1/2 - FPC2:0/51 5M DAC
FPC5:1/2 - FPC7:0/53 5M DAC
FPC6:1/3 - FPC2:0/52 7M DAC
FPC8:1/0 - FPC2:0/53 2xQSFP+-40G-SR4+MMF FIBER (MTP/MPO) 15m?

[DCops/Netops] Update Netbox

[Netops] During the window: turn VC ports on/off to match the proper cabling:

request virtual-chassis vc-port set pic-slot 1 member 1 port 0
request virtual-chassis vc-port set pic-slot 0 member 7 port 51

request virtual-chassis vc-port delete pic-slot 1 member 1 port 1
request virtual-chassis vc-port delete pic-slot 1 member 3 port 0
----------
request virtual-chassis vc-port set pic-slot 0 member 2 port 53
request virtual-chassis vc-port set pic-slot 1 member 8 port 0

request virtual-chassis vc-port delete pic-slot 1 member 6 port 1
request virtual-chassis vc-port delete pic-slot 1 member 8 port 1
----------
request virtual-chassis vc-port set pic-slot 1 member 3 port 3
request virtual-chassis vc-port set pic-slot 0 member 7 port 52

request virtual-chassis vc-port delete pic-slot 1 member 3 port 2
request virtual-chassis vc-port delete pic-slot 1 member 4 port 0

request virtual-chassis vc-port set pic-slot 1 member 4 port 2
request virtual-chassis vc-port set pic-slot 0 member 2 port 51
----------
request virtual-chassis vc-port set pic-slot 1 member 6 port 3
request virtual-chassis vc-port set pic-slot 0 member 2 port 52

request virtual-chassis vc-port delete pic-slot 1 member 5 port 1
request virtual-chassis vc-port delete pic-slot 1 member 6 port 0

request virtual-chassis vc-port set pic-slot 1 member 5 port 2
request virtual-chassis vc-port set pic-slot 0 member 7 port 53
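After the last block, a quick health check along these lines (illustrative, not part of the original runbook) confirms the new topology before the old cables are pulled:

```
# all members should be Prsnt, with one Master and one Backup routing engine
show virtual-chassis status

# every remaining VCP should be Up at the expected speed, with the right neighbors
show virtual-chassis vc-port
```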

[DCops] Remove now unused cables and update Netbox

FPC1:1/1 - FPC3:1/0
FPC3:1/2 - FPC4:1/0
FPC5:1/1 - FPC6:1/0
FPC6:1/1 - FPC8:1/1

Event Timeline

ayounsi triaged this task as Medium priority.Jun 23 2020, 10:38 AM
ayounsi created this task.

@ayounsi do you have a timeframe for this? Is the expected downtime 30 minutes for the whole row?

There is no expected downtime, but the work could cause intermittent packet loss across the row during the re-cabling (e.g. unreachable hosts for a few seconds here and there during a 10-minute window).
The 30min window is to keep some margin.

ayounsi mentioned this in Unknown Object (Task).Aug 26 2020, 11:56 AM
RobH added a subtask: Unknown Object (Task).Aug 31 2020, 5:14 PM

Postponed to Wednesday 16th, 11am UTC, 30min, to give the cables and optics time to arrive.

Checking the full list of servers, we have 10 ms-be hosts in there. Since we're deploying Swift with row availability in mind, I'm OK with not depooling Swift out of eqiad for this. I'll be available for a depool though if there's a row outage and Swift eqiad becomes unavailable as a result.

Change 629681 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool eqiad for row D recabling

https://gerrit.wikimedia.org/r/629681

Change 629681 merged by Ayounsi:
[operations/dns@master] Depool eqiad for row D recabling

https://gerrit.wikimedia.org/r/629681

Mentioned in SAL (#wikimedia-operations) [2020-09-24T13:52:51Z] <XioNoX> depool eqiad for row D recabling - T256112

Mentioned in SAL (#wikimedia-operations) [2020-09-24T14:16:01Z] <XioNoX> [Netops] Disable unused VC ports to not risk them going online at connect: - T256112

Mentioned in SAL (#wikimedia-operations) [2020-09-24T14:28:37Z] <XioNoX> [Netops] In window: turn VC-ports on/off for proper cabling: - T256112

All done

  • we briefly (<5s) lost D1
  • some disabled ports automatically re-enabled themselves, causing some latency issues; e.g. this excerpt from show virtual-chassis vc-port:

1/1 Auto-Configured -1 Up 40000 3 vcp-255/1/0
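A hedged way to clean this up (member/port below are placeholders; on some Juniper platforms VC-capable ports auto-configure as VCPs whenever they detect a virtual-chassis neighbor on the far end) is to spot the offenders and re-run the delete once the obsolete cable is unplugged:

```
# list any ports that converted themselves back to VCPs
show virtual-chassis vc-port | match Auto-Configured

# re-disable them; with the old cable removed they should stay down
request virtual-chassis vc-port delete pic-slot 1 member <member> port <port>
```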

As the previous cables were not in Netbox, there is no need to add those for now.

wiki_willy closed subtask Unknown Object (Task) as Resolved.Oct 26 2020, 4:49 PM