eqiad row C switch fabric recabling
Closed, Resolved, Public

Description

For historical context:
When last refreshed, the virtual chassis was cabled in a similar way to the previous-generation ones (a mix of daisy-chain and spine/leaf) in order to save ports on the spines.
Unfortunately this turned out to be unsupported and caused various instabilities (fixed, for example, in T256112: eqiad row D switch fabric recabling).
Row C never showed those issues, most likely because it is one rack smaller (C1 is for Fundraising only). As recabling is a risky operation, it was decided to leave it as is.

Today's T313382: asw2-c5-eqiad crash had an impact on the other racks in row C, when it should only have impacted servers in C5. This is due to how traffic flows from some racks to the routers (LACP grouping + VRRP primary).
Another downside is that Juniper support might not want to move forward with any investigation as long as the cabling is not up to standards.

Ideally we would not need any intrusive action until the next hardware refresh, but that won't happen for at least 1 or 2 years.

To improve the situation:

  1. In this task, re-cable the VC so each "leaf" has a link to each "spine"
  2. T308339: eqiad: move non WMCS servers out of rack C8 will reduce the size of the VC
  3. T304712: eqiad: Move links to new MPC7E linecard will replace the current LACP setup (e.g. ae1 has 2 members to fpc2 and 2 to fpc7, same with ae2) with 1x40G per router, which will make traffic flows more straightforward
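
To make the goal of item 1 concrete, here is a minimal sketch of a check that every leaf has a direct link to every spine. The function name, member names and link list are made up for illustration; they are not the real row C cabling, which is only shown in the diagram.

```python
from typing import Dict, Iterable, Set, Tuple


def missing_spine_links(
    spines: Iterable[str],
    leaves: Iterable[str],
    links: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Return, for each leaf, the set of spines it has no direct link to.

    Leaves that are linked to every spine are left out of the result,
    so an empty dict means the fabric is a clean spine/leaf.
    """
    spine_set = set(spines)
    neighbours: Dict[str, Set[str]] = {}
    for a, b in links:
        # Links are undirected: record both directions.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    missing: Dict[str, Set[str]] = {}
    for leaf in leaves:
        gap = spine_set - neighbours.get(leaf, set())
        if gap:
            missing[leaf] = gap
    return missing


# Hypothetical example data (NOT the real eqiad row C topology).
SPINES = ["spine-a", "spine-b"]
LEAVES = ["leaf-1", "leaf-2", "leaf-3"]
MIXED_LINKS = [
    ("spine-a", "leaf-1"), ("spine-b", "leaf-1"),  # proper spine/leaf
    ("spine-a", "leaf-2"), ("leaf-2", "leaf-3"),   # leaf-to-leaf daisy chain
    ("spine-b", "leaf-3"),
]

gaps = missing_spine_links(SPINES, LEAVES, MIXED_LINKS)
for leaf, spines_missing in sorted(gaps.items()):
    print(f"{leaf} is missing a link to: {', '.join(sorted(spines_missing))}")
```

In this made-up layout the daisy-chained leaves each reach only one spine directly, which is the kind of mix this task removes.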

See the diagram below for a "current/final" view.

Virtual Chassis Fabric-eqiad row C.drawio.png (971×846 px, 109 KB)

The next step for this task is to buy the necessary DACs.

Event Timeline

ayounsi triaged this task as High priority. Jul 20 2022, 7:31 AM
ayounsi created this task.
ayounsi mentioned this in Unknown Object (Task). Jul 20 2022, 8:12 AM
ayounsi added a subtask: Unknown Object (Task).

Agreed, this is a good idea. I can see why it may have been "left alone" previously, but given we've had issues it's best to bite the bullet and do it.

The 40G uplinks will likely improve things also. +1

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Aug 17 2022, 6:56 PM

I see that the subtask got resolved, nice!

Please run the new/additional cables without connecting them.
Once done we will schedule a window to connect them (and remove the ones not needed anymore).

@Jclark-ctr @wiki_willy do you have any update on this task? It is set to high priority as it is meant to prevent an outage from recurring.

cableid 2207506656 fpc7 - fpc5
cableid 2207506655 fpc2 - fpc6
cableid 2207506658 fpc7 - fpc3

Ayounsi, do you know yet which ports these need to be plugged into?

@Jclark-ctr Awesome, thanks! We need to schedule a window to do the plugging/unplugging/reconfiguring. Would next Tuesday, Wednesday or Thursday in your morning work? (I'm thinking of a 2h window, but 1h would be enough.) Let me know what works best for you and I'll take care of the announcement.

@ayounsi is there a time window you prefer? I can be available at 1pm UTC on any day.

Synced on IRC, we're aiming for Thursday at 1pm UTC.

Plan of action:
General overview before/after. Red: deactivated/removed. Green: activated/added.

Virtual Chassis Fabric-eqiad row C.drawio.png (752×723 px, 98 KB)

We're going to work sequentially, leaf after leaf, first removing the invalid interfaces and links, then adding the correct ones. This means the leaves will temporarily be one-legged, but that is better than the other way around: in our experience, adding too many links causes instabilities.

1/ [dcops] Add 40G optics to ports 0/51 and 0/52 on FPC2 and FPC7
2/ [dcops] Connect the new cables on the FPC2/FPC7 sides only
3/ [netops] Enable the future VC ports on the spines

request virtual-chassis vc-port set pic-slot 0 port 51 member 2
request virtual-chassis vc-port set pic-slot 0 port 52 member 2
request virtual-chassis vc-port set pic-slot 0 port 51 member 7
request virtual-chassis vc-port set pic-slot 0 port 52 member 7

4/ [dcops] disconnect cable on fpc3:1/1
5/ [dcops] connect cable between fpc3:1/1 and fpc7:0/51
6/ [netops] verify fpc3 has fpc2 and fpc7 as peers: show virtual-chassis vc-port member 3
7/ [netops] disable old fpc5 vc-ports

request virtual-chassis vc-port delete pic-slot 1 port 1 member 5
request virtual-chassis vc-port delete pic-slot 1 port 2 member 5

8/ [dcops] disconnect cables on fpc5:1/0, fpc5:1/1, fpc5:1/2
9/ [dcops] connect cable between fpc5:1/0 and fpc7:0/52
10/ [netops] verify fpc5 has fpc2 and fpc7 as peers: show virtual-chassis vc-port member 5
11/ [netops] disable old fpc6 vc-ports

request virtual-chassis vc-port delete pic-slot 1 port 1 member 6
request virtual-chassis vc-port delete pic-slot 1 port 3 member 6

12/ [dcops] disconnect cables on fpc6:1/0, fpc6:1/1, fpc6:1/3
13/ [dcops] connect cable between fpc6:1/0 and fpc2:0/51
14/ [netops] verify fpc6 has fpc2 and fpc7 as peers: show virtual-chassis vc-port member 6
15/ [dcops] disconnect cable on fpc8:1/0
16/ [dcops] connect cable between fpc8:1/0 and fpc2:0/52
17/ [netops] verify fpc8 has fpc2 and fpc7 as peers: show virtual-chassis vc-port member 8
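
The vc-port commands in the plan above all follow one pattern, so they can be generated rather than typed by hand. This is only a sketch: `vc_port_commands` is a hypothetical helper that builds the command strings shown in this task; it does not connect to any switch, and the output should be reviewed before pasting.

```python
from typing import Iterable, List


def vc_port_commands(
    action: str,
    pic_slot: int,
    ports: Iterable[int],
    members: Iterable[int],
) -> List[str]:
    """Build 'request virtual-chassis vc-port' command strings.

    action is "set" (enable a VC port) or "delete" (disable it).
    Commands are ordered member by member, then port by port.
    """
    if action not in ("set", "delete"):
        raise ValueError(f"unsupported action: {action!r}")
    port_list = list(ports)  # allow any iterable, reused for each member
    return [
        f"request virtual-chassis vc-port {action} "
        f"pic-slot {pic_slot} port {port} member {member}"
        for member in members
        for port in port_list
    ]


# Step 3: enable the future VC ports (0/51 and 0/52) on the spines fpc2 and fpc7.
spine_cmds = vc_port_commands("set", pic_slot=0, ports=[51, 52], members=[2, 7])

# Step 7: disable the old fpc5 VC ports (1/1 and 1/2).
fpc5_cmds = vc_port_commands("delete", pic_slot=1, ports=[1, 2], members=[5])

# Step 11: disable the old fpc6 VC ports (1/1 and 1/3).
fpc6_cmds = vc_port_commands("delete", pic_slot=1, ports=[1, 3], members=[6])

for cmd in spine_cmds + fpc5_cmds + fpc6_cmds:
    print(cmd)
```

Generating the strings from the member and port numbers makes it easier to compare them against the diagram before the window.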

cableid c220756659 fpc2 - fpc8.

This has been completed smoothly!

I deleted the following VC cables from Netbox:
0315
0316
0317
0318
0320

Please make sure the cables and optics are removed.

And created:
https://netbox.wikimedia.org/dcim/cables/5707/
https://netbox.wikimedia.org/dcim/cables/5708/
https://netbox.wikimedia.org/dcim/cables/5709/
https://netbox.wikimedia.org/dcim/cables/5710/
@Jclark-ctr could you double check all of the above?

If all is good, you can close the task!

Mentioned in SAL (#wikimedia-operations) [2022-10-11T06:35:58Z] <XioNoX> delete now unused VC ports on asw2-c4-eqiad - T313384