
codfw row C/D upgrade racking task
Open, Medium, Public

Description

This task will track the racking, setup, and OS installation of new spine/leaf switches for codfw ordered via T348062

netops will need to provide a detailed cabling assignment/layout for the purchase of cables via T360671; that same cable diagram can be used here.

Hostname / Racking / Installation Details

This section will need to be filled out by ops-codfw detailing the process and checklist for the upgrade of rows C and D.

Switch | racked | power | mgmt | ZTP | console | ssw1-d1-leaf link | ssw1-d8-leaf link | cr1-link | cr2-link | ssw1-a1-link | ssw1-a8-link
ssw1-d1
ssw1-d8
lsw1-c1
lsw1-c2
lsw1-c3
lsw1-c4
lsw1-c5
lsw1-c6
lsw1-c7
lsw1-d1
lsw1-d2
lsw1-d3
lsw1-d4
lsw1-d5
lsw1-d6
lsw1-d7
lsw1-d8

The cabling will be the same as row A and B. We will be using port 55 on each leaf switch to connect to ssw1-d1 and port 54 to connect to ssw1-d8; et-0/0/31 on each spine will be connected to cr*(1/2)
@ayounsi @cmooney please provide the interfaces to use on cr* for the uplink from ssw1-d1 to cr1 and ssw1-d8 to cr2
et-0/0/30 on each spine will be used to connect the spines in rows C and D to the spines in rows A and B.
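
The leaf uplink rule above can be enumerated mechanically, which may help when cross-checking the cable list. This is only a minimal sketch: the leaf names come from the table in this task, while the function name `leaf_uplinks` and the tuple layout are illustrative assumptions. The spine-side port for each leaf is not stated here, so it is left out.

```python
# Leaf switches listed in this task (rows C and D).
LEAVES = [f"lsw1-c{n}" for n in range(1, 8)] + [f"lsw1-d{n}" for n in range(1, 9)]

# Rule from the description: leaf port 55 -> ssw1-d1, leaf port 54 -> ssw1-d8.
UPLINKS = {55: "ssw1-d1", 54: "ssw1-d8"}


def leaf_uplinks(leaves=LEAVES, uplinks=UPLINKS):
    """Return (leaf, leaf_port, spine) tuples, one per leaf-to-spine cable."""
    return [(leaf, port, spine) for leaf in leaves for port, spine in uplinks.items()]


if __name__ == "__main__":
    for leaf, port, spine in leaf_uplinks():
        print(f"{leaf} port {port} -> {spine}")
```

With the 15 leaves listed in the table this yields 30 leaf-to-spine links, two per leaf.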

Related Objects

Status | Subtype | Assigned | Task
OpenNone
OpenNone

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH added parent tasks: Unknown Object (Task), Unknown Object (Task). (Mar 22 2024, 3:29 PM)
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.

@ayounsi @cmooney please provide the interfaces to use on cr* for the uplink from ssw1-d1 to cr1 and ssw1-d8 to cr2

The CR links will be as follows:

Spine | Port | CR | Port
ssw1-d1-codfw | et-0/0/31 | cr1-codfw | et-1/1/2
ssw1-d8-codfw | et-0/0/31 | cr2-codfw | et-1/0/2

You can go ahead and cable these. However, we won't be able to bring up the link on CR1 until we have moved the asw-c-codfw and asw-d-codfw links away from the MPC7E card; see the plan below:

https://docs.google.com/spreadsheets/d/1-5fzirhBtlTSQetv6iWDyCQfHNAF0V9ca5nftDGzrv0

I'll create a task for the move of these asw uplinks with instructions. It might actually make sense to move the asw uplinks to the new spines (like we did for rows A/B) instead of to the other CR card; I will detail this in the task.

et-0/0/30 on each spine will be used to connect spines in row D and C to spines in row A and B

We'll need two ports on each spine to connect to the two spines in the remote rows. Using et-0/0/30 as one of them seems fine. I'll map all the links out in Netbox and report back. Thanks.

netops will need to provide a detailed cabling assignment/layout for the purchase of cables via T360671; that same cable diagram can be used here.

@Papaul I've added all the links for the new switches in Netbox now:

https://netbox.wikimedia.org/dcim/cables/?color=&device_id=5234&device_id=5235&length=&length_unit=&q=

Please add the cable IDs and set the status to 'connected' when you are wiring them up. Thanks!
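
A quick way to see which cables still need attention is to filter exported cable records on those two fields. The sketch below is hedged: it assumes each record is a dict shaped roughly like a Netbox REST API cable object (`id`, `label`, `status.value`), and that the cable ID is stored in `label`. The function name `pending_cables` and the sample records are hypothetical.

```python
def pending_cables(cables):
    """Return (cable_id, reason) pairs for cables not yet recorded as wired up."""
    pending = []
    for cable in cables:
        reasons = []
        # Assumption: status is a nested object with a machine-readable "value".
        if (cable.get("status") or {}).get("value") != "connected":
            reasons.append("status is not 'connected'")
        # Assumption: the physical cable ID is stored in the "label" field.
        if not cable.get("label"):
            reasons.append("no cable ID label")
        if reasons:
            pending.append((cable["id"], "; ".join(reasons)))
    return pending


# Hypothetical sample records, for illustration only.
SAMPLE_CABLES = [
    {"id": 1, "label": "A100", "status": {"value": "connected"}},
    {"id": 2, "label": "", "status": {"value": "planned"}},
    {"id": 3, "label": "A102", "status": {"value": "planned"}},
]

if __name__ == "__main__":
    for cable_id, reason in pending_cables(SAMPLE_CABLES):
        print(f"cable {cable_id}: {reason}")
```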

Papaul updated the task description.

Change #1037651 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/homer/public@master] Add lsw1-c1 to devices.yaml to test homer after ZTP

https://gerrit.wikimedia.org/r/1037651

So I tested pushing with Homer to the devices in row D and it was pretty much successful :)

NOTE: The devices need some additional data from this patch, which I don't want to merge until the cabling is done. So I used a custom Homer config file pointing at a local copy of the public repo, with the changes from the patch applied, in my home dir on the cumin host. I couldn't run it directly from my laptop because the ZTP script only adds the public SSH key for the 'homer' user.

For some reason it failed on lsw1-d4-codfw. The generated config looked fine (correct mgmt IP, same as in Netbox and matching DNS, etc.), but after I tried to commit, the device became unreachable. The last lines of the attempt were as follows:

<---previous output cut--->
+  switch-options {
+      vtep-source-interface lo0.0;
+      route-distinguisher 10.192.252.29:64811;
+      vrf-import Evpn_rt_import;
+      vrf-target target:64811:9999;
+  }
-  vlans {
-      default {
-          vlan-id 1;
-          l3-interface irb.0;
-      }
-  }

Type "yes" to commit, "no" to abort.
> yes
INFO:homer.transports.junos:Committing the configuration on lsw1-d4-codfw.mgmt.codfw.wmnet
WARNING:homer.transports.junos:Unable to close the connection to the device: RpcTimeoutError(host: lsw1-d4-codfw.mgmt.codfw.wmnet, cmd: close, timeout: 60)
ERROR:homer:Attempt 1/3 failed: RpcTimeoutError(host: lsw1-d4-codfw.mgmt.codfw.wmnet, cmd: commit-configuration, timeout: 60)
ERROR:homer:Attempt 2/3 failed: Unable to connect to lsw1-d4-codfw.mgmt.codfw.wmnet
ERROR:homer:Attempt 3/3 failed: Unable to connect to lsw1-d4-codfw.mgmt.codfw.wmnet

@Papaul we'll probably need to get the serial connection working for lsw1-d4-codfw to rescue it / try and work out what happened.
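
The log above shows the commit RPC timing out on the first attempt and the two retries failing to connect at all, which fits the device dropping off the management network during the commit. A minimal sketch for pulling those attempt lines out of a saved log follows. The line format is taken only from the output pasted above and is not a documented Homer guarantee; `failed_attempts` is an illustrative name.

```python
import re

# Matches lines such as: ERROR:homer:Attempt 1/3 failed: <reason>
ATTEMPT_RE = re.compile(r"^ERROR:homer:Attempt (\d+)/(\d+) failed: (.*)$")


def failed_attempts(log_text):
    """Return (attempt, total, reason) for every failed-attempt line in the log."""
    results = []
    for line in log_text.splitlines():
        match = ATTEMPT_RE.match(line.strip())
        if match:
            results.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return results


SAMPLE_LOG = """\
INFO:homer.transports.junos:Committing the configuration on lsw1-d4-codfw.mgmt.codfw.wmnet
ERROR:homer:Attempt 1/3 failed: RpcTimeoutError(host: lsw1-d4-codfw.mgmt.codfw.wmnet, cmd: commit-configuration, timeout: 60)
ERROR:homer:Attempt 2/3 failed: Unable to connect to lsw1-d4-codfw.mgmt.codfw.wmnet
ERROR:homer:Attempt 3/3 failed: Unable to connect to lsw1-d4-codfw.mgmt.codfw.wmnet
"""

if __name__ == "__main__":
    for attempt, total, reason in failed_attempts(SAMPLE_LOG):
        print(f"attempt {attempt}/{total}: {reason}")
```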

On the others it worked fine and the config committed; I can ssh straight from my laptop:

cmooney@wikilap:~$ ssh lsw1-d2-codfw.mgmt
Last login: Fri May 31 10:41:12 2024 from 208.80.153.110
--- JUNOS 22.2R3.15 Kernel 64-bit  JNPR-12.1-20230303.4e45fe64_bui
{master:0}
cmooney@lsw1-d2-codfw> 
cmooney@lsw1-d2-codfw> show bgp summary 

Warning: License key missing; requires 'bgp' license

Threading mode: BGP I/O
Default eBGP mode: advertise - accept, receive - accept
Groups: 1 Peers: 2 Down peers: 2
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
bgp.evpn.0           
                       0          0          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
10.192.252.18         64811          0          0       0       0       18:37 Active
10.192.252.19         64811          0          0       0       0       18:37 Active

I'll need to get the licences added. Another good thing is that the lack of a TLS cert on the system (as we can't run the cookbook to create it) doesn't cause a problem when applying the config, even though the config includes gRPC settings that refer to a cert.
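
In the summary output above both peers sit in the Active state, meaning no session is established. The sketch below picks such peers out of saved `show bgp summary` text. It assumes peer rows start with an IP address and that an established peer shows either slash-separated counters (for example `0/0/0/0`) or `Establ` in the last column; `peers_not_established` is an illustrative name.

```python
import ipaddress


def peers_not_established(summary_text):
    """Return (peer, state) for peer rows whose last column is a non-established state."""
    down = []
    for line in summary_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            ipaddress.ip_address(fields[0])  # only peer rows start with an IP address
        except ValueError:
            continue
        state = fields[-1]
        # Assumption: established peers show counters like "0/0/0/0" or "Establ".
        if "/" not in state and state != "Establ":
            down.append((fields[0], state))
    return down


SAMPLE_SUMMARY = """\
Groups: 1 Peers: 2 Down peers: 2
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
bgp.evpn.0
                       0          0          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
10.192.252.18         64811          0          0       0       0       18:37 Active
10.192.252.19         64811          0          0       0       0       18:37 Active
"""

if __name__ == "__main__":
    for peer, state in peers_not_established(SAMPLE_SUMMARY):
        print(f"{peer}: {state}")
```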

@Papaul @Jhancock.wm I noticed that the leaf in rack d8 is reporting one of its power supplies down:

cmooney@lsw1-d8-codfw> show system alarms 
1 alarms currently active
Alarm time               Class  Description
2024-05-31 02:56:19 UTC  Major  FPC0: PEM 0 Not Powered

If you could take a look next time you're on site, that'd be great, thanks.
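
For checking several switches at once, alarm rows like the one above can be parsed from saved `show system alarms` output. This is a sketch under the assumption that each alarm row has the form "timestamp, class, description" as in the pasted output; `parse_alarms` is an illustrative name.

```python
import re

# Matches rows such as: 2024-05-31 02:56:19 UTC  Major  FPC0: PEM 0 Not Powered
ALARM_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \w+)\s+(\w+)\s+(.+)$")


def parse_alarms(output):
    """Return a list of dicts with the time, class and description of each alarm row."""
    alarms = []
    for line in output.splitlines():
        match = ALARM_RE.match(line.strip())
        if match:
            alarms.append(
                {"time": match.group(1), "class": match.group(2), "description": match.group(3)}
            )
    return alarms


SAMPLE_ALARMS = """\
1 alarms currently active
Alarm time               Class  Description
2024-05-31 02:56:19 UTC  Major  FPC0: PEM 0 Not Powered
"""

if __name__ == "__main__":
    for alarm in parse_alarms(SAMPLE_ALARMS):
        print(f"{alarm['class']}: {alarm['description']} (since {alarm['time']})")
```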

Change #1037651 merged by jenkins-bot:

[operations/homer/public@master] Add lsw1-c1 to devices.yaml to test homer after ZTP

https://gerrit.wikimedia.org/r/1037651

Reseated it. Should be good now.

@cmooney all good on lsw1-d4, lsw1-c2 and lsw1-d8

@cmooney all good on lsw1-d4, lsw1-c2 and lsw1-d8

Thanks!

Confirmed, all looks good. What was the situation with lsw1-d4-codfw? I pushed the full config again just now and it worked fine (I also did the same for c2).

I tried to bring up the ssw1-d8-codfw link to cr2-codfw, but it doesn't look ready. It is ticked in the table above, but on cr2-codfw there is no QSFP in 1/0/2 after the bounce. Is the cabling in place for that? It is still set to 'planned' in Netbox, so maybe I jumped the gun:

https://netbox.wikimedia.org/dcim/cables/7912/