
codfw: setup MPC10E-10C and SCBE3
Closed, Resolved · Public

Description

OK, I believe these are the steps we want to follow:

PHASE 1: Drain cr2-codfw of traffic

1. Depool codfw in DNS

We could arguably skip this; it doesn't remove the requirement to keep the site online, but it may help if something goes wrong.

sudo cookbook sre.dns.admin -t T393552 -r "being cautious during maintenance on cr2-codfw" depool codfw

2. Save new recovery snapshots on cr2-codfw

request system snapshot re0
request system snapshot re1

3. Drain cr2-codfw transport circuits by increasing OSPF cost

We set the status of each of the following to "drained" in Netbox:

https://netbox.wikimedia.org/circuits/circuits/29/
https://netbox.wikimedia.org/circuits/circuits/103/
https://netbox.wikimedia.org/circuits/circuits/50/

Then run Homer against the following to apply the updated metrics:

cr2-codfw
cr2-eqord
cr2-eqiad
cr2-eqdfw

4. Drain cr2-codfw links controlled by BGP with graceful-shutdown config

set protocols bgp graceful-shutdown sender

This will reduce the local-pref of all routes learnt, and add the graceful-shutdown sender community to routes announced, so peers stop sending traffic to the router.
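For reference, Junos implements this with the well-known GRACEFUL_SHUTDOWN community 65535:0 defined in RFC 8326; receivers that honour it drop local-pref for the tagged routes. A minimal Python sketch (illustrative only, not part of the procedure) of how such a standard community packs into its 32-bit wire form:

```python
# Standard BGP communities are 32 bits: high 16 = ASN, low 16 = value.
# GRACEFUL_SHUTDOWN is the well-known community 65535:0 (RFC 8326).

def encode_community(asn: int, value: int) -> int:
    """Pack an 'ASN:value' standard community into its 32-bit wire form."""
    return (asn << 16) | value

def decode_community(raw: int) -> str:
    """Render a 32-bit community back as 'ASN:value'."""
    return f"{raw >> 16}:{raw & 0xFFFF}"

GRACEFUL_SHUTDOWN = encode_community(65535, 0)  # 0xFFFF0000
```

So graceful-shutdown is ordinary community signalling; peers that ignore the community are covered by the manual group shutdowns in step 7.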

5. Watch traffic through cr2-codfw on Grafana/LibreNMS panels

It should drain to basically zero, save for a little control-plane activity:

https://grafana.wikimedia.org/goto/N7jjyCbHR

6. Add downtime for cr2-codfw and connected devices:

sudo cookbook sre.hosts.downtime --hours 3 -t T393552 -r "replace cr2-codfw switch control boards and install new line card" --force "cr2-codfw, cr2-codfw IPv6, re0.cr2-codfw.mgmt, cr1-codfw, cr2-eqord, cr2-eqiad, cr2-eqdfw, ssw1-d8-codfw.mgmt, ssw1-a8-codfw.mgmt, pfw1-codfw, cloudsw1-b1-codfw.mgmt"

7. Shut down cr2-codfw external BGP groups manually

Not strictly required, as graceful-shutdown mostly handles it, but it's probably needed for peering sessions in particular, since not all peers may support the graceful-shutdown community. We should have very little remaining traffic on the box before we do this, and especially no traffic from cr1-codfw (i.e. CR1 does not prefer any routes from CR2).

set protocols bgp group Transit4 shutdown
set protocols bgp group Transit6 shutdown
set protocols bgp group Private-Peer4 shutdown
set protocols bgp group Private-Peer6 shutdown
set protocols bgp group IX4 shutdown
set protocols bgp group IX6 shutdown
set protocols bgp group Cloud shutdown
set protocols bgp group Switch shutdown
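
Once the groups are shut, it's worth confirming the sessions have actually gone down. These are suggested standard Junos checks, not part of the original plan:

show bgp summary
show bgp group Transit4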

PHASE 2: Remove RE1 and SCB from cr2-codfw

8. Deactivate graceful-switchover / redundancy between the REs

deactivate chassis redundancy graceful-switchover

9. Save a new rescue config with graceful-switchover disabled

request system configuration rescue save

10. Shut down the backup RE (RE1)

request vmhost halt re1

Then check it shows as offline:

show chassis routing-engine

11. Take the SCB in device slot 1 offline

request chassis cb slot 1 offline

12. (DC-Ops) Remove RE1 from the chassis.

This is the RE in slot 1. Neither the 'ONLINE' nor the 'MASTER' LED should be lit on the card. Proceed to safely remove it from the SCB card as described here:

https://www.juniper.net/documentation/us/en/hardware/mx480/topics/topic-map/mx480-maintain-host-subsystem.html#id-replacing-an-mx480-routing-engine

13. (DC-Ops) Remove the SCB1 card from the chassis

The (now-empty) SCB card can be removed from the chassis:

https://www.juniper.net/documentation/us/en/hardware/mx480/topics/topic-map/mx480-maintain-scbs.html

PHASE 3: Add new SCBE3-MX card and re-add RE1 on cr2-codfw

14. (DC-Ops) Insert SCBE3-MX into newly emptied slot 1

Insert one of the new SCBE3-MX cards into the newly emptied slot

15. (DC-Ops) Insert the routing-engine into the new SCBE3-MX card

Place the RE that was removed in the previous step into the newly installed control board

16. Bring RE1 back online and check status

request vmhost power-on other-routing-engine
show chassis routing-engine

PHASE 4: Remove RE0 and SCB from cr2-codfw

17. Make RE1 the master routing engine

This is a potentially risky operation if the switchover somehow fails or state isn't fully synced. The router is not in the traffic path, however, so we don't have to worry overly.

request chassis routing-engine master switch
show chassis routing-engine

18. Take RE0 offline

request vmhost halt re0

19. Take SCB card in slot 0 offline

request chassis cb slot 0 offline

20. (DC-Ops) Remove RE0 from the chassis.

This is the RE in slot 0. Neither the 'ONLINE' nor the 'MASTER' LED should be lit on the card. Proceed to safely remove it from the SCB card as described here:

https://www.juniper.net/documentation/us/en/hardware/mx480/topics/topic-map/mx480-maintain-host-subsystem.html#id-replacing-an-mx480-routing-engine

21. (DC-Ops) Remove the SCB0 card from the chassis

The (now-empty) SCB card can be removed from the chassis:

https://www.juniper.net/documentation/us/en/hardware/mx480/topics/topic-map/mx480-maintain-scbs.html

PHASE 5: Add new SCBE3-MX card and re-add RE0 in cr2-codfw

22. (DC-Ops) Insert SCBE3-MX into newly emptied slot 0

Insert one of the new SCBE3-MX cards into the newly emptied slot

23. (DC-Ops) Insert the routing-engine into the new SCBE3-MX card

Place the RE that was removed in the previous step into the newly installed control board

24. Bring RE0 back online and check status

request vmhost power-on other-routing-engine
show chassis routing-engine

25. Make RE0 the master routing engine again

request chassis routing-engine master switch
show chassis routing-engine

26. Re-enable graceful failover

activate chassis redundancy graceful-switchover
request system configuration rescue save

PHASE 6: Install new MPC10E-10C card in cr2-codfw

27. (DC-Ops) Install new MPC10E-10C card in cr2-codfw slot 0

Install the new MPC10E-10C card into the first empty slot on the device (third up from bottom, above the RE1 card)

https://www.juniper.net/documentation/us/en/hardware/mx480/topics/topic-map/mx480-maintain-interface-modules.html#id-replacing-an-mx480-mpc

Once installed, we need to check it is detected in the system:

show chassis hardware
show chassis fpc 0 detail

We then need to bring it online:

request chassis fpc online slot 0
show chassis fpc 0 detail

At this point we need to add the license for the card to the system. I believe this should only be a matter of activating the license (using the code in the mail from Rob) on the Juniper portal, then adding it with set system license keys key. There are no existing software licenses on the device, as all the existing hardware was under the old system. We will also need to check that our automation for this works OK. It's an honour system, so this should not be a show-stopper, but it's hard to be 100% sure until we have things in the device.
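
Assuming the standard Junos licensing workflow, the current state can be checked and the new key added roughly as follows (the key string here is a placeholder; the real one comes from the portal activation):

show system license

Then in configuration mode:

set system license keys key "<key-string>"
commit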

PHASE 7: Move connections from MPC7E card to MPC10E-10C

28. Configure new card for correct line speeds

To make this easier we re-use the numbering from the MPC7E card where possible.

set chassis fpc 0 pic 0 port 1 number-of-sub-ports 4
set chassis fpc 0 pic 0 port 1 speed 10g
set chassis fpc 0 pic 0 port 2 speed 100g
set chassis fpc 0 pic 0 port 4 number-of-sub-ports 4
set chassis fpc 0 pic 0 port 4 speed 10g
set chassis fpc 0 pic 1 port 0 speed 40g
set chassis fpc 0 pic 1 port 1 number-of-sub-ports 4
set chassis fpc 0 pic 1 port 1 speed 10g
set chassis fpc 0 pic 1 port 2 speed 100g
set chassis fpc 0 pic 1 port 3 speed 40g
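
Given the config above, Junos will create xe- interfaces for the 10G breakout channels (with a ':N' channel suffix) and et- interfaces for the 40G/100G ports. A small hypothetical Python helper, just to sanity-check the interface names we should expect to see:

```python
# Derive the interface names Junos creates from a chassis port-speed
# config line: xe- for 10G, et- for 40G/100G, ':N' for breakout channels.

def interface_names(fpc: int, pic: int, port: int,
                    speed: str, sub_ports: int = 1) -> list[str]:
    prefix = "xe" if speed == "10g" else "et"
    base = f"{prefix}-{fpc}/{pic}/{port}"
    if sub_ports > 1:
        return [f"{base}:{ch}" for ch in range(sub_ports)]
    return [base]
```

For example, interface_names(0, 0, 1, "10g", 4) gives xe-0/0/1:0 through xe-0/0/1:3, and interface_names(0, 0, 2, "100g") gives et-0/0/2.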

29. (DC-Ops) Move physical fibres from the old card to the new one, one at a time

We need to move each of these links. The basic procedure should be:

  • Move the fibre/optic module from one line card to the other
  • Rename the old interface(s) in Netbox to match the new port
  • Create a new, disabled interface of the right type in Netbox with the old name
  • Run Homer to move the configuration from one port to the other
  • Validate the link comes up, and we can ping the other side
Old Port   New Port   Description
1/0/1      0/0/1      4x10G SM Breakout to first 4 ports on codfw_a8_smf_panel_1 module 1
1/0/2      0/0/2      100G-SR4 multicore/MPO connection to ssw1-d8-codfw
1/0/4      0/0/4      4x10G SM Breakout with direct connections to cloudsw1-b1 and ssw3-a8
1/1/0      ----       Unused 40GBase-SR4 QSFP+ module in this port; it should be moved to spares
1/1/1      0/1/1      4x10G SM Breakout to first 4 ports on codfw_a8_smf_panel_1 module 2
1/1/3      ----       Unused 40GBase-SR4 QSFP+ module in this port; it should be moved to spares
1/1/5      0/1/2      100G-SR4 multicore/MPO connection to ssw1-a8-codfw
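
The Netbox renames can be derived mechanically from the port table above. A hypothetical Python sketch of the old-to-new name mapping (the actual updates would still be made in Netbox and applied via Homer):

```python
# Old (FPC 1 / MPC7E) to new (FPC 0 / MPC10E-10C) port mapping, taken
# directly from the move table; the two unused QSFP+ ports go to spares.
PORT_MOVES = {
    "1/0/1": "0/0/1",
    "1/0/2": "0/0/2",
    "1/0/4": "0/0/4",
    "1/1/1": "0/1/1",
    "1/1/5": "0/1/2",
}

def renamed(interface: str) -> str:
    """Rewrite e.g. 'xe-1/0/1:0' to 'xe-0/0/1:0', keeping any channel suffix."""
    prefix, _, rest = interface.partition("-")
    port, _, channel = rest.partition(":")
    new = PORT_MOVES[port]
    return f"{prefix}-{new}:{channel}" if channel else f"{prefix}-{new}"
```

This makes it easy to eyeball the full rename list for each breakout channel before touching Netbox.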

30. (DC-Ops) Remove FPC 1 (MPC7E) from slot 1 of cr2-codfw

All the interfaces on the card should be in "DISABLED" mode by now.

request chassis fpc offline slot 1
show chassis fpc 1 detail

In configuration mode:

set chassis fpc 1 power off
commit

PHASE 8: Bring CR2 back into traffic path

31. Re-enable cr2-codfw external BGP groups

These should be done one at a time, and we need to check that sessions come back up afterwards.

For IBGP and transit sessions we need to wait until full tables are received and convergence takes place. We also need to be patient with the peering ones given the number of peers in each.

activate protocols bgp group Transit4
activate protocols bgp group Transit6
activate protocols bgp group Private-Peer4
activate protocols bgp group Private-Peer6
activate protocols bgp group IX4
activate protocols bgp group IX6
activate protocols bgp group Cloud
activate protocols bgp group Switch

32. Un-drain transport circuits so the paths via CR2 are active for traffic again

Set these back to 'default' in Netbox:

https://netbox.wikimedia.org/circuits/circuits/29/
https://netbox.wikimedia.org/circuits/circuits/103/
https://netbox.wikimedia.org/circuits/circuits/50/

Then run Homer against the following to apply the updated metrics:

cr2-codfw
cr2-eqord
cr2-eqiad
cr2-eqdfw

33. Remove the "graceful-shutdown" config to return traffic handling to normal

delete protocols bgp graceful-shutdown sender

34. Watch the graphs and wait for traffic levels to return to where they had been

https://grafana.wikimedia.org/goto/N7jjyCbHR

35. Save new snapshots and rescue configuration on cr2-codfw

request system snapshot re0
request system snapshot re1
request system configuration rescue save

PHASE 9: Move MPC7E card from cr2-codfw slot 1 to cr1-codfw slot 3

36. (DC-Ops) Install MPC7E removed from CR2 into slot 3 of cr1-codfw

Do we need to drain this router of traffic before doing so?

37. Save new snapshots and rescue configuration on cr1-codfw

request system snapshot re0
request system snapshot re1
request system configuration rescue save

PHASE 10: Follow-on actions

38. Modify Netbox inventory items to match the new setup

We need to adjust the inventory items for both routers, adding the new card on cr2-codfw while removing the old, then adding the moved one on cr1-codfw.

Event Timeline

cmooney triaged this task as Medium priority. May 9 2025, 6:13 PM
cmooney updated the task description.

All the steps look good to me, thanks.

Mentioned in SAL (#wikimedia-operations) [2025-05-20T11:22:58Z] <cmooney@cumin1003> START - Cookbook sre.dns.admin DNS admin: depool site codfw [reason: being cautious during maintenance on codfw CRs, T393552]

Mentioned in SAL (#wikimedia-operations) [2025-05-20T11:25:23Z] <cmooney@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site codfw [reason: being cautious during maintenance on codfw CRs, T393552]

Icinga downtime and Alertmanager silence (ID=5afc68ed-eba5-4a71-b833-f809ae58201b) set by cmooney@cumin1003 for 4:00:00 on 11 host(s) and their services with reason: replace cr2-codfw switch control boards and install new line card

cloudsw1-b1-codfw.mgmt,cr[1-2]-codfw,cr2-codfw IPv6,cr2-eqdfw,cr2-eqiad,cr2-eqord,pfw1-codfw,re0.cr2-codfw.mgmt,ssw1-a8-codfw.mgmt,ssw1-d8-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2025-05-20T14:52:14Z] <topranks> shutting down backup RE1 on cr2-codfw (T393552)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T14:53:39Z] <topranks> shutting down control board 1 on cr2-codfw (T393552)

Icinga downtime and Alertmanager silence (ID=e24daea6-0330-4b79-bf33-b9e0f9709a10) set by cmooney@cumin1003 for 2:00:00 on 11 host(s) and their services with reason: replace cr2-codfw switch control boards and install new line card

cloudsw1-b1-codfw.mgmt,cr[1-2]-codfw,cr2-codfw IPv6,cr2-eqdfw,cr2-eqiad,cr2-eqord,pfw1-codfw,re0.cr2-codfw.mgmt,ssw1-a8-codfw.mgmt,ssw1-d8-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2025-05-20T17:21:08Z] <topranks> enable FPC 0 (10x100G) card in cr2-codfw (T393552)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T17:45:01Z] <topranks> moving links from old to new linecard cr2-codfw slot 1 to slot 0 T393552

Mentioned in SAL (#wikimedia-operations) [2025-05-20T18:21:14Z] <cmooney@cumin1003> START - Cookbook sre.dns.admin DNS admin: pool site codfw [reason: repool codfw after core router maintenance, T393552]

Mentioned in SAL (#wikimedia-operations) [2025-05-20T18:21:18Z] <cmooney@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site codfw [reason: repool codfw after core router maintenance, T393552]

Mentioned in SAL (#wikimedia-operations) [2025-05-20T18:21:39Z] <topranks> repool codfw in dns after core router maintenance T393552

cmooney claimed this task.
cmooney added subscribers: Jhancock.wm, cmooney.

OK, this is now complete. A few niggles along the way were sorted out with multiple re-seats of the SCB/RE/RE disks, but we got it over the line in the end.

Big thanks to @Papaul and @Jhancock.wm on site for the work!!

Actually, there are a few bits still to be completed, like the license and the inventory items in Netbox, which I'll take care of in the morning.

License is now applied and inventory items updated for cr1-codfw and cr2-codfw.