OK, I believe these are the steps we want to follow:
PHASE 1: Drain cr2-codfw of traffic
1. Depool codfw in DNS
We could maybe skip this. It doesn't remove the requirement to keep the site online, but it may help if something goes wrong.
sudo cookbook sre.dns.admin -t T393552 -r "being cautious during maintenance on cr2-codfw" depool codfw
2. Save new recovery snapshots on cr2-codfw
request system snapshot re0
request system snapshot re1
3. Drain cr2-codfw transport circuits by increasing OSPF cost
We set the status of each of the following to "drained" in Netbox:
https://netbox.wikimedia.org/circuits/circuits/29/
https://netbox.wikimedia.org/circuits/circuits/103/
https://netbox.wikimedia.org/circuits/circuits/50/
Then run Homer against the following to apply the updated metrics:
cr2-codfw cr2-eqord cr2-eqiad cr2-eqdfw
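To illustrate why raising the OSPF cost drains a circuit, here is a minimal sketch of lowest-cost path selection. The topology, the `cr1-eqiad` node and every cost value are hypothetical and chosen only for illustration; the real metrics come from Netbox via Homer.

```python
import heapq

def make_graph(links):
    """Build an undirected adjacency map from {(a, b): cost}."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def shortest_path(graph, src, dst):
    """Return (cost, path) of the lowest-cost path (Dijkstra), or None."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, link_cost in graph[node].items():
            if neighbour not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None

# Hypothetical links and costs, NOT the real network metrics.
links = {
    ("cr1-codfw", "cr2-codfw"): 10,
    ("cr2-codfw", "cr2-eqiad"): 300,
    ("cr1-codfw", "cr1-eqiad"): 340,
    ("cr1-eqiad", "cr2-eqiad"): 10,
}

before = shortest_path(make_graph(links), "cr1-codfw", "cr2-eqiad")

# "Drain" the transport circuit by giving it a very high cost.
drained = dict(links)
drained[("cr2-codfw", "cr2-eqiad")] = 10000
after = shortest_path(make_graph(drained), "cr1-codfw", "cr2-eqiad")

print(before)  # path via cr2-codfw, cost 310
print(after)   # path via cr1-eqiad, cost 350
```

The drained link stays up and usable as a last resort, which is why this is preferred over simply disabling the circuit.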
4. Drain cr2-codfw links controlled by BGP with graceful-shutdown config
set protocols bgp graceful-shutdown sender
This will reduce local-pref of all routes learnt, and add the graceful-shutdown sender community to routes announced so peers don't send traffic to the router.
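As a rough illustration of the receiving side, the sketch below models a peer that honours the well-known GRACEFUL_SHUTDOWN community (65535:0, RFC 8326) by dropping local-pref to 0, so that any alternative path wins. The `Route` class, the prefix and the default local-pref of 100 are assumptions for the example, not a description of any particular peer's policy.

```python
from dataclasses import dataclass, field

GRACEFUL_SHUTDOWN = "65535:0"  # well-known community defined in RFC 8326

@dataclass
class Route:
    prefix: str
    next_hop: str
    local_pref: int = 100  # assumed default for the example
    communities: set = field(default_factory=set)

def receive(route):
    """Receiver policy: de-prefer routes tagged with GRACEFUL_SHUTDOWN."""
    if GRACEFUL_SHUTDOWN in route.communities:
        route.local_pref = 0
    return route

def best(routes):
    """Pick the route with the highest local-pref (first step of best-path)."""
    return max(routes, key=lambda r: r.local_pref)

# Hypothetical prefix learnt from both routers; cr2-codfw tags its announcement.
candidates = [
    receive(Route("192.0.2.0/24", "cr1-codfw")),
    receive(Route("192.0.2.0/24", "cr2-codfw", communities={GRACEFUL_SHUTDOWN})),
]
chosen = best(candidates)
print(chosen.next_hop)  # cr1-codfw
```

Peers that ignore the community keep their normal local-pref, which is why the manual group shutdown in step 7 is still useful.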
5. Watch traffic through cr2-codfw on Grafana/LibreNMS panels
It should drain to basically zero (save a little control-plane activity)
https://grafana.wikimedia.org/goto/N7jjyCbHR
6. Add downtime for cr2-codfw and connected devices:
sudo cookbook sre.hosts.downtime --hours 3 -t T393552 -r "replace cr2-codfw switch control boards and install new line card" --force "cr2-codfw, cr2-codfw IPv6, re0.cr2-codfw.mgmt, cr1-codfw, cr2-eqord, cr2-eqiad, cr2-eqdfw, ssw1-d8-codfw.mgmt, ssw1-a8-codfw.mgmt, pfw1-codfw, cloudsw1-b1-codfw.mgmt"
7. Shut down cr2-codfw external BGP groups manually
Not strictly required, as graceful-shutdown mostly handles it, but it's probably needed for peering sessions in particular, as not all peers may support the graceful-shutdown community. We should have very little remaining traffic on the box before we do this, and especially no traffic from cr1-codfw (i.e. CR1 does not prefer any routes from CR2).
set protocols bgp group Transit4 shutdown
set protocols bgp group Transit6 shutdown
set protocols bgp group Private-Peer4 shutdown
set protocols bgp group Private-Peer6 shutdown
set protocols bgp group IX4 shutdown
set protocols bgp group IX6 shutdown
set protocols bgp group Cloud shutdown
set protocols bgp group Switch shutdown
PHASE 2: Remove RE1 and SCB from cr2-codfw
8. Deactivate graceful-switchover / redundancy between the REs
deactivate chassis redundancy graceful-switchover
9. Save a new rescue config with graceful-switchover disabled
request system configuration rescue save
10. Shut down the backup RE (RE1)
request vmhost halt re1
Then check it shows as offline:
show chassis routing-engine
11. Take the SCB in device slot 1 offline
request chassis cb slot 1 offline
12. (DC-Ops) Remove RE1 from the chassis.
This is the RE in slot 1. Neither the 'ONLINE' nor the 'MASTER' LED should be lit on the card. Proceed to safely remove it from the SCB card as described here:
13. (DC-Ops) Remove the SCB1 card from the chassis
The (now-empty) SCB card can be removed from the chassis:
https://www.juniper.net/documentation/us/en/hardware/mx480/topics/topic-map/mx480-maintain-scbs.html
PHASE 3: Add new SCBE3-MX card and re-add RE1 on cr2-codfw
14. (DC-Ops) Insert SCBE3-MX into newly emptied slot 1
Insert one of the new SCBE3-MX cards into the newly emptied slot
15. (DC-Ops) Insert the routing-engine into the new SCBE3-MX card
Place the RE that was removed in the previous step into the newly installed control board
16. Bring RE1 back online and check status
request vmhost power-on other-routing-engine
show chassis routing-engine
PHASE 4: Remove RE0 and SCB from cr2-codfw
17. Make RE1 the master routing engine
This is a potentially risky operation if the switchover somehow fails or state isn't fully synced.
The router is not in the traffic path at this point, so we don't have to worry too much.
request chassis routing-engine master switch
show chassis routing-engine
18. Take RE0 offline
request vmhost halt re0
19. Take SCB card in slot 0 offline
request chassis cb slot 0 offline
20. (DC-Ops) Remove RE0 from the chassis.
This is the RE in slot 0. Neither the 'ONLINE' nor the 'MASTER' LED should be lit on the card. Proceed to safely remove it from the SCB card as described here:
21. (DC-Ops) Remove the SCB0 card from the chassis
The (now-empty) SCB card can be removed from the chassis:
https://www.juniper.net/documentation/us/en/hardware/mx480/topics/topic-map/mx480-maintain-scbs.html
PHASE 5: Add new SCBE3-MX card and re-add RE0 in cr2-codfw
22. (DC-Ops) Insert SCBE3-MX into newly emptied slot 0
Insert one of the new SCBE3-MX cards into the newly emptied slot
23. (DC-Ops) Insert the routing-engine into the new SCBE3-MX card
Place the RE that was removed in the previous step into the newly installed control board
24. Bring RE0 back online and check status
request vmhost power-on other-routing-engine
show chassis routing-engine
25. Make RE0 the master routing engine again
request chassis routing-engine master switch
show chassis routing-engine
26. Re-enable graceful failover
activate chassis redundancy graceful-switchover
request system configuration rescue save
PHASE 6: Install new MPC10E-10C card in cr2-codfw
27. (DC-Ops) Install new MPC10E-10C card in cr2-codfw slot 0
Install the new MPC10E-10C card into the first empty slot on the device (third up from bottom, above the RE1 card).
Once installed, we need to check that it is detected in the system:
show chassis hardware
show chassis fpc 0 detail
We then need to bring it online:
request chassis fpc online slot 0
show chassis fpc 0 detail
At this point we need to add the license for the card to the system. I believe this should only be a matter of activating the license (using the code in the mail from Rob) on the Juniper portal, then adding it with set system license keys key. There are no existing software licenses on the device, as all existing hardware was under the old system. We will also need to check that our automation works OK with this. Licensing is on an honour system, so this should not be a show-stopper, but it's hard to be 100% sure until we have things in the device.
PHASE 7: Move connections from MPC7E card to MPC10E-10C
28. Configure new card for correct line speeds
To make this easier we re-use the numbering from the MPC7E card where possible.
set chassis fpc 0 pic 0 port 1 number-of-sub-ports 4
set chassis fpc 0 pic 0 port 1 speed 10g
set chassis fpc 0 pic 0 port 2 speed 100g
set chassis fpc 0 pic 0 port 4 number-of-sub-ports 4
set chassis fpc 0 pic 0 port 4 speed 10g
set chassis fpc 0 pic 1 port 0 speed 40g
set chassis fpc 0 pic 1 port 1 number-of-sub-ports 4
set chassis fpc 0 pic 1 port 1 speed 10g
set chassis fpc 0 pic 1 port 2 speed 100g
set chassis fpc 0 pic 1 port 3 speed 40g
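If it helps to review the port plan as data rather than as a list of commands, the sketch below regenerates the same set commands from a small table. `PORT_PLAN` and `chassis_commands` are hypothetical helper names made up for this example and are not part of Homer or any existing tooling.

```python
# (pic, port, speed, sub_ports) for the new card; sub_ports is None when
# the port is not broken out. Values are copied from the commands above.
PORT_PLAN = [
    (0, 1, "10g", 4),
    (0, 2, "100g", None),
    (0, 4, "10g", 4),
    (1, 0, "40g", None),
    (1, 1, "10g", 4),
    (1, 2, "100g", None),
    (1, 3, "40g", None),
]

def chassis_commands(fpc, plan):
    """Render the per-port 'set chassis' commands for one FPC."""
    commands = []
    for pic, port, speed, sub_ports in plan:
        base = f"set chassis fpc {fpc} pic {pic} port {port}"
        if sub_ports:
            commands.append(f"{base} number-of-sub-ports {sub_ports}")
        commands.append(f"{base} speed {speed}")
    return commands

commands = chassis_commands(0, PORT_PLAN)
print("\n".join(commands))
```

Keeping the plan in one table makes it easier to spot a port whose speed or breakout setting doesn't match the cabling table in the next step.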
29. (DC-Ops) Move physical fibres from old card to new one at a time
We need to move each of these links. The basic procedure should be:
- Move the fibre/optic module from one line card to the other
- Rename the old interface(s) in Netbox to match the new port
- Create a new, disabled interface of the right type in Netbox with the old name
- Run Homer to move the configuration from one port to the other
- Validate the link comes up, and we can ping the other side
| Old Port | New Port | Description |
|---|---|---|
| 1/0/1 | 0/0/1 | 4x10G SM Breakout to first 4 ports on codfw_a8_smf_panel_1 module 1 |
| 1/0/2 | 0/0/2 | 100G-SR4 multicore/MPO connection to ssw1-d8-codfw |
| 1/0/4 | 0/0/4 | 4x10G SM Breakout with direct connections to cloudsw1-b1 and ssw3-a8 |
| 1/1/0 | ---- | There is a 40GBase-SR4 QSFP+ module in this port which is unused, it should be moved to spares |
| 1/1/1 | 0/1/1 | 4x10G SM Breakout to first 4 ports on codfw_a8_smf_panel_1 module 2 |
| 1/1/3 | ---- | There is a 40GBase-SR4 QSFP+ module in this port which is unused, it should be moved to spares |
| 1/1/5 | 0/1/2 | 100G-SR4 multicore/MPO connection to ssw1-a8-codfw |
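For the Netbox renames, the table above can be expanded into a per-interface list, including the four channels of each breakout port. This is only a sketch: `rename_plan` is a hypothetical helper, the `:0` to `:3` channel suffix is assumed from the usual Junos naming for channelized ports, and the interface type prefixes are deliberately left out.

```python
# Old port -> new port, copied from the table (unused ports omitted).
PORT_MAP = {
    "1/0/1": "0/0/1",
    "1/0/2": "0/0/2",
    "1/0/4": "0/0/4",
    "1/1/1": "0/1/1",
    "1/1/5": "0/1/2",
}

# Ports described as 4x10G breakouts in the table.
BREAKOUT_PORTS = {"1/0/1", "1/0/4", "1/1/1"}

def rename_plan(port_map, breakout_ports, channels=4):
    """Return (old, new) pairs, one per channel for breakout ports."""
    plan = []
    for old, new in port_map.items():
        if old in breakout_ports:
            # Assumed naming: one sub-interface per channel, suffixed ':n'.
            plan.extend((f"{old}:{ch}", f"{new}:{ch}") for ch in range(channels))
        else:
            plan.append((old, new))
    return plan

plan = rename_plan(PORT_MAP, BREAKOUT_PORTS)
for old, new in plan:
    print(f"{old} -> {new}")
```

Note that 1/1/5 is the only port whose number changes (to 0/1/2), so it deserves an extra check when renaming.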
30. (DC-Ops) Remove FPC 1 (MPC7E) from slot 1 of cr2-codfw
All the interfaces should be in "DISABLED" mode by now on the card.
request chassis fpc offline slot 1
show chassis fpc detail 1
In configuration mode:
set chassis fpc 1 power off
commit
PHASE 8: Bring CR2 back into traffic path
31. Re-enable cr2-codfw external BGP groups
These should be done one at a time, and we need to check that sessions come back up afterwards.
For IBGP and transit sessions we need to wait until full tables are received and convergence takes place. We also need to be patient with the peering ones given the number of peers in each.
delete protocols bgp group Transit4 shutdown
delete protocols bgp group Transit6 shutdown
delete protocols bgp group Private-Peer4 shutdown
delete protocols bgp group Private-Peer6 shutdown
delete protocols bgp group IX4 shutdown
delete protocols bgp group IX6 shutdown
delete protocols bgp group Cloud shutdown
delete protocols bgp group Switch shutdown
32. Un-drain transport circuits so CR2 is an active path for traffic again
Set these back to 'default' in Netbox:
https://netbox.wikimedia.org/circuits/circuits/29/
https://netbox.wikimedia.org/circuits/circuits/103/
https://netbox.wikimedia.org/circuits/circuits/50/
Then run Homer against the following to apply the updated metrics:
cr2-codfw cr2-eqord cr2-eqiad cr2-eqdfw
33. Remove the "graceful-shutdown" command so BGP-controlled links carry traffic again
delete protocols bgp graceful-shutdown sender
34. Watch the graphs and wait for traffic levels to return to where they had been
https://grafana.wikimedia.org/goto/N7jjyCbHR
35. Save new snapshots and rescue configuration on cr2-codfw
request system snapshot re0
request system snapshot re1
request system configuration rescue save
PHASE 9: Move MPC7E card from cr2-codfw slot 1 to cr1-codfw slot 3
36. (DC-Ops) Install MPC7E removed from CR2 into slot 3 of cr1-codfw
Do we need to drain this router of traffic before doing so?
37. Save new snapshots and rescue configuration on cr1-codfw
request system snapshot re0
request system snapshot re1
request system configuration rescue save
PHASE 10: Follow on actions
38. Modify Netbox inventory items to match the new setup
We need to adjust the inventory items for both routers, adding the new card on cr2-codfw while removing the old, then adding the moved one on cr1-codfw.