Paste P64182

Spine switch eqiad upgrade steps

Authored by cmooney on Thu, Jun 6, 11:30 AM.
### LVS1017 cable move to leaf
1) Downtime lvs1017
sudo cookbook sre.hosts.downtime --minutes 30 -r "moving lvs1017 link to row E from spine to leaf" -t T366361 lvs1017.eqiad.wmnet
2) Disable PyBal on lvs1017
sudo systemctl stop pybal
3) Check that the traffic moves to lvs1020
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
4) DC-Ops move first cable
ssw1-e1-eqiad xe-0/0/32 --> lsw1-e1-eqiad xe-0/0/8
5) Validate that VLANs are reachable on lvs1017
lvs1017: ./check_vlans.sh
6) Re-enable PyBal on lvs1017
sudo systemctl start pybal
7) Check connections move back to lvs1017
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
8) Remove downtime for lvs1017
sudo cookbook sre.hosts.remove-downtime 'lvs1017.eqiad.wmnet'
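
The LVS cable-move sections all follow the same template; only the host, row and reason text change. A minimal sketch of a helper that prints (but does not run) the commands for one host is below. The function name `lvs_move_plan` is hypothetical, and the `cookbook` prefix on the remove-downtime line is assumed by analogy with the downtime command in step 1.

```shell
#!/usr/bin/env bash
# Print the command sequence for moving one LVS host's link.
# Nothing is executed; the output is meant to be reviewed and pasted by hand.
# Usage: lvs_move_plan HOST ROW DIRECTION_TEXT
lvs_move_plan() {
    local host="$1" row="$2" direction="$3"
    local fqdn="${host}.eqiad.wmnet"
    local task="T366361"   # task ID used throughout this paste
    printf 'sudo cookbook sre.hosts.downtime --minutes 30 -r "moving %s link to row %s %s" -t %s %s\n' \
        "$host" "$row" "$direction" "$task" "$fqdn"
    printf '%s: sudo systemctl stop pybal\n' "$host"
    printf '# DC-Ops move the cable, then:\n'
    printf '%s: ./check_vlans.sh\n' "$host"
    printf '%s: sudo systemctl start pybal\n' "$host"
    # Assumption: remove-downtime is invoked through "cookbook" like the downtime step.
    printf "sudo cookbook sre.hosts.remove-downtime '%s'\n" "$fqdn"
}

lvs_move_plan lvs1017 E "from spine to leaf"
```

The Grafana checks in steps 3 and 7 are left out on purpose, since they need a human to look at the dashboard.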
### LVS1018 cable move to leaf
1) Downtime lvs1018
sudo cookbook sre.hosts.downtime --minutes 30 -r "moving lvs1018 link to row E from spine to leaf" -t T366361 lvs1018.eqiad.wmnet
2) Disable PyBal on lvs1018
sudo systemctl stop pybal
3) Check that the traffic moves to lvs1020
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
4) DC-Ops move first cable
ssw1-e1-eqiad xe-0/0/33 --> lsw1-e1-eqiad xe-0/0/9
5) Validate that VLANs are reachable on lvs1018
lvs1018: ./check_vlans.sh
6) Re-enable PyBal on lvs1018
sudo systemctl start pybal
7) Check connections move back to lvs1018
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
8) Remove downtime for lvs1018
sudo cookbook sre.hosts.remove-downtime 'lvs1018.eqiad.wmnet'
### LVS1019 cable move to leaf
1) Downtime lvs1019
sudo cookbook sre.hosts.downtime --minutes 30 -r "moving lvs1019 link to row F from spine to leaf" -t T366361 lvs1019.eqiad.wmnet
2) Disable PyBal on lvs1019
sudo systemctl stop pybal
3) Check that the traffic moves to lvs1020
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
4) DC-Ops move first cable
ssw1-f1-eqiad xe-0/0/32 --> lsw1-f1-eqiad xe-0/0/8
5) Validate that VLANs are reachable on lvs1019
lvs1019: ./check_vlans.sh
6) Re-enable PyBal on lvs1019
sudo systemctl start pybal
7) Check connections move back to lvs1019
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
8) Remove downtime for lvs1019
sudo cookbook sre.hosts.remove-downtime 'lvs1019.eqiad.wmnet'
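
The `./check_vlans.sh` helper run in step 5 of each cable move is not included in this paste. As a rough idea of what such a check might do, the sketch below pings one target per VLAN interface and reports any that fail. Everything in it is an assumption: the function name, the `interface:target` argument format and the example values are placeholders, not the real script.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for check_vlans.sh; the real script is not shown in this paste.
# Each argument is "interface:target" (for example a VLAN sub-interface and its gateway).
PING="${PING:-ping}"   # overridable so the logic can be exercised without a network

check_vlans() {
    local failed=0 pair iface target
    for pair in "$@"; do
        iface="${pair%%:*}"    # text before the first colon
        target="${pair#*:}"    # text after the first colon (may itself contain colons)
        # -c 1: single probe, -W 1: one-second timeout, -I: send from this interface
        if "$PING" -c 1 -W 1 -I "$iface" "$target" >/dev/null 2>&1; then
            echo "OK   $iface -> $target"
        else
            echo "FAIL $iface -> $target"
            failed=1
        fi
    done
    return "$failed"
}

# Example call with placeholder values (not real interface names or addresses):
# check_vlans vlan100:192.0.2.1 vlan200:198.51.100.1
```

A non-zero exit status means at least one VLAN was unreachable, which would be the cue to stop before re-enabling PyBal.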
### SSW1-E1-EQIAD UPGRADE
1) Downtime EVPN devices in eqiad
sudo cookbook sre.hosts.downtime --minutes 90 -r "upgrading spine switches eqiad rows e and f" -t T366361 --force "lsw1-e1-eqiad.mgmt, lsw1-e2-eqiad.mgmt, lsw1-e3-eqiad.mgmt, lsw1-e5-eqiad.mgmt, lsw1-e6-eqiad.mgmt, lsw1-e7-eqiad.mgmt, lsw1-f1-eqiad.mgmt, lsw1-f2-eqiad.mgmt, lsw1-f3-eqiad.mgmt, lsw1-f5-eqiad.mgmt, lsw1-f6-eqiad.mgmt, lsw1-f7-eqiad.mgmt, ssw1-e1-eqiad, ssw1-e1-eqiad IPv6, ssw1-e1-eqiad.mgmt, ssw1-f1-eqiad, ssw1-f1-eqiad IPv6, ssw1-f1-eqiad.mgmt"
2) Disable BGP on cr1-eqiad towards ssw1-e1-eqiad
deactivate protocols bgp group Switch neighbor 10.66.0.9 description ssw1-e1-eqiad
deactivate protocols bgp group Switch neighbor 2620:0:861:fe07::2 description ssw1-e1-eqiad
3) Validate connectivity looks ok towards a server in a rack
cmooney@prometheus1005:~$ sudo traceroute -I 10.64.153.1
traceroute to 10.64.153.1 (10.64.153.1), 30 hops max, 60 byte packets
1 ae1-1017.cr1-eqiad.wikimedia.org (10.64.0.2) 2.646 ms 2.608 ms 2.620 ms
2 et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9) 10.600 ms 10.624 ms 10.616 ms
3 irb-1048.lsw1-e5-eqiad.eqiad.wmnet (10.64.153.1) 5.384 ms 5.376 ms 5.366 ms
4) Connect to ssw1-e1-eqiad over serial console
ssh scs-f8-eqiad.mgmt - port 29
5) Reload box with new software version
request system software add /var/tmp/jinstall-host-qfx-5e-x86-64-22.2R3.15-secure-signed.tgz reboot
6) Check things look ok on reboot
show ospf interface
show bgp sum
show route protocol evpn terse table PRODUCTION.evpn.0 | match " 5:"
show ethernet-switching mac-learning-log
show system alarms
request system configuration rescue save
request system storage cleanup
7) Re-enable BGP on cr1-eqiad towards ssw1-e1-eqiad
activate protocols bgp group Switch neighbor 10.66.0.9 description ssw1-e1-eqiad
activate protocols bgp group Switch neighbor 2620:0:861:fe07::2 description ssw1-e1-eqiad
8) Check traffic flow is good again:
cmooney@prometheus1005:~$ sudo traceroute -I 10.64.153.1
traceroute to 10.64.153.1 (10.64.153.1), 30 hops max, 60 byte packets
1 ae1-1017.cr1-eqiad.wikimedia.org (10.64.0.2) 2.646 ms 2.608 ms 2.620 ms
2 et-0-0-31-100.ssw1-e1-eqiad.eqiad.wmnet (10.66.0.9) 10.600 ms 10.624 ms 10.616 ms
3 irb-1048.lsw1-e5-eqiad.eqiad.wmnet (10.64.153.1) 5.384 ms 5.376 ms 5.366 ms
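
The device list passed to the downtime cookbook in step 1 follows a regular pattern, so it can be generated rather than typed by hand. A small sketch, assuming the rack numbers and name suffixes are exactly those that appear in step 1 (the function name `build_downtime_targets` is hypothetical):

```shell
#!/usr/bin/env bash
# Build the comma-separated downtime target list used in step 1.
build_downtime_targets() {
    local targets=() row rack
    for row in e f; do
        for rack in 1 2 3 5 6 7; do   # rack 4 does not appear in the step 1 list
            targets+=("lsw1-${row}${rack}-eqiad.mgmt")
        done
    done
    for row in e f; do
        # Each spine appears three times: plain name, "IPv6" entry, and .mgmt
        targets+=("ssw1-${row}1-eqiad" "ssw1-${row}1-eqiad IPv6" "ssw1-${row}1-eqiad.mgmt")
    done
    local IFS=,
    local joined="${targets[*]}"      # join array elements with commas
    printf '%s\n' "${joined//,/, }"   # add a space after each comma, as in step 1
}

build_downtime_targets
```

The output can be quoted and passed as the final argument to the `sre.hosts.downtime` command shown in step 1.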
### SSW1-F1-EQIAD UPGRADE
1) Disable BGP on cr2-eqiad towards ssw1-f1-eqiad
deactivate protocols bgp group Switch neighbor 10.66.0.11 description ssw1-f1-eqiad
deactivate protocols bgp group Switch neighbor 2620:0:861:fe08::2 description ssw1-f1-eqiad
2) Validate connectivity looks ok towards a server in a rack
cmooney@cumin1002:~$ sudo traceroute -I 10.64.149.1 -w 1
traceroute to 10.64.149.1 (10.64.149.1), 30 hops max, 60 byte packets
1 ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3) 0.563 ms 0.528 ms 0.520 ms
2 xe-0-0-0-1100.cloudsw1-d5-eqiad.eqiad.wmnet (10.64.147.15) 14.846 ms 14.840 ms 14.832 ms
3 irb-1124.cloudsw1-f4-eqiad.eqiad.wmnet (10.64.149.1) 1.482 ms 1.475 ms 1.469 ms
3) Connect to ssw1-f1-eqiad over serial console
ssh scs-f8-eqiad.mgmt - port 20
4) Reload box with new software version
request system software add /var/tmp/jinstall-host-qfx-5e-x86-64-22.2R3.15-secure-signed.tgz reboot
5) Check things look ok on reboot
show ospf interface
show bgp sum
show route protocol evpn terse table PRODUCTION.evpn.0 | match " 5:"
show ethernet-switching mac-learning-log
show system alarms
request system configuration rescue save
request system storage cleanup
6) Re-enable BGP on cr2-eqiad towards ssw1-f1-eqiad
activate protocols bgp group Switch neighbor 10.66.0.11 description ssw1-f1-eqiad
activate protocols bgp group Switch neighbor 2620:0:861:fe08::2 description ssw1-f1-eqiad
7) Validate connectivity looks ok towards a server in a rack
cmooney@cumin1002:~$ sudo traceroute -I 10.64.149.1 -w 1
traceroute to 10.64.149.1 (10.64.149.1), 30 hops max, 60 byte packets
1 ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3) 0.563 ms 0.528 ms 0.520 ms
2 xe-0-0-0-1100.cloudsw1-d5-eqiad.eqiad.wmnet (10.64.147.15) 14.846 ms 14.840 ms 14.832 ms
3 irb-1124.cloudsw1-f4-eqiad.eqiad.wmnet (10.64.149.1) 1.482 ms 1.475 ms 1.469 ms
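
The traceroute checks above are read by eye. If a scripted check is preferred, one option is to pull the hop names out of the output and test whether a given device is on the path. This is only a sketch: the helper names `hop_names` and `path_traverses` are made up here, and the sample text is the traceroute output already shown in this section.

```shell
#!/usr/bin/env bash
# Print the hostname of each hop from traceroute output read on stdin.
hop_names() {
    # Hop lines start with a hop number; the header line does not.
    awk '$1 ~ /^[0-9]+$/ { print $2 }'
}

# Succeed if any hop name contains the given device name.
path_traverses() {
    hop_names | grep -F -q -- "$1"
}

sample='traceroute to 10.64.149.1 (10.64.149.1), 30 hops max, 60 byte packets
1 ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3) 0.563 ms 0.528 ms 0.520 ms
2 xe-0-0-0-1100.cloudsw1-d5-eqiad.eqiad.wmnet (10.64.147.15) 14.846 ms 14.840 ms 14.832 ms
3 irb-1124.cloudsw1-f4-eqiad.eqiad.wmnet (10.64.149.1) 1.482 ms 1.475 ms 1.469 ms'

if printf '%s\n' "$sample" | path_traverses ssw1-f1-eqiad; then
    echo "path still goes via ssw1-f1-eqiad"
else
    echo "path avoids ssw1-f1-eqiad"
fi
```

In live use the `printf` of the sample would be replaced by the actual `sudo traceroute -I ...` command from the step being checked.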
### LVS1017 Move back
1) Downtime lvs1017
sudo cookbook sre.hosts.downtime --minutes 30 -r "moving lvs1017 link to row E back to spine" -t T366361 lvs1017.eqiad.wmnet
2) Disable PyBal on lvs1017
sudo systemctl stop pybal
3) Check that the traffic moves to lvs1020
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
4) DC-Ops move back cable
lsw1-e1-eqiad xe-0/0/8 --> ssw1-e1-eqiad xe-0/0/32
5) Validate that VLANs are reachable on lvs1017
lvs1017: ./check_vlans.sh
6) Re-enable PyBal on lvs1017
sudo systemctl start pybal
7) Check connections move back to lvs1017
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
8) Remove downtime for lvs1017
sudo cookbook sre.hosts.remove-downtime 'lvs1017.eqiad.wmnet'
### LVS1018 Move back
1) Downtime lvs1018
sudo cookbook sre.hosts.downtime --minutes 30 -r "moving lvs1018 link to row E back to spine" -t T366361 lvs1018.eqiad.wmnet
2) Disable PyBal on lvs1018
sudo systemctl stop pybal
3) Check that the traffic moves to lvs1020
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
4) DC-Ops move back cable
lsw1-e1-eqiad xe-0/0/9 --> ssw1-e1-eqiad xe-0/0/33
5) Validate that VLANs are reachable on lvs1018
lvs1018: ./check_vlans.sh
6) Re-enable PyBal on lvs1018
sudo systemctl start pybal
7) Check connections move back to lvs1018
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
8) Remove downtime for lvs1018
sudo cookbook sre.hosts.remove-downtime 'lvs1018.eqiad.wmnet'
### LVS1019 Move back
1) Downtime lvs1019
sudo cookbook sre.hosts.downtime --minutes 30 -r "moving lvs1019 link to row F back to spine" -t T366361 lvs1019.eqiad.wmnet
2) Disable PyBal on lvs1019
sudo systemctl stop pybal
3) Check that the traffic moves to lvs1020
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
4) DC-Ops move back cable
lsw1-f1-eqiad xe-0/0/8 --> ssw1-f1-eqiad xe-0/0/32
5) Validate that VLANs are reachable on lvs1019
lvs1019: ./check_vlans.sh
6) Re-enable PyBal on lvs1019
sudo systemctl start pybal
7) Check connections move back to lvs1019
https://grafana-rw.wikimedia.org/d/VhmZkr5nz/cathal-lvs-stacked
8) Remove downtime for lvs1019
sudo cookbook sre.hosts.remove-downtime 'lvs1019.eqiad.wmnet'
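
Each move-back cable line should be the exact reverse of the corresponding forward move. A quick way to cross-check this is to derive the reversed line from the forward one and compare it with what is written in the move-back step. The sketch below assumes the `A --> B` notation used in this paste; the function name `reverse_move` is hypothetical.

```shell
#!/usr/bin/env bash
# Swap the two sides of a "device port --> device port" cable move line.
reverse_move() {
    local from="${1%% --> *}"   # everything before the arrow
    local to="${1##* --> }"     # everything after the arrow
    printf '%s --> %s\n' "$to" "$from"
}

# Forward moves as listed in the "cable move to leaf" sections.
while IFS= read -r move; do
    reverse_move "$move"
done <<'EOF'
ssw1-e1-eqiad xe-0/0/32 --> lsw1-e1-eqiad xe-0/0/8
ssw1-e1-eqiad xe-0/0/33 --> lsw1-e1-eqiad xe-0/0/9
ssw1-f1-eqiad xe-0/0/32 --> lsw1-f1-eqiad xe-0/0/8
EOF
```

The three printed lines should match step 4 of the LVS1017, LVS1018 and LVS1019 move-back sections respectively.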
### TIDY UP TEMP PORT CONFIGURATION ON LEAF SWITCHES
cd ~/repos/random_wmf/netbox_scripts/spine_upgrade
./move_network_link.py -k <nb_key> move_back.txt
homer "lsw*1-eqiad*" commit "decom temp ports"