Similar to T148506.
This is about row B only:
[x] Rack and cable the switches according to diagram (blocked on T187118) [Chris] {F11996449}
[x] Connect mgmt/serial [Chris]
[x] Check via serial that switches work, ports are configured as down [Arzhel]
[x] Stack the switch, upgrade JunOS, initial switch configuration [Arzhel]
[x] Add to DNS [Arzhel]
[x] Add to LibreNMS & Rancid [Arzhel]
[x] Switch ports configuration to match asw-b (+login announcement) [Arzhel]
[x] Solve snowflakes [Chris/Arzhel]
```
WAS xe-3/1/0 description "labnet1001 eth5" MOVED TO: xe-2/0/22
WAS xe-3/1/2 description "labnet1001 eth4" MOVED TO: xe-2/0/24
WAS ge-3/0/33 description "labnet1001 eth0" MOVED TO: ge-2/0/23
WAS xe-4/1/0 description "labnet1002 eth3" MOVED TO: xe-4/0/45
WAS xe-4/1/2 description "labnet1002 eth4" MOVED TO: xe-4/0/44
```
[x] Pre-populate FPC2, FPC4, and FPC7 (QFX) with copper SFPs matching the current production servers in racks 2, 4, and 7 [Chris]
[x] Add to Icinga [Arzhel]
**Thursday 22nd, noon Eastern (4pm UTC), 3h (for all 3 rows)**
[x] Verify cr2-eqiad is VRRP master
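A quick way to check mastership from the Junos CLI (a sketch; the expected states assume cr2 is currently master for the row B groups):
```
cr2-eqiad> show vrrp summary    <- row B groups should show "master" here
cr1-eqiad> show vrrp summary    <- same groups should show "backup" here
```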
[x] Disable interfaces from cr1-eqiad to asw-b
[x] Move cr1 router uplinks from asw-b to asw2-b (and document cable IDs if different) [Chris/Arzhel]
```
xe-2/0/44 -> cr1-eqiad:xe-3/0/1
xe-2/0/45 -> cr1-eqiad:xe-4/0/1
xe-7/0/44 -> cr1-eqiad:xe-4/1/1
xe-7/0/45 -> cr1-eqiad:xe-3/1/1
```
[x] Connect asw2-b to asw-b with 2x10G (and document cable IDs if different) [Chris]
```
xe-2/0/43 -> asw-b-eqiad:xe-2/1/0
xe-7/0/43 -> asw-b-eqiad:xe-7/1/0
```
[x] Verify traffic is properly flowing through asw2-b
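Possible Junos checks for this (a sketch; interface names taken from the uplink list above):
```
asw2-b> show interfaces xe-2/0/44 statistics    <- input/output rates should be non-zero
asw2-b> monitor interface traffic               <- live rates across all interfaces
```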
[x] Update interfaces descriptions on cr1
**Before maintenance**
[] Failover hosts TBD?
**In maintenance window April 10th (3pm UTC, 11am EDT, 8am PDT), 4h.**
[] Downtime switch/hosts in Icinga
[] Failover VRRP master to cr1
[] Verify traffic is properly flowing through cr1/asw2
[] Disable interface between cr2 and asw-b-eqiad:ae2
[] Move servers from asw-b to asw2-b [Chris]
Servers for which I wasn't able to identify an owner:
```
iron <- experimental bastion, no special needs
```
@elukey
```
aqs1008
```
@ArielGlenn
```
dumpsdata1001
```
@Dzahn
```
phab1001
```
@ssastry
(ruthenium is the Parsoid test server, so it should be OK)
```
ruthenium
```
@fgiunchedi
These need to be depooled from LVS one at a time and then re-pooled
```
thumbor1001
thumbor1002
```
@Eevans / @mobrovac
Cassandra instances there need to be drained before switching the server off, cf. https://wikitech.wikimedia.org/wiki/Cassandra
```
restbase-dev1005
```
Decommissioned
```
promethium
bast1001
californium
silver
```
@elukey
```
druid1005
analytics1046
analytics1047
analytics1048
analytics1049
analytics1050
analytics1051
analytics1061
analytics1062
analytics1063
analytics1072
analytics1073
kafka1002
kafka-jumbo1003
notebook1003
mc1024
mc1025
mc1026
mc1027
```
DBs: @jcrespo and @Marostegui are already aware
```
db1051 -> scheduled for decommissioning T195484 host down, can be ignored probably
db1052
db1072 -> misc master, special care needed see T183585#4427995
db1073 -> misc master special care needed see T183585#4427995
db1076
db1077
db1083
db1084
db1085
db1086
db1098
db1099
db1104
db1112
db1113
dbproxy1004 -> passive
dbproxy1005 -> passive
dbproxy1006 -> passive
es1013
es1014
```
@Gehel
elastic* should not be an issue T187962#4238825
```
elastic1028
elastic1036
elastic1037
elastic1038
elastic1039
elastic1046
elastic1047
elastic1049
elastic1050
logstash1005
maps1002
wdqs1007
```
@akosiaris
relevant: "poolcounter1001: Turns out we did not really need it after all. The sites survived the downtime." T187962#4241998
```
poolcounter1002
kubernetes1002
kubestage1002
ores1003
ores1004
puppetmaster1001
rhodium
wtp1031
wtp1032
wtp1033
wtp1034
wtp1035
```
Traffic: @Vgutierrez, @ema, and me
lvs: disable puppet/pybal on each host, and make sure everything is healthy before proceeding to the next host
rdns?
```
lvs1001:eth1
lvs1002:eth1
lvs1003:eth1
lvs1004
lvs1005
lvs1006
chromium
ripe atlas <- can ignore
```
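The per-host LVS step above might look like this (a sketch; script and service names as commonly used on these hosts, to be confirmed):
```
sudo disable-puppet "row B switch move"    <- so puppet doesn't restart pybal
sudo systemctl stop pybal                  <- BGP routes withdraw; confirm the peer LVS took over before moving the host
```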
@fgiunchedi
"ms-be* to be moved one at a time, just a clean poweroff is enough, no depooling needed."
```
ms-be1016
ms-be1017
ms-be1018
ms-be1020
ms-be1022
ms-be1023
ms-be1031
ms-be1032
ms-be1034
```
@fgiunchedi
Needs to be depooled from LVS, cleanly shut down, and then repooled
```
prometheus1004
```
@joe / @elukey
```
mw1284
mw1285
mw1286
mw1287
mw1288
mw1289
mw1290
mw1293
mw1294
mw1295
mw1296
mw1297
mw1298
mw1299
mw1300
mw1301
mw1302
mw1303
mw1304
mw1305
mw1306
mw1313
mw1314
mw1315
mw1316
mw1317
mw1318
```
@Joe (on vacation during the window)
relevant:
"conf1002 is in row C (etcd connections will be interrupted, we know it can cause issues)." T187962#4241998
```
conf1005
rdb1004
```
Cloud: @chasemp (on vacation during the window), @Andrew
```
labcontrol1004
labnodepool1001
labnodepool1002
labpuppetmaster1001
labvirt1001 eth0
labvirt1001 eth1
labvirt1002 eth0
labvirt1002 eth1
labvirt1003 eth0
labvirt1003 eth1
labvirt1004 eth0
labvirt1004 eth1
labvirt1005 eth0
labvirt1005 eth1
labvirt1006 eth0
labvirt1006 eth1
labvirt1010 eth0
labvirt1010 eth1
labvirt1011 eth0
labvirt1011 eth1
labvirt1012 eth0
labvirt1012 eth1
labvirt1013 eth0
labvirt1013 eth1
labvirt1014 eth0
labvirt1014 eth1
labvirt1015 eth0
labvirt1015 eth1
labvirt1016 eth0
labvirt1016 eth1
labvirt1017
labvirt1017 eth1
labvirt1018
labvirt1018 eth1
labweb1001
virt1010 eth0
virt1010 eth1
virt1011 eth0
virt1011 eth1
virt1012 eth0
virt1012 eth1
```
[] Move cr2 router uplinks from asw-b to asw2-b (and document cable IDs if different) [Chris/Arzhel]
```
xe-2/0/46 -> cr2-eqiad:xe-3/0/1
xe-2/0/47 -> cr2-eqiad:xe-4/0/1
xe-7/0/46 -> cr2-eqiad:xe-4/1/1
xe-7/0/47 -> cr2-eqiad:xe-3/1/1
```
[] Re-enable cr2 interfaces
[] Move VRRP master back to cr2
[] Verify no more traffic on asw-b<->asw2-b link [Arzhel]
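Sketch of the verify/disable steps on asw2-b (interface names from the cabling list above):
```
show interfaces xe-2/0/43 statistics    <- traffic rates should be ~0
show interfaces xe-7/0/43 statistics
configure
set interfaces xe-2/0/43 disable
set interfaces xe-7/0/43 disable
commit
```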
[] Disable asw-b<->asw2-b link [Arzhel]
[] Verify all servers are healthy, monitoring happy
**After maintenance window**
[] Update interfaces descriptions on cr2
[] Cleanup config, monitoring, DNS, etc.
[] Wipe & unrack asw-b