
Rack/cable/configure asw2-b-eqiad switch stack
Closed, Resolved (Public)

Description

Similar to T148506.

This is about row B only:

  • Rack and cable the switches according to the diagram (blocked on T187118) [Chris]
    rows-abc-eqiad-cabling.png (73 KB)
  • Connect mgmt/serial [Chris]
  • Check via serial that switches work, ports are configured as down [Arzhel]
  • Stack the switches, upgrade JunOS, initial switch configuration [Arzhel] (see the verification sketch after this list)
  • Add to DNS [Arzhel]
  • Add to LibreNMS & Rancid [Arzhel]
  • Switch port configuration to match asw-b (+login announcement) [Arzhel]
  • Solve snowflakes [Chris/Arzhel]
WAS xe-3/1/0 description "labnet1001 eth5"  MOVED TO: xe-2/0/22
WAS xe-3/1/2 description "labnet1001 eth4"  MOVED TO: xe-2/0/24
WAS ge-3/0/33 description "labnet1001 eth0" MOVED TO: ge-2/0/23
WAS xe-4/1/0 description "labnet1002 eth3" MOVED TO: xe-4/0/45
WAS xe-4/1/2 description "labnet1002 eth4" MOVED TO: xe-4/0/44
  • Pre-populate FPC2, FPC4 and FPC7 (QFX) with copper SFPs matching the current production servers in racks 2, 4 and 7 [Chris]
  • Add to Icinga [Arzhel]
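
For the stacking and ports-down checks above, a minimal Junos sketch (hostnames follow the rest of this task; these are the standard commands, not a transcript from these switches):

# all members should be Prsnt with the expected roles, and the same JunOS on every FPC
ayounsi@asw2-b-eqiad> show virtual-chassis status
ayounsi@asw2-b-eqiad> show version
# server-facing ports should still be admin down at this stage
ayounsi@asw2-b-eqiad> show interfaces terse | match "ge-|xe-"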

Thursday 22nd, noon Eastern (4pm UTC), 3h (for all 3 rows)

  • Verify cr2-eqiad is VRRP master (see the sketch after this list)
  • Disable interfaces from cr1-eqiad to asw-b
  • Move cr1 router uplinks from asw-b to asw2-b (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/44 -> cr1-eqiad:xe-3/0/1
xe-2/0/45 -> cr1-eqiad:xe-4/0/1
xe-7/0/44 -> cr1-eqiad:xe-4/1/1
xe-7/0/45 -> cr1-eqiad:xe-3/1/1
  • Connect asw2-b with asw-b with 2x10G (and document cable IDs if different) [Chris]
xe-2/0/43 -> asw-b-eqiad:xe-2/1/0
xe-7/0/43 -> asw-b-eqiad:xe-7/1/0
  • Verify traffic is properly flowing through asw2-b
  • Update interface descriptions on cr1
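
For the first two steps, a minimal Junos sketch; ae2 as the row B aggregate comes from later comments (the cr2 side), and the same bundle name on cr1 is an assumption:

# on cr2-eqiad: the row B VRRP groups should all show master
ayounsi@cr2-eqiad> show vrrp summary | match ae2
# on cr1-eqiad: take the old asw-b uplinks down before recabling
ayounsi@cr1-eqiad# set interfaces ae2 disable
ayounsi@cr1-eqiad# commit confirmed 5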

Before maintenance

  • Failover hosts

In maintenance window July 31st (3pm UTC, 11am EDT, 8am PDT), 4h.

  • Move servers from asw-b to asw2-b (1st batch) [Chris]

(full list in T183585#4466638)

ge-8/0/6        up    up   dumpsdata1001
  • puppetmaster1001
  • October 24th 14:00 UTC server move (3rd batch) (cloud)
ge-5/0/6        up    up   labvirt1009:eth0
ge-5/0/7        up    up   labvirt1008:eth0
ge-5/0/8        up    up   labvirt1007:eth0
ge-5/0/9        up    up   labvirt1009:eth1
ge-5/0/10       up    up   labvirt1008:eth1
ge-5/0/11       up    up   labvirt1007:eth1
ge-5/0/18       up    up   labvirt1004 eth0
ge-5/0/19       up    up   labvirt1005 eth0
ge-5/0/20       up    up   labvirt1006 eth0
ge-5/0/21       up    up   labvirt1004 eth1
ge-5/0/22       up    up   labvirt1005 eth1
ge-5/0/23       up    up   labvirt1006 eth1
ge-2/0/20       up    up   labvirt1015-eth0
ge-2/0/21       up    up   labvirt1015-eth1
ge-3/0/7        up    up   labvirt1012 eth0
ge-3/0/8        up    up   labvirt1001 eth0
ge-3/0/9        up    up   labvirt1012 eth1
ge-3/0/14       up    up   labvirt1010 eth0
ge-3/0/15       up    up   labvirt1010 eth1
ge-3/0/16       up    up   labvirt1011 eth0
ge-3/0/17       up    up   labvirt1011 eth1
ge-3/0/18       up    up   labnodepool1001
ge-3/0/20       up    up   labvirt1002 eth0
ge-3/0/21       up    up   labvirt1003 eth0
ge-3/0/37       up    up   labvirt1001 eth1
ge-3/0/38       up    up   labvirt1002 eth1
ge-3/0/39       up    up   labvirt1003 eth1
ge-4/0/0        up    up   labvirt1013 eth0
ge-4/0/9        up    up   labvirt1016-eth0
ge-4/0/12       up    up   labvirt1016-eth1
ge-4/0/18       up    down labvirt1021 eth2
ge-4/0/36       up    up   labvirt1013 eth1
ge-4/0/39       up    down labnet1002 eth1
xe-4/1/0        up    down labnet1002 eth3
xe-4/1/2        up    down labnet1002 eth4
ge-5/0/0        up    up   labvirt1014 eth0
ge-5/0/3        up    up   labvirt1014 eth1
ge-7/0/6        up    up   labvirt1017
ge-7/0/11       up    up   labvirt1017-eth1
ge-7/0/13       up    down labvirt1020
ge-8/0/5        up    up   labvirt1018
ge-8/0/11       up    up   labpuppetmaster1001
ge-8/0/12       up    down labnet1004
ge-8/0/13       up    up   labvirt1018-eth1
ge-8/0/23       up    down labvirt1022 eth3
  • Failover VRRP master to cr1
  • Verify traffic is properly flowing through cr1/asw2
  • Disable interface between cr2 and asw-b-eqiad:ae2
  • Move cr2 router uplinks from asw-b to asw2-b (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/46 -> cr2-eqiad:xe-3/0/1
xe-2/0/47 -> cr2-eqiad:xe-4/0/1
xe-7/0/46 -> cr2-eqiad:xe-4/1/1
xe-7/0/47 -> cr2-eqiad:xe-3/1/1
  • Re-enable cr2 interfaces
  • Move VRRP master back to cr2
  • Verify no more traffic on asw-b<->asw2-b link [Arzhel] (see the sketch after this list)
  • Disable asw-b<->asw2-b link [Arzhel]
  • Verify all servers are healthy, monitoring happy
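
For the verification steps in this list, a minimal Junos sketch (the xe- interface names come from the cabling notes above; ae2 as the bundle name follows the cr2 side noted earlier):

# confirm the VIPs are mastered where expected before and after each failover
ayounsi@cr1-eqiad> show vrrp summary | match ae2
# confirm the temporary asw-b<->asw2-b link is idle before disabling it
ayounsi@asw2-b-eqiad> show interfaces xe-2/0/43 | match rate
ayounsi@asw2-b-eqiad> show interfaces xe-7/0/43 | match rate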

After maintenance window

  • Update interface descriptions on cr2
  • Cleanup config, monitoring, DNS, etc.
  • Wipe & unrack asw-b
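
For the wipe, the standard Junos factory reset is a single command; a minimal sketch (the actual procedure lives in T208788):

# erase configuration, logs and data, and reboot to factory defaults
robh@asw-b-eqiad> request system zeroize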

Event Timeline


Disabling copper ports for decom per T176957.

robh@asw-b-eqiad# show | compare
[edit interfaces ge-4/0/5]
+ disable;

Adding second Ethernet cables:
labvirt1021 (eth3) ge-4/0/34 (added SFP-T to new switch)
labvirt1022 (eth3) ge-8/0/23

ayounsi updated the task description.

@Cmjohnson those two VC ports show up as down, could you please look at the cabling?

ayounsi@asw2-b-eqiad> show virtual-chassis vc-port | except vcp- 
fpc1:
PIC / Port   Type              Trunk  Status       Speed        Neighbor
1/2         Configured         -1    Down         40000
----------
fpc8:
PIC / Port   Type              Trunk  Status       Speed        Neighbor
1/2         Configured         -1    Down         40000
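
Before reseating cables, it's worth confirming the 40G optics are detected at all; a minimal sketch (not a transcript from these switches):

# the QSFPs for the VC ports should show up in the hardware inventory on fpc1 and fpc8
ayounsi@asw2-b-eqiad> show chassis hardware | match QSFP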

@ayounsi FYI I deleted 2 interfaces from b4

cmjohnson@asw-b-eqiad# show |compare
[edit interfaces]

-  ge-4/0/23 {
-      description cerium;
-      disable;
-  }
-  ge-4/0/24 {
-      description praseodymium;
-      disable;
-  }


asw2-b updated.

@ayounsi please check

xe-2/0/44 -> cr1-eqiad:xe-3/0/1 #1989
xe-2/0/45 -> cr1-eqiad:xe-4/0/1 #3457
xe-7/0/44 -> cr1-eqiad:xe-4/1/1 #3459
xe-7/0/45 -> cr1-eqiad:xe-3/1/1 #1991

xe-2/0/43 -> asw-b-eqiad:xe-2/1/0 #3025
xe-7/0/43 -> asw-b-eqiad:xe-7/1/0 #2993

EDIT: Outdated, please see a more recent comment below.

From a DB point of view I don't think that will be possible on that given date.

The enwiki master is on that row, as well as one of the es masters.

Especially for the enwiki master, we were hoping to get it failed over during the DC switchover, as that host will need to be decommissioned soon (T186320).

Failing it over now requires downtime; doing it during the DC switchover requires no read-only time and is less risky and less impactful. That's why we are aiming for that.

Removed db1020 switch port ge-1/0/4


asw2-b updated.

Cmjohnson reopened this task as Open.
Cmjohnson claimed this task.

I disabled ge-1/0/1 (graphite1002) for decom; I did this on both switches. The port labels were not changed.

Disabled mobile1004 (B4, ge-4/0/16) on both switches; the description remains mobile1004 until it is removed from the rack.

Aiming to do the asw-b to asw2-b migration on July 31st (3pm UTC, 11am EDT, 8am PDT), 4h.
Due to people's vacations, we might have to do the move in waves; what can't be moved on that day will move in a later window.

Exact timeline TBD, but similar to T148506 (and somewhat to T169345).

Here is the list of servers that need to move as of today.
They will each need a few seconds of downtime, the time to move their uplinks to the new switches.

Please let me know if we need to take particular care of some servers (e.g. wait for one cluster member to come back up before moving to the next one, fail over services, etc.).

People subscribed to the task should know the drill.
I changed the format a bit, listing hosts more accurately and trying to ping the correct people for feedback. (Still using this etherpad: https://etherpad.wikimedia.org/p/p0Iq39YWZg )
Don't hesitate to add people or remove yourself.

The whole list is in the task description so you can edit it.

@ayounsi regarding databases

All of these are passive, so no special care is needed:
dbproxy1004
dbproxy1005
dbproxy1006

db1072 -> m3 master. This will affect writes on Phabricator, so let's try to keep the downtime to the minimum (CC @mmodell). The downtime when we did this on the fly with the s6 primary master (frwiki, ruwiki, jawiki) was around 5 seconds, which I would say is OK (see the downtime probe sketch at the end of this comment).

db1073 -> m5 master. This will affect writes on the following databases (CC cloud-services-team), so let's try to keep downtime to the minimum:

root@db1073.eqiad.wmnet[(none)]> show databases;
+------------------------+
| Database               |
+------------------------+
| designate              |
| designate_pool_manager |
| glance                 |
| heartbeat              |
| information_schema     |
| keystone               |
| labsdbaccounts         |
| labspuppet             |
| labswiki               |
| labtestwiki            |
| mysql                  |
| neutron                |
| nodepooldb             |
| nova                   |
| nova_api               |
| performance_schema     |
| striker                |
| test_labsdbaccounts    |
| testreduce_0715        |
| testreduce_vd          |
+------------------------+

db1052 is s1 primary master which will be failed over before the maintenance T197069
es1014 is es3 primary master which will be failed over before the maintenance T197073
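
To make the few seconds of m3/m5 master downtime visible while an uplink is moved, a simple client-side probe is enough; a minimal shell sketch (the host name is from above, the access method is an assumption):

# print a timestamp for every second the m3 master does not answer
while true; do
  mysql -h db1072.eqiad.wmnet -e 'SELECT 1' >/dev/null 2>&1 || date
  sleep 1
done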

As an addendum to T183585#4427995: because of T180918, we also need to depool the other replica DBs ahead of the maintenance, but that should be trivial to do.


If July 31st is the final day/time, could you put it somewhere in the task description? It is hard to find among all the comments.
Thanks!

Adding some comments about all the hosts I am listed for:

  • aqs1008 - would just need to be depooled via pybal before proceeding (conftool sketch below)
  • druid1005 - would need to be depooled as well
  • analytics10[46-51,61-63,72,73] - if possible, doing two at a time rather than all at once would be really great :)
  • kafka1002 - this one would need to be depooled from pybal (adding @mobrovac as FYI)
  • kafka-jumbo1003 - no issue
  • notebook1003 - no issue
  • mc10[24-27] - these shouldn't be a big problem, but if possible I'd do them one at a time (memcached hosts).

I am also going with the assumption that the maintenance will imply only a brief network outage (if any).
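
For the pybal depools mentioned above, a minimal conftool sketch (the exact object names are assumptions; check with get first):

# check the current state, depool before the move, repool after
confctl select 'name=aqs1008.eqiad.wmnet' get
confctl select 'name=aqs1008.eqiad.wmnet' set/pooled=no
confctl select 'name=aqs1008.eqiad.wmnet' set/pooled=yes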

I noticed some hosts assigned only to @Joe; since he is on vacation I'll add some comments:

  • rdb1004 - IIUC from T196685#4267110 it should be decommed, so no issue.
  • conf1005 - this one only runs zookeeper for the moment (not etcd), so it is an analytics-only concern. No action is required (but a heads-up beforehand would be great).

For the mediawiki hosts:

mw12[84-90] - APIs Row B

mw129[3-6] - videoscalers

mw1297 - ?? - I don't see anything related to this host in puppet

mw1298 - spare

mw[1299-1306] - jobrunners

mw13[13-17] - APIs Row B

mw1318 - videoscaler

I am not seeing big issues for the above hosts, but as I wrote before, these need to be done in small batches to avoid too many hosts being down at once (especially the API servers). I am not sure how @Joe managed this last time; I could be available (if he is still on vacation) to depool/pool servers if needed.

elastic*, logstash* and wdqs* should be entirely transparent. Elastic might scream a bit about too many unallocated shards, but at most this should cause some minor response time degradation.

maps* /should/ be ok as well, but I don't have as much experience with this kind of operation on maps. Ping me when touching it and I'll keep an eye on it.

Change 449141 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool all the hosts in row B

https://gerrit.wikimedia.org/r/449141

Change 449141 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool all the hosts in row B

https://gerrit.wikimedia.org/r/449141

Mentioned in SAL (#wikimedia-operations) [2018-07-31T04:54:51Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool all hosts in row B - T183585 (duration: 00m 50s)

Change 449400 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool external store hosts in row B

https://gerrit.wikimedia.org/r/449400

Change 449400 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool external store hosts in row B

https://gerrit.wikimedia.org/r/449400

Mentioned in SAL (#wikimedia-operations) [2018-07-31T15:01:59Z] <XioNoX> starting the eqiad row B servers move - T183585

Change 449484 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] eqiad: temporarily remove chromium from LVS nameservers

https://gerrit.wikimedia.org/r/449484

Change 449484 merged by Ayounsi:
[operations/puppet@production] eqiad: temporarily remove chromium from LVS nameservers

https://gerrit.wikimedia.org/r/449484

Mentioned in SAL (#wikimedia-operations) [2018-07-31T17:12:55Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool all hosts in row B - T183585 (duration: 00m 51s)

Will clean up the description with the remaining servers.

Not moved: puppetmaster1001, see T200838

Moved:

=== No special needs ===

ge-1/0/2        up    up   wdqs1007
ge-1/0/9        up    up   db1083
ge-1/0/11       up    up   ms-be1022
ge-1/0/12       up    up   db1112
ge-1/0/16       up    up   kafka-jumbo1003
ge-1/0/17       up    up   db1076
ge-1/0/19       up    up   db1084
ge-1/0/20       up    up   es1013
ge-1/0/21       up    up   es1014
ge-1/0/22       up    up   db1077

ge-2/0/0        up    up   analytics1072
ge-2/0/9        up    up   db1099
ge-2/0/18       up    up   ms-be1020
ge-2/0/19       up    up   db1072  <- minimise downtime  (issue)

ge-3/0/0        up    up   ms-be1023
ge-3/0/1        up    up   db1051
ge-3/0/2        up    up   db1052
ge-3/0/3        up    up   analytics1046 <- do two at a time
ge-3/0/4        up    up   analytics1047
ge-3/0/5        up    up   analytics1048 <- do two at a time
ge-3/0/6        up    up   analytics1049
ge-3/0/10       up    up   analytics1050 <- do two at a time
ge-3/0/11       up    up   analytics1051
ge-3/0/12       up    up   elastic1036
ge-3/0/13       up    up   elastic1037
ge-3/0/22       up    up   elastic1038
ge-3/0/23       up    up   elastic1039
ge-3/0/19       up    up   promethium
ge-3/0/24       up    up   db1085
ge-3/0/25       up    up   db1086
ge-3/0/27       up    up   ms-be1031
ge-3/0/29       up    up   db1104
ge-3/0/26       up    up   db1073  <- minimise downtime

ge-4/0/1        up    up   logstash1005  (issue, port was admin down)
ge-4/0/2        up    up   maps1002   <- Ping gehel before moving  (issue, port was admin down)
ge-4/0/7        up    up   kubestage1002
ge-4/0/8        up    up   iron
ge-4/0/13       up    up   bast1001
ge-4/0/14       up    up   phab1001
ge-4/0/15       up    up   poolcounter1002
ge-4/0/25       up    up   ruthenium
ge-4/0/27       up    up   elastic1049
ge-4/0/28       up    up   silver
ge-4/0/31       up    up   elastic1050
ge-4/0/32       up    up   conf1005  <- heads-up to Luca
ge-4/0/37       up    up   ripe atlas
ge-4/0/38       up    up   californium
ge-4/0/43       up    up   rdb1004
ge-4/0/6        up    up   kafka1002 <- Depool in pybal (done)  (issue, connected to wrong port)
ge-4/0/30       up    up   prometheus1004 <- Depool in pybal (done+repooled)

ge-5/0/1        up    up   ms-be1032
ge-5/0/2        up    up   ms-be1034
ge-5/0/5        up    up   db1098
ge-5/0/12       up    up   dbproxy1004
ge-5/0/13       up    up   dbproxy1005
ge-5/0/14       up    up   dbproxy1006
ge-5/0/15       up    up   ms-be1016
ge-5/0/16       up    up   ms-be1017
ge-5/0/17       up    up   ms-be1018

B6:
ge-6/0/0        up    up   mw1284 <- do in small batch
ge-6/0/1        up    up   mw1285
ge-6/0/2        up    up   mw1286
ge-6/0/3        up    up   mw1287
ge-6/0/4        up    up   mw1288
ge-6/0/5        up    up   mw1289
ge-6/0/6        up    up   mw1290
ge-6/0/7        up    up   thumbor1001 <- Depool in pybal, one at a time (done)
ge-6/0/8        up    up   thumbor1002 <- Depool in pybal, one at a time (done)
ge-6/0/9        up    up   mw1293
ge-6/0/10       up    up   mw1294
ge-6/0/11       up    up   mw1295
ge-6/0/12       up    up   mw1296
ge-6/0/13       up    up   mw1297
ge-6/0/14       up    up   mw1298
ge-6/0/15       up    up   mw1299
ge-6/0/16       up    up   mw1300
ge-6/0/17       up    up   mw1301
ge-6/0/18       up    up   mw1302
ge-6/0/19       up    up   mw1303
ge-6/0/20       up    up   mw1304
ge-6/0/21       up    up   mw1305
ge-6/0/22       up    up   mw1306
ge-6/0/23       up    up   elastic1046
ge-6/0/24       up    up   elastic1047
ge-6/0/25       up    up   elastic1028
ge-6/0/27       up    up   kubernetes1002
ge-6/0/28       up    up   aqs1008 <- Depool via pybal (depooled wrong host...)
ge-6/0/34       up    up   mc1024  <- Do one at a time
ge-6/0/35       up    up   mc1025
ge-6/0/36       up    up   mc1026  <- Do one at a time
ge-6/0/37       up    up   mc1027

B7:
ge-7/0/0        up    up   mw1313 <- do in small batch
ge-7/0/1        up    up   mw1314
ge-7/0/2        up    up   mw1315
ge-7/0/3        up    up   mw1316
ge-7/0/4        up    up   mw1317
ge-7/0/5        up    up   mw1318
ge-7/0/8        up    up   restbase-dev1005
ge-7/0/10       up    up   ores1003
ge-7/0/12       up    up   druid1005 <- Depool in pybal (done+repooled)
ge-7/0/15       up    up   analytics1073
ge-7/0/26       up    up   wtp1031
ge-7/0/27       up    up   wtp1032
ge-7/0/28       up    up   wtp1033

B8:
ge-8/0/2        up    up   analytics1061 <- Do two at a time
ge-8/0/3        up    up   analytics1062
ge-8/0/4        up    up   analytics1063
ge-8/0/8        up    up   wtp1034
ge-8/0/9        up    up   wtp1035
ge-8/0/10       up    up   ores1004
ge-8/0/15       up    up   db1113
ge-8/0/17       up    up   notebook1003

=== Later in the maintenance ===
ge-5/0/4        up    up   labweb1001
ge-7/0/7        up    up   labcontrol1004
ge-8/0/14       up    up   labnodepool1002

=== Need to be depooled before move ===

B4:
ge-4/0/10       up    up   chromium  <- Depool in pybal + remove from lvs resolv.conf (done+repooled)
ge-4/0/19       up    up   lvs1004 <- puppet off, pybal off
ge-4/0/20       up    up   lvs1005 <- puppet off, pybal off
ge-4/0/21       up    up   lvs1006 <- puppet off, pybal off

B6:
ge-6/0/45       up    up   lvs1001:eth1 <- puppet off, pybal off
ge-6/0/46       up    up   lvs1002:eth1 <- puppet off, pybal off
ge-6/0/47       up    up   lvs1003:eth1 <- spare

=== Disable puppet everywhere before move ===
ge-4/0/26       up    up   rhodium
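
For the "puppet off, pybal off" hosts above, a minimal per-host sketch (the pybal service name is an assumption; run before the window and revert after):

# keep puppet from restarting pybal mid-maintenance, then stop pybal
sudo puppet agent --disable 'row B switch move - T183585'
sudo systemctl stop pybal
# after the move
sudo systemctl start pybal
sudo puppet agent --enable
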
ayounsi updated the task description.
ayounsi updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:02:51Z] <XioNoX> set vrrp priority 70 on cr2-eqiad:ae2 to failover VIP to cr1 - T183585

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:26:40Z] <XioNoX> re-enable ae2 on cr2-eqiad - T183585

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:37:25Z] <XioNoX> remove vrrp priority 70 on cr2-eqiad:ae2 to failback VIPs to cr2 - T183585
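
The SAL entries above map to a pair of config changes on cr2-eqiad; a minimal Junos sketch (<unit>, <address> and <group> are placeholders, not the real values):

# lower cr2's priority below cr1's so the VIPs fail over
[edit interfaces ae2 unit <unit> family inet address <address>]
ayounsi@cr2-eqiad# set vrrp-group <group> priority 70
ayounsi@cr2-eqiad# commit
# then delete the override so the VIPs fail back to cr2
ayounsi@cr2-eqiad# delete vrrp-group <group> priority
ayounsi@cr2-eqiad# commit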

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:41:19Z] <XioNoX> remove asw-b-eqiad from LibreNMS - T183585

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:43:50Z] <XioNoX> delete asw2-b - asw-b interface - T183585

Change 471781 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Remove asw-b-eqiad from Rancid, Icinga and Smokeping

https://gerrit.wikimedia.org/r/471781

Change 471783 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Remove asw-b-eqiad mgmt from DNS

https://gerrit.wikimedia.org/r/471783

Change 471781 merged by Ayounsi:
[operations/puppet@production] Remove asw-b-eqiad from Rancid, Icinga and Smokeping

https://gerrit.wikimedia.org/r/471781

Finally all set here. Thank you all. Opened T208788 for the wipe & unrack.

Change 471783 merged by Ayounsi:
[operations/dns@master] Remove asw-b-eqiad mgmt from DNS

https://gerrit.wikimedia.org/r/471783