
Rack/cable/configure asw2-b-eqiad switch stack
Closed, Resolved (Public)

Description

Similar to T148506.

This is about row B only:

  • Rack and cable the switches according to the diagram (blocked on T187118) [Chris]
    rows-abc-eqiad-cabling.png (73 KB)
  • Connect mgmt/serial [Chris]
  • Check via serial that switches work, ports are configured as down [Arzhel]
  • Stack the switches, upgrade JunOS, initial switch configuration [Arzhel] (see the verification sketch after this list)
  • Add to DNS [Arzhel]
  • Add to LibreNMS & Rancid [Arzhel]
  • Switch port configuration to match asw-b (+login announcement) [Arzhel]
  • Solve snowflakes [Chris/Arzhel]
WAS xe-3/1/0 description "labnet1001 eth5"  MOVED TO: xe-2/0/22
WAS xe-3/1/2 description "labnet1001 eth4"  MOVED TO: xe-2/0/24
WAS ge-3/0/33 description "labnet1001 eth0" MOVED TO: ge-2/0/23
WAS xe-4/1/0 description "labnet1002 eth3" MOVED TO: xe-4/0/45
WAS xe-4/1/2 description "labnet1002 eth4" MOVED TO: xe-4/0/44
  • Pre-populate FPC2, FPC4 and FPC7 (QFX) with copper SFPs matching the current production servers in racks 2, 4 and 7 [Chris]
  • Add to Icinga [Arzhel]
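
For the stacking and ports-down checks above, a minimal Junos sketch (hostnames follow the rest of this task; these are the standard commands, not a transcript from these switches):

# all members should be Prsnt with the expected roles, and the same JunOS on every FPC
ayounsi@asw2-b-eqiad> show virtual-chassis status
ayounsi@asw2-b-eqiad> show version
# server-facing ports should still be admin down at this stage
ayounsi@asw2-b-eqiad> show interfaces terse | match "ge-|xe-"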

Thursday 22nd, noon Eastern (4pm UTC), 3h (for all 3 rows)

  • Verify cr2-eqiad is VRRP master (see the sketch after this list)
  • Disable interfaces from cr1-eqiad to asw-b
  • Move cr1 router uplinks from asw-b to asw2-b (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/44 -> cr1-eqiad:xe-3/0/1
xe-2/0/45 -> cr1-eqiad:xe-4/0/1
xe-7/0/44 -> cr1-eqiad:xe-4/1/1
xe-7/0/45 -> cr1-eqiad:xe-3/1/1
  • Connect asw2-b with asw-b with 2x10G (and document cable IDs if different) [Chris]
xe-2/0/43 -> asw-b-eqiad:xe-2/1/0
xe-7/0/43 -> asw-b-eqiad:xe-7/1/0
  • Verify traffic is properly flowing through asw2-b
  • Update interface descriptions on cr1
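
For the first two steps, a minimal Junos sketch; ae2 as the row B aggregate comes from later comments (the cr2 side), and the same bundle name on cr1 is an assumption:

# on cr2-eqiad: the row B VRRP groups should all show master
ayounsi@cr2-eqiad> show vrrp summary | match ae2
# on cr1-eqiad: take the old asw-b uplinks down before recabling
ayounsi@cr1-eqiad# set interfaces ae2 disable
ayounsi@cr1-eqiad# commit confirmed 5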

Before maintenance

  • Failover hosts

In maintenance window July 31st (3pm UTC, 11am EDT, 8am PDT), 4h.

  • Move servers from asw-b to asw2-b (1st batch) [Chris]

(full list in T183585#4466638)

ge-8/0/6        up    up   dumpsdata1001
  • puppetmaster1001
  • October 24th 14:00 UTC server move (3rd batch) (cloud)
ge-5/0/6        up    up   labvirt1009:eth0
ge-5/0/7        up    up   labvirt1008:eth0
ge-5/0/8        up    up   labvirt1007:eth0
ge-5/0/9        up    up   labvirt1009:eth1
ge-5/0/10       up    up   labvirt1008:eth1
ge-5/0/11       up    up   labvirt1007:eth1
ge-5/0/18       up    up   labvirt1004 eth0
ge-5/0/19       up    up   labvirt1005 eth0
ge-5/0/20       up    up   labvirt1006 eth0
ge-5/0/21       up    up   labvirt1004 eth1
ge-5/0/22       up    up   labvirt1005 eth1
ge-5/0/23       up    up   labvirt1006 eth1
ge-2/0/20       up    up   labvirt1015-eth0
ge-2/0/21       up    up   labvirt1015-eth1
ge-3/0/7        up    up   labvirt1012 eth0
ge-3/0/8        up    up   labvirt1001 eth0
ge-3/0/9        up    up   labvirt1012 eth1
ge-3/0/14       up    up   labvirt1010 eth0
ge-3/0/15       up    up   labvirt1010 eth1
ge-3/0/16       up    up   labvirt1011 eth0
ge-3/0/17       up    up   labvirt1011 eth1
ge-3/0/18       up    up   labnodepool1001
ge-3/0/20       up    up   labvirt1002 eth0
ge-3/0/21       up    up   labvirt1003 eth0
ge-3/0/37       up    up   labvirt1001 eth1
ge-3/0/38       up    up   labvirt1002 eth1
ge-3/0/39       up    up   labvirt1003 eth1
ge-4/0/0        up    up   labvirt1013 eth0
ge-4/0/9        up    up   labvirt1016-eth0
ge-4/0/12       up    up   labvirt1016-eth1
ge-4/0/18       up    down labvirt1021 eth2
ge-4/0/36       up    up   labvirt1013 eth1
ge-4/0/39       up    down labnet1002 eth1
xe-4/1/0        up    down labnet1002 eth3
xe-4/1/2        up    down labnet1002 eth4
ge-5/0/0        up    up   labvirt1014 eth0
ge-5/0/3        up    up   labvirt1014 eth1
ge-7/0/6        up    up   labvirt1017
ge-7/0/11       up    up   labvirt1017-eth1
ge-7/0/13       up    down labvirt1020
ge-8/0/5        up    up   labvirt1018
ge-8/0/11       up    up   labpuppetmaster1001
ge-8/0/12       up    down labnet1004
ge-8/0/13       up    up   labvirt1018-eth1
ge-8/0/23       up    down labvirt1022 eth3
  • Failover VRRP master to cr1
  • Verify traffic is properly flowing through cr1/asw2
  • Disable interface between cr2 and asw-b-eqiad:ae2
  • Move cr2 router uplinks from asw-b to asw2-b (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/46 -> cr2-eqiad:xe-3/0/1
xe-2/0/47 -> cr2-eqiad:xe-4/0/1
xe-7/0/46 -> cr2-eqiad:xe-4/1/1
xe-7/0/47 -> cr2-eqiad:xe-3/1/1
  • Re-enable cr2 interfaces
  • Move VRRP master back to cr2
  • Verify no more traffic on asw-b<->asw2-b link [Arzhel] (see the sketch after this list)
  • Disable asw-b<->asw2-b link [Arzhel]
  • Verify all servers are healthy, monitoring happy
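
For the verification steps in this list, a minimal Junos sketch (the xe- interface names come from the cabling notes above; ae2 as the bundle name follows the cr2 side noted earlier):

# confirm the VIPs are mastered where expected before and after each failover
ayounsi@cr1-eqiad> show vrrp summary | match ae2
# confirm the temporary asw-b<->asw2-b link is idle before disabling it
ayounsi@asw2-b-eqiad> show interfaces xe-2/0/43 | match rate
ayounsi@asw2-b-eqiad> show interfaces xe-7/0/43 | match rate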

After maintenance window

  • Update interface descriptions on cr2
  • Cleanup config, monitoring, DNS, etc.
  • Wipe & unrack asw-b
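
For the wipe, the standard Junos factory reset is a single command; a minimal sketch (the actual procedure lives in T208788):

# erase configuration, logs and data, and reboot to factory defaults
robh@asw-b-eqiad> request system zeroize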

Event Timeline


Disabling copper ports for decom per T176957.

robh@asw-b-eqiad# show | compare
[edit interfaces ge-4/0/5]
+ disable;

Adding second Ethernet cables:
labvirt1021 (eth3) ge-4/0/34 (added SFP-T to new switch)
labvirt1022 (eth3) ge-8/0/23

ayounsi updated the task description.

@Cmjohnson those two VC ports show up as down, could you please look at the cabling?

ayounsi@asw2-b-eqiad> show virtual-chassis vc-port | except vcp- 
fpc1:
PIC / Port   Type              Trunk  Status       Speed        Neighbor
1/2         Configured         -1    Down         40000
----------
fpc8:
PIC / Port   Type              Trunk  Status       Speed        Neighbor
1/2         Configured         -1    Down         40000
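
Before reseating cables, it's worth confirming the 40G optics are detected at all; a minimal sketch (not a transcript from these switches):

# the QSFPs for the VC ports should show up in the hardware inventory on fpc1 and fpc8
ayounsi@asw2-b-eqiad> show chassis hardware | match QSFP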

@ayounsi FYI I deleted 2 interfaces from b4

cmjohnson@asw-b-eqiad# show |compare
[edit interfaces]

-  ge-4/0/23 {
-      description cerium;
-      disable;
-  }
-  ge-4/0/24 {
-      description praseodymium;
-      disable;
-  }


asw2-b updated.

@ayounsi please check

xe-2/0/44 -> cr1-eqiad:xe-3/0/1 #1989
xe-2/0/45 -> cr1-eqiad:xe-4/0/1 #3457
xe-7/0/44 -> cr1-eqiad:xe-4/1/1 #3459
xe-7/0/45 -> cr1-eqiad:xe-3/1/1 #1991

xe-2/0/43 -> asw-b-eqiad:xe-2/1/0 #3025
xe-7/0/43 -> asw-b-eqiad:xe-7/1/0 #2993

EDIT: Outdated, please see a more recent comment below.

From a DB point of view I don't think that will be possible on that given date.

The enwiki master is on that row, as well as one of the es masters.

Especially for the enwiki master, we were hoping to get it failed over during the DC switchover, as that host will need to be decommissioned soon (T186320).

Failing it over now requires downtime; doing it during the DC switchover requires no read-only time and is less risky and less impactful. That's why we are aiming for that.

Removed db1020 switch port ge-1/0/4


asw2-b updated.

Cmjohnson reopened this task as Open.
Cmjohnson claimed this task.

I disabled ge-1/0/1 (graphite1002) for decom; I did this on both switches. The port labels were not changed.

Disabled mobile1004 (B4, ge-4/0/16) on both switches; the description remains mobile1004 until it is removed from the rack.

Aiming to do the asw-b to asw2-b migration on July 31st (3pm UTC, 11am EDT, 8am PDT), 4h.
Due to people's vacations, we might have to do the move in waves; what can't be moved on that day will move in a later window.

Exact timeline TBD, but similar to T148506 (and somewhat to T169345).

Here is the list of servers that need to move as of today.
They will each need a few seconds of downtime, the time to move their uplinks to the new switches.

Please let me know if we need to take particular care of some servers (e.g. wait for one cluster member to come back up before moving to the next one, fail over services, etc.).

People subscribed to the task should know the drill.
I changed the format a bit, listing hosts more accurately and trying to ping the correct people for feedback. (Still using this etherpad: https://etherpad.wikimedia.org/p/p0Iq39YWZg )
Don't hesitate to add people or remove yourself.

The whole list is in the task description so you can edit it.

@ayounsi regarding databases

All of these are passive, so no special care is needed:
dbproxy1004
dbproxy1005
dbproxy1006

db1072 -> m3 master. This will affect writes on Phabricator, so let's try to keep the downtime to the minimum (CC @mmodell). The downtime when we did this on the fly with the s6 primary master (frwiki, ruwiki, jawiki) was around 5 seconds, which I would say is OK (see the downtime probe sketch at the end of this comment).

db1073 -> m5 master. This will affect writes on the following databases (CC cloud-services-team), so let's try to keep downtime to the minimum:

root@db1073.eqiad.wmnet[(none)]> show databases;
+------------------------+
| Database               |
+------------------------+
| designate              |
| designate_pool_manager |
| glance                 |
| heartbeat              |
| information_schema     |
| keystone               |
| labsdbaccounts         |
| labspuppet             |
| labswiki               |
| labtestwiki            |
| mysql                  |
| neutron                |
| nodepooldb             |
| nova                   |
| nova_api               |
| performance_schema     |
| striker                |
| test_labsdbaccounts    |
| testreduce_0715        |
| testreduce_vd          |
+------------------------+

db1052 is s1 primary master which will be failed over before the maintenance T197069
es1014 is es3 primary master which will be failed over before the maintenance T197073
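
To make the few seconds of m3/m5 master downtime visible while an uplink is moved, a simple client-side probe is enough; a minimal shell sketch (the host name is from above, the access method is an assumption):

# print a timestamp for every second the m3 master does not answer
while true; do
  mysql -h db1072.eqiad.wmnet -e 'SELECT 1' >/dev/null 2>&1 || date
  sleep 1
done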

As an addendum to T183585#4427995: because of T180918, we also need to depool the other replica DBs ahead of the maintenance, but that should be trivial to do.


If July 31st is the final day/time, could you put it somewhere in the task description? It is hard to find among all the comments.
Thanks!

Adding some comments about all the hosts I am listed for:

  • aqs1008 - would just need to be depooled via pybal before proceeding (conftool sketch below)
  • druid1005 - would need to be depooled as well
  • analytics10[46-51,61-63,72,73] - if possible, doing two at a time rather than all at once would be really great :)
  • kafka1002 - this one would need to be depooled from pybal (adding @mobrovac as FYI)
  • kafka-jumbo1003 - no issue
  • notebook1003 - no issue
  • mc10[24-27] - these shouldn't be a big problem, but if possible I'd do them one at a time (memcached hosts).

I am also going with the assumption that the maintenance will imply only a brief network outage (if any).
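
For the pybal depools mentioned above, a minimal conftool sketch (the exact object names are assumptions; check with get first):

# check the current state, depool before the move, repool after
confctl select 'name=aqs1008.eqiad.wmnet' get
confctl select 'name=aqs1008.eqiad.wmnet' set/pooled=no
confctl select 'name=aqs1008.eqiad.wmnet' set/pooled=yes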

I noticed some hosts assigned only to @Joe; since he is on vacation I'll add some comments:

  • rdb1004 - IIUC from T196685#4267110 it should be decommed, so no issue.
  • conf1005 - this one only runs zookeeper for the moment (not etcd), so it is an analytics-only concern. No action is required (but a heads-up beforehand would be great).

For the mediawiki hosts:

mw12[84-90] - APIs Row B

mw129[3-6] - videoscalers

mw1297 - ?? - I don't see anything related to this host in puppet

mw1298 - spare

mw[1299-1306] - jobrunners

mw13[13-17] - APIs Row B

mw1318 - videoscaler

I am not seeing big issues for the above hosts, but as I wrote before, these need to be done in small batches to avoid too many hosts being down at once (especially the API servers). I am not sure how @Joe managed this last time; I could be available (if he is still on vacation) to depool/pool servers if needed.

elastic*, logstash* and wdqs* should be entirely transparent. Elastic might scream a bit about too many unallocated shards, but at most this should cause some minor response time degradation.

maps* /should/ be ok as well, but I don't have as much experience with this kind of operation on maps. Ping me when touching it and I'll keep an eye on it.

Change 449141 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool all the hosts in row B

https://gerrit.wikimedia.org/r/449141

Change 449141 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool all the hosts in row B

https://gerrit.wikimedia.org/r/449141

Mentioned in SAL (#wikimedia-operations) [2018-07-31T04:54:51Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool all hosts in row B - T183585 (duration: 00m 50s)

Change 449400 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool external store hosts in row B

https://gerrit.wikimedia.org/r/449400

Change 449400 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool external store hosts in row B

https://gerrit.wikimedia.org/r/449400

Mentioned in SAL (#wikimedia-operations) [2018-07-31T15:01:59Z] <XioNoX> starting the eqiad row B servers move - T183585

Change 449484 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] eqiad: temporarily remove chromium from LVS nameservers

https://gerrit.wikimedia.org/r/449484

Change 449484 merged by Ayounsi:
[operations/puppet@production] eqiad: temporarily remove chromium from LVS nameservers

https://gerrit.wikimedia.org/r/449484

Mentioned in SAL (#wikimedia-operations) [2018-07-31T17:12:55Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool all hosts in row B - T183585 (duration: 00m 51s)

Will clean up the description with the remaining servers.

Not moved: puppetmaster1001, see T200838

Moved:

=== No special needs ===

ge-1/0/2        up    up   wdqs1007
ge-1/0/9        up    up   db1083
ge-1/0/11       up    up   ms-be1022
ge-1/0/12       up    up   db1112
ge-1/0/16       up    up   kafka-jumbo1003
ge-1/0/17       up    up   db1076
ge-1/0/19       up    up   db1084
ge-1/0/20       up    up   es1013
ge-1/0/21       up    up   es1014
ge-1/0/22       up    up   db1077

ge-2/0/0        up    up   analytics1072
ge-2/0/9        up    up   db1099
ge-2/0/18       up    up   ms-be1020
ge-2/0/19       up    up   db1072  <- minimise downtime  (issue)

ge-3/0/0        up    up   ms-be1023
ge-3/0/1        up    up   db1051
ge-3/0/2        up    up   db1052
ge-3/0/3        up    up   analytics1046 <- do two at a time
ge-3/0/4        up    up   analytics1047
ge-3/0/5        up    up   analytics1048 <- do two at a time
ge-3/0/6        up    up   analytics1049
ge-3/0/10       up    up   analytics1050 <- do two at a time
ge-3/0/11       up    up   analytics1051
ge-3/0/12       up    up   elastic1036
ge-3/0/13       up    up   elastic1037
ge-3/0/22       up    up   elastic1038
ge-3/0/23       up    up   elastic1039
ge-3/0/19       up    up   promethium
ge-3/0/24       up    up   db1085
ge-3/0/25       up    up   db1086
ge-3/0/27       up    up   ms-be1031
ge-3/0/29       up    up   db1104
ge-3/0/26       up    up   db1073  <- minimise downtime

ge-4/0/1        up    up   logstash1005  (issue, port was admin down)
ge-4/0/2        up    up   maps1002   <- Ping gehel before moving  (issue, port was admin down)
ge-4/0/7        up    up   kubestage1002
ge-4/0/8        up    up   iron
ge-4/0/13       up    up   bast1001
ge-4/0/14       up    up   phab1001
ge-4/0/15       up    up   poolcounter1002
ge-4/0/25       up    up   ruthenium
ge-4/0/27       up    up   elastic1049
ge-4/0/28       up    up   silver
ge-4/0/31       up    up   elastic1050
ge-4/0/32       up    up   conf1005  <- heads-up to Luca
ge-4/0/37       up    up   ripe atlas
ge-4/0/38       up    up   californium
ge-4/0/43       up    up   rdb1004
ge-4/0/6        up    up   kafka1002 <- Depool in pybal (done)  (issue, connected to wrong port)
ge-4/0/30       up    up   prometheus1004 <- Depool in pybal (done+repooled)

ge-5/0/1        up    up   ms-be1032
ge-5/0/2        up    up   ms-be1034
ge-5/0/5        up    up   db1098
ge-5/0/12       up    up   dbproxy1004
ge-5/0/13       up    up   dbproxy1005
ge-5/0/14       up    up   dbproxy1006
ge-5/0/15       up    up   ms-be1016
ge-5/0/16       up    up   ms-be1017
ge-5/0/17       up    up   ms-be1018

B6:
ge-6/0/0        up    up   mw1284 <- do in small batch
ge-6/0/1        up    up   mw1285
ge-6/0/2        up    up   mw1286
ge-6/0/3        up    up   mw1287
ge-6/0/4        up    up   mw1288
ge-6/0/5        up    up   mw1289
ge-6/0/6        up    up   mw1290
ge-6/0/7        up    up   thumbor1001 <- Depool in pybal, one at a time (done)
ge-6/0/8        up    up   thumbor1002 <- Depool in pybal, one at a time (done)
ge-6/0/9        up    up   mw1293
ge-6/0/10       up    up   mw1294
ge-6/0/11       up    up   mw1295
ge-6/0/12       up    up   mw1296
ge-6/0/13       up    up   mw1297
ge-6/0/14       up    up   mw1298
ge-6/0/15       up    up   mw1299
ge-6/0/16       up    up   mw1300
ge-6/0/17       up    up   mw1301
ge-6/0/18       up    up   mw1302
ge-6/0/19       up    up   mw1303
ge-6/0/20       up    up   mw1304
ge-6/0/21       up    up   mw1305
ge-6/0/22       up    up   mw1306
ge-6/0/23       up    up   elastic1046
ge-6/0/24       up    up   elastic1047
ge-6/0/25       up    up   elastic1028
ge-6/0/27       up    up   kubernetes1002
ge-6/0/28       up    up   aqs1008 <- Depool via pybal (depooled wrong host...)
ge-6/0/34       up    up   mc1024  <- Do one at a time
ge-6/0/35       up    up   mc1025
ge-6/0/36       up    up   mc1026  <- Do one at a time
ge-6/0/37       up    up   mc1027

B7:
ge-7/0/0        up    up   mw1313 <- do in small batch
ge-7/0/1        up    up   mw1314
ge-7/0/2        up    up   mw1315
ge-7/0/3        up    up   mw1316
ge-7/0/4        up    up   mw1317
ge-7/0/5        up    up   mw1318
ge-7/0/8        up    up   restbase-dev1005
ge-7/0/10       up    up   ores1003
ge-7/0/12       up    up   druid1005 <- Depool in pybal (done+repooled)
ge-7/0/15       up    up   analytics1073
ge-7/0/26       up    up   wtp1031
ge-7/0/27       up    up   wtp1032
ge-7/0/28       up    up   wtp1033

B8:
ge-8/0/2        up    up   analytics1061 <- Do two at a time
ge-8/0/3        up    up   analytics1062
ge-8/0/4        up    up   analytics1063
ge-8/0/8        up    up   wtp1034
ge-8/0/9        up    up   wtp1035
ge-8/0/10       up    up   ores1004
ge-8/0/15       up    up   db1113
ge-8/0/17       up    up   notebook1003

=== Later in the maintenance ===
ge-5/0/4        up    up   labweb1001
ge-7/0/7        up    up   labcontrol1004
ge-8/0/14       up    up   labnodepool1002

=== Need to be depooled before move ===

B4:
ge-4/0/10       up    up   chromium  <- Depool in pybal + remove from lvs resolv.conf (done+repooled)
ge-4/0/19       up    up   lvs1004 <- puppet off, pybal off
ge-4/0/20       up    up   lvs1005 <- puppet off, pybal off
ge-4/0/21       up    up   lvs1006 <- puppet off, pybal off

B6:
ge-6/0/45       up    up   lvs1001:eth1 <- puppet off, pybal off
ge-6/0/46       up    up   lvs1002:eth1 <- puppet off, pybal off
ge-6/0/47       up    up   lvs1003:eth1 <- spare

=== Disable puppet everywhere before move ===
ge-4/0/26       up    up   rhodium
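
For the "puppet off, pybal off" hosts above, a minimal per-host sketch (the pybal service name is an assumption; run before the window and revert after):

# keep puppet from restarting pybal mid-maintenance, then stop pybal
sudo puppet agent --disable 'row B switch move - T183585'
sudo systemctl stop pybal
# after the move
sudo systemctl start pybal
sudo puppet agent --enable
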
ayounsi updated the task description.
ayounsi updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:02:51Z] <XioNoX> set vrrp priority 70 on cr2-eqiad:ae2 to failover VIP to cr1 - T183585

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:26:40Z] <XioNoX> re-enable ae2 on cr2-eqiad - T183585

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:37:25Z] <XioNoX> remove vrrp priority 70 on cr2-eqiad:ae2 to failback VIPs to cr2 - T183585
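
The SAL entries above map to a pair of config changes on cr2-eqiad; a minimal Junos sketch (<unit>, <address> and <group> are placeholders, not the real values):

# lower cr2's priority below cr1's so the VIPs fail over
[edit interfaces ae2 unit <unit> family inet address <address>]
ayounsi@cr2-eqiad# set vrrp-group <group> priority 70
ayounsi@cr2-eqiad# commit
# then delete the override so the VIPs fail back to cr2
ayounsi@cr2-eqiad# delete vrrp-group <group> priority
ayounsi@cr2-eqiad# commit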

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:41:19Z] <XioNoX> remove asw-b-eqiad from LibreNMS - T183585

Mentioned in SAL (#wikimedia-operations) [2018-11-05T18:43:50Z] <XioNoX> delete asw2-b - asw-b interface - T183585

Change 471781 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Remove asw-b-eqiad from Rancid, Icinga and Smokeping

https://gerrit.wikimedia.org/r/471781

Change 471783 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Remove asw-b-eqiad mgmt from DNS

https://gerrit.wikimedia.org/r/471783

Change 471781 merged by Ayounsi:
[operations/puppet@production] Remove asw-b-eqiad from Rancid, Icinga and Smokeping

https://gerrit.wikimedia.org/r/471781

Finally all set here. Thank you all. Opened T208788 for the wipe & unrack.

Change 471783 merged by Ayounsi:
[operations/dns@master] Remove asw-b-eqiad mgmt from DNS

https://gerrit.wikimedia.org/r/471783