
Rack/cable/configure asw2-c-eqiad switch stack
Closed, Resolved (Public)

Description

Similar to T148506.

This is about row C only:

  • Rack and cable the switches according to diagram (blocked on T187118) [Chris]
    rows-abc-eqiad-cabling.png (attached cabling diagram)
  • Connect mgmt/serial [Chris]
  • Check via serial that switches work, ports are configured as down [Arzhel]
  • Stack the switches, upgrade JunOS, initial switch configuration (see the verification sketch after this list) [Arzhel]
  • Add to DNS [Arzhel]
  • Add to LibreNMS & Rancid [Arzhel]
  • Uplinks ports configured [Arzhel]
  • Add to Icinga [Arzhel]
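
A minimal verification sketch for the stack/upgrade step above, assuming the usual Juniper Virtual Chassis workflow; the install package path is a placeholder, not taken from this task:

# Check Virtual Chassis membership and the running JunOS release
show virtual-chassis status
show version

# Confirm server-facing ports are administratively down for now
show interfaces terse | match "ge-|xe-"

# Upgrade JunOS (package name/path is a placeholder)
request system software add /var/tmp/junos-install-package.tgz reboot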

Thursday 22nd, noon Eastern (4pm UTC), 3h (for all 3 rows)

  • Disable interfaces from cr1-eqiad to asw-c
  • Move cr1 router uplinks from asw-c to asw2-c (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/44 -> cr1-eqiad:xe-3/0/2
xe-2/0/45 -> cr1-eqiad:xe-3/1/2
xe-7/0/44 -> cr1-eqiad:xe-4/0/2
xe-7/0/45 -> cr1-eqiad:xe-4/1/2
  • Connect asw2-c with asw-c with 2x10G (and document cable IDs if different) [Chris]
xe-2/0/43 -> asw-c-eqiad:xe-1/1/0
xe-7/0/43 -> asw-c-eqiad:xe-7/0/0
  • Verify traffic is properly flowing through asw2-c
  • Update interface descriptions on cr1 (see the hedged example after this list)
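
A hedged example of what the disable / re-describe steps could look like on cr1-eqiad, assuming the cr1-side ports keep the numbers listed above (xe-3/0/2, xe-3/1/2, xe-4/0/2, xe-4/1/2); if these are members of an aggregated-ethernet bundle, the bundle would be handled instead. The description string mirrors the mapping above.

# Before the physical move: administratively disable the asw-c facing links
set interfaces xe-3/0/2 disable
set interfaces xe-3/1/2 disable
set interfaces xe-4/0/2 disable
set interfaces xe-4/1/2 disable
commit

# After the move: re-enable and point the description at asw2-c (one port shown)
delete interfaces xe-3/0/2 disable
set interfaces xe-3/0/2 description "asw2-c-eqiad:xe-2/0/44"
commit

# Sanity check
show interfaces descriptions | match asw2-c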

___

  • Configure switch ports to match asw-c (+ login announcement; see the hedged config sketch after this checklist) [Arzhel]
  • Solve snowflakes [Chris/Arzhel]
  • Pre-populate FPC2, FPC4 and FPC7 (QFX) with copper SFPs matching the current production servers in racks 2, 4 and 7 [Chris]
ge-2/0/2                   kafka-jumbo1004
ge-2/0/3                   db1108
ge-2/0/4                   analytics1074
ge-2/0/6                   labstore1004 eth0
ge-2/0/7                   db1087
ge-2/0/8                   db1088
ge-2/0/9                   db1100
ge-2/0/10                  analytics1065 - no-bw-mon
ge-2/0/11                  analytics1066 - no-bw-mon
ge-2/0/12                  db1101
ge-2/0/13                  db1055
ge-2/0/15                  labstore1001
ge-2/0/17                  db1059
ge-2/0/18                  db1060
ge-2/0/19                  analytics1028 - no-bw-mon
ge-2/0/20                  analytics1029 - no-bw-mon
ge-2/0/21                  analytics1030 - no-bw-mon
ge-2/0/22                  analytics1031 - no-bw-mon
ge-2/0/23                  es1015
ge-2/0/24                  es1016
ge-2/0/26                  analytics1064
ge-2/0/45                  lvs1001:eth3
ge-2/0/46                  lvs1002:eth3
ge-2/0/47                  lvs1003:eth3
ge-4/0/0                   eventlog1001
ge-4/0/1                   mwlog1001
ge-4/0/2                   logstash1002
ge-4/0/3                   rdb1007
ge-4/0/4                   hafnium
ge-4/0/6                   graphite1001
ge-4/0/8                   neodymium
ge-4/0/9                   rdb1001
ge-4/0/13                  ores1006
ge-4/0/16                  analytics1001 - no-bw-mon
ge-4/0/17                  cobalt
ge-4/0/18                  analytics1003 - no-bw-mon
ge-4/0/22                  ganeti1001
ge-4/0/23                  ganeti1002
ge-4/0/25                  radon
ge-4/0/26                  labsdb1006
ge-4/0/27                  labsdb1007
ge-4/0/28                  restbase1012
ge-4/0/29                  restbase1013
ge-4/0/30                  snapshot1006
ge-4/0/31                  deploy1001
ge-4/0/32                  bast1002
ge-4/0/33                  labsdb1010
ge-4/0/34                  elastic1029  
ge-4/0/36                  elastic1022
ge-4/0/37                  kafka-jumbo1005
ge-7/0/0                   dbproxy1007
ge-7/0/1                   analytics1014 - no-bw-mon
ge-7/0/2                   elastic1051
ge-7/0/3                   dbproxy1008
ge-7/0/4                   dbproxy1009
ge-7/0/5                   analytics1075
ge-7/0/6                   notebook1001
ge-7/0/7                   ganeti1003
ge-7/0/8                   ganeti1004
ge-7/0/9                   ocg1001
ge-7/0/11                  analytics1022 - no-bw-mon
ge-7/0/14                  conf1002
ge-7/0/15                  notebook1004
ge-7/0/17                  pc1005
ge-7/0/18                  rdb1002
ge-7/0/22                  terbium
ge-7/0/26                  labcontrol1002
ge-7/0/31                  scb1003
ge-7/0/32                  polonium
ge-7/0/33                  francium
ge-7/0/34                  lithium
ge-7/0/35                  elastic1052
ge-7/0/36                  wtp1040
ge-7/0/37                  wtp1041
ge-7/0/38                  wtp1042
  • Move 10G servers from C8 to C2/4/7 [Filippo/Chris/Arzhel]
ms-be1036
ms-be1035
ms-be1034
ms-be1025
ms-be1024
ms-fe1008
ms-fe1007
  • "Regarding neodymium the only thing that needs to be done is to remember people to use sarin instead for long maintenance tasks"
  • Depool aluminium and poolcounter1001 [akosiaris]
  • Disable ping offload (ping1001) [Arzhel]
  • Redirect ns0 to baham [Arzhel]
  • Announce the read-only time on-wiki to users of frwiki, ruwiki and jawiki at least 1 week in advance - T194939 [Arzhel/Community-liaisons]
  • Depool DB hosts [jcrespo]
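
A hedged sketch of the port configuration and login announcement mentioned above, for a single example port taken from the list (ge-2/0/2 / kafka-jumbo1004). The VLAN name and announcement text are assumptions, not taken from this task; the real config should simply mirror what asw-c has for each port.

# Example access port, mirroring its asw-c counterpart (VLAN name is a placeholder)
set interfaces ge-2/0/2 description "kafka-jumbo1004"
set interfaces ge-2/0/2 unit 0 family ethernet-switching interface-mode access
set interfaces ge-2/0/2 unit 0 family ethernet-switching vlan members private1-c-eqiad

# Login announcement shown at login (text is an example only)
set system login announcement "asw2-c-eqiad - row C migration in progress, see T187962"
commit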

In maintenance window Tuesday May 29th

  • Downtime asw-c hosts in Icinga [Arzhel]
  • Move servers' uplinks from asw-c to asw2-c, round 1 [Chris]
Leave behind: labs*, cp*
es*
db*
analytics*
kafka*
ores*
osm-web*
rdb*
graphite*
mwlog*
eventlog*
ganeti* (see comment below about hosted VMs)
restbase*
snapshot*
bast*
deploy*
elastic*
wtp*
kubernetes*
mc*
relforge*
maps*
aqs*
druid*
radon
neodymium
hafnium
cobalt
  • Verify all servers are healthy, monitoring happy (see the verification sketch after this list)
  • Repool depooled servers
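
A hedged sketch of how a moved server could be verified from the asw2-c-eqiad side; the example port/host pair (ge-2/0/2 / kafka-jumbo1004) is taken from the port list earlier in this task.

# Port is up on the new switch
show interfaces ge-2/0/2 terse

# The expected host is the LLDP neighbor on that port
show lldp neighbors interface ge-2/0/2

# Its MAC address is being learned on the new switch
show ethernet-switching table interface ge-2/0/2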

At a later date

  • Move remaining servers [Chris]
lvs1001.wikimedia.org
lvs1002.wikimedia.org
labstore1005.eqiad.wmnet
labstore1004.eqiad.wmnet
labmon1001.eqiad.wmnet
labcontrol1002.wikimedia.org
labsdb1011.eqiad.wmnet
labsdb1010.eqiad.wmnet
labcontrol1001.wikimedia.org
labstore1002.eqiad.wmnet
labstore1001.eqiad.wmnet
labsdb1005.eqiad.wmnet
labsdb1004.eqiad.wmnet
labsdb1007.eqiad.wmnet
labsdb1006.eqiad.wmnet

Thursday 26th, 5pm UTC, 1h

  • Failover VRRP master to cr1-eqiad and verify status + traffic shift [Arzhel]
On cr2:
set interfaces ae3 unit 1003 family inet address 208.80.154.67/26 vrrp-group 3 priority 70
set interfaces ae3 unit 1019 family inet address 10.64.32.3/22 vrrp-group 19 priority 70
set interfaces ae3 unit 1022 family inet address 10.64.36.3/24 vrrp-group 22 priority 70
set interfaces ae3 unit 1119 family inet address 10.64.37.3/24 vrrp-group 119 priority 70
set interfaces ae3 unit 1003 family inet6 address 2620:0:861:3:fe00::2/64 vrrp-inet6-group 3 priority 70
set interfaces ae3 unit 1019 family inet6 address 2620:0:861:103:fe00::2/64 vrrp-inet6-group 19 priority 70
set interfaces ae3 unit 1022 family inet6 address 2620:0:861:106:fe00::2/64 vrrp-inet6-group 22 priority 70
set interfaces ae3 unit 1119 family inet6 address 2620:0:861:119:fe00::2/64 vrrp-inet6-group 119 priority 70
On cr1/2:
show vrrp summary -> master/backup
  • Disable cr2-eqiad:ae3 [Arzhel]
  • Move cr2 router uplinks from asw-c to asw2-c (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/46 -> cr2-eqiad:xe-3/0/2
xe-2/0/47 -> cr2-eqiad:xe-3/1/2
xe-7/0/46 -> cr2-eqiad:xe-4/0/2
xe-7/0/47 -> cr2-eqiad:xe-4/1/2
  • Verify connectivity (e.g. with cp1045)
  • Enable cr2-eqiad:ae3 [Arzhel]
  • Move VRRP master back to cr2-eqiad (see the sketch after this list) [Arzhel]
  • Update interfaces descriptions [Arzhel]
  • Verify all servers are healthy, monitoring happy
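
A hedged sketch of the cr2-eqiad side of this window, assuming the VRRP give-back is done by reverting the temporary "priority 70" statements set earlier:

# Before the cable move: take the row C aggregate down
set interfaces ae3 disable
commit

# After the move and the connectivity check: bring it back
delete interfaces ae3 disable
commit

# Then restore cr2's original VRRP priorities on the ae3 units (the reverse of the
# temporary "priority 70" changes above) and confirm mastership moved back
show vrrp summary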

After CP servers are decommed

  • Verify no more traffic on the asw-c<->asw2-c link [Arzhel]
  • Disable the asw-c<->asw2-c link (see the sketch after this list) [Arzhel]
  • Cleanup config, monitoring, DNS, etc.
  • Wipe & unrack asw-c
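
A hedged sketch of the traffic check and shutdown on the asw2-c side, using the cross-link ports listed earlier in this task (xe-2/0/43 and xe-7/0/43):

# Watch the cross-links until input/output rates are effectively zero (interactive)
monitor interface xe-2/0/43
monitor interface xe-7/0/43

# Then shut the cross-links down
set interfaces xe-2/0/43 disable
set interfaces xe-7/0/43 disable
commit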

Event Timeline


31st isn't ideal for me, let's aim for May 29th if that also works for @Cmjohnson.

Note that if we can't agree on a definitive date, we will pick a hard date to move most of the servers, and deal with the "left behind" servers progressively after.
This is less ideal though.

I'm flying on the 29th. If Chase wants to manage these things without me that's fine with me though :)

No I think if that's the deal then let's do more date finding or split it up. I am far more comfortable with both of us able to respond. I talked to @ayounsi and it seems like finding another date to handle cloud things and keeping the existing for others is the plan.

The following servers:

mc1012
mc1011
mc1010
mc1009
mc1008
mc1007

should all be decommissioned by now, and definitely don't need any special care.

From a DB point of view, these servers need special care:

db1061 - s6 primary master. We'd need the least downtime possible. Writes to frwiki, jawiki and ruwiki will fail during the downtime.

Also, most db hosts will need to be depooled (but that can be done for an extended time) due to MediaWiki bugs with timed-out requests (T180918), to avoid the same issues as in T156475.

Ganeti hosts house multiple VMs, which will experience an outage during the recabling. Listing them here:

aluminium.wikimedia.org      
argon.eqiad.wmnet            
bromine.eqiad.wmnet          
d-i-test.eqiad.wmnet         
darmstadtium.eqiad.wmnet     
dbmonitor1001.wikimedia.org  
debmonitor1001.eqiad.wmnet   
etcd1001.eqiad.wmnet         
etcd1006.eqiad.wmnet         
etherpad1001.eqiad.wmnet     
krypton.eqiad.wmnet          
kubestagetcd1001.eqiad.wmnet 
logstash1009.eqiad.wmnet     
mendelevium.eqiad.wmnet      
mwdebug1001.eqiad.wmnet      
mx1001.wikimedia.org         
ping1001.eqiad.wmnet         
poolcounter1001.eqiad.wmnet  
proton1002.eqiad.wmnet       
puppetboard1001.eqiad.wmnet  
puppetdb1001.eqiad.wmnet     
roentgenium.eqiad.wmnet      
sca1003.eqiad.wmnet          
seaborgium.wikimedia.org

The ones for which we would like to avoid even a few seconds of outage, if possible, are probably:

  • aluminium (urldownloader)
  • poolcounter1001 (poolcounter)
  • puppetdb1001 (puppetdb)

The first 2 are easy (we can depool both and I'll upload a change for it); for puppetdb there is not much we can do, but it will just be a small pain, not outage-inducing.

One interesting question is ping1001. @ayounsi, how would a few secs/mins of downtime impact things?

Yeah, puppetdb1001 will probably just generate some spam on IRC for failing puppet runs, transient.

Regarding neodymium, the only thing that needs to be done is to remind people, a few days before, to use sarin instead for long maintenance tasks, in particular for:

  • DB-related maintenance (alter tables, etc.)
  • reimages
  • long running cumin jobs

Change 433014 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool all row C databases (except s6 master)

https://gerrit.wikimedia.org/r/433014

We should be able to logically fail over dbproxy1007, 1008 and 1009 to their hot spares, too.

Change 433015 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/dns@master] mariadb: Failover dbproxy1007,8 and 9 and make them passive

https://gerrit.wikimedia.org/r/433015

With only the above patches, the only special requirement for us is to handle db1061 (s6 master) in its own separate window: provide a realistic downtime window and be prepared for service failover if it extends beyond the pre-agreed window.

I would suggest we do NOT disable/depool anything but the obvious outlier in the databases (we already know that timeouts on the databases would cause a serious outage, because of bugs in MediaWiki).

Let's see how resilient production is to this kind of network partition for a limited amount of time.

Specifically, poolcounter1001 not being reachable should *not* cause an outage, but maybe just some errors in the logs and longer latencies. The only things we really need to check beforehand are the distributed datastores like Cassandra and Elasticsearch: if we didn't do our homework correctly and they're not evenly distributed across rows, it's a problem we need to correct anyway.

Things to watch out for:

  • All lvs primaries for eqiad are in row C
  • row C includes 30 appservers
  • conf1002 is in row C (etcd connections will be interrupted, we know it can cause issues).

Change 433015 merged by Jcrespo:
[operations/dns@master] mariadb: Failover dbproxy1007,8 and 9 and make them passive

https://gerrit.wikimedia.org/r/433015

WRT the ms-fe servers (1007 and 1008), please move them to asw2 and reallocate them so they end up in two different physical racks.

Ditto for the ms-be machines: we'll move them one at a time to asw2 and spread them out across different physical racks as much as space allows.

Mentioned in SAL (#wikimedia-operations) [2018-05-16T14:49:50Z] <godog> pool ms-fe1007 for asw2 move - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T14:54:46Z] <godog> depool ms-fe1008 for asw2 move - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:10:01Z] <godog> pool ms-fe1008 for asw2 move - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:18:47Z] <godog> move ms-be1024 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:36:15Z] <godog> move ms-be1025 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:49:12Z] <godog> move ms-be1034 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:03:27Z] <godog> move ms-be1035 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:18:02Z] <godog> move ms-be1036 for asw2-c-eqiad - T187962

m1 master is on C, so the following services may go down:

  • bacula
  • etherpadlite
  • librenms
  • puppet
  • racktables
  • rt

Change 435755 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Enable read-only for s6

https://gerrit.wikimedia.org/r/435755

Change 435756 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1093 to master

https://gerrit.wikimedia.org/r/435756

Change 435757 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1061: Upgrade socket location

https://gerrit.wikimedia.org/r/435757

Change 435760 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1093 to master

https://gerrit.wikimedia.org/r/435760

Change 435967 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] url-downloader: Point to actinium

https://gerrit.wikimedia.org/r/435967

Change 435968 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/mediawiki-config@master] Depool poolcounter1001

https://gerrit.wikimedia.org/r/435968

Change 435967 merged by Alexandros Kosiaris:
[operations/dns@master] url-downloader: Point to actinium

https://gerrit.wikimedia.org/r/435967

m1 master is on C, so the following services may go down:

  • etherpadlite

Per @jcrespo's comment, etherpad.wikimedia.org WILL BE unavailable for periods of time during this work. Don't rely on it for coordination.

Change 435972 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] icinga: Populate additional hostgroups based on LLDP

https://gerrit.wikimedia.org/r/435972

Change 435972 merged by Alexandros Kosiaris:
[operations/puppet@production] icinga: Populate additional hostgroups based on LLDP

https://gerrit.wikimedia.org/r/435972

Change 433014 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool all row C databases (except s6 master)

https://gerrit.wikimedia.org/r/433014

Mentioned in SAL (#wikimedia-operations) [2018-05-29T09:07:30Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool all databases in row C - T187962 (duration: 01m 35s)

Mentioned in SAL (#wikimedia-operations) [2018-05-29T09:16:38Z] <XioNoX> disable ping1001 redirect - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-29T10:55:08Z] <XioNoX> Eqiad row C server move starting - T187962

Change 435755 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Enable read-only for s6

https://gerrit.wikimedia.org/r/435755

Mentioned in SAL (#wikimedia-operations) [2018-05-29T10:59:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Enable read only on s6 T194939 T187962 (duration: 01m 35s)

Change 435757 merged by Marostegui:
[operations/puppet@production] db1061: Upgrade socket location

https://gerrit.wikimedia.org/r/435757

Mentioned in SAL (#wikimedia-operations) [2018-05-29T11:08:39Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Disable read only on s6 T194939 T187962 (duration: 01m 37s)

s6 primary master maintenance completed:

Read only lasted from:

10:59:59 to 11:08:39 (times are in UTC)

Change 435760 abandoned by Marostegui:
db-eqiad.php: Promote db1093 to master

Reason:
Not needed

https://gerrit.wikimedia.org/r/435760

The only things we really need to check beforehand are the distributed datastores like Cassandra and Elasticsearch: if we didn't do our homework correctly and they're not evenly distributed across rows, it's a problem we need to correct anyway.

I'm late to the conversation... Elastic should not be a major issue: we should be able to lose the full row and not lose any data. With 1/4 of the capacity, we're more than likely to see rising latency though. Elasticsearch will detect even a very short network interruption and start moving shards around. We might have alerts if the number of unassigned shards climbs above our threshold, but the cluster should stay yellow.

First and main round of server move done. Went well overall, thanks to everybody who chipped in.

Some notes:

  • A faulty SFP-T for ganeti1004 caused a longer outage for the VMs hosted on that host
  • Most servers didn't alert during the move
  • s6 master move was smooth (8 min read-only time, including DB maintenance)
  • lvs1001 and lvs1002 were overlapping with the switch uplink ports; this was noticed in time and will be tackled later on
  • db1108's switch port was misconfigured, but this didn't cause much of an outage before it was fixed

Will sync up with the cloud team and find a good time to move their boxes. Task description updated with next steps.

8 min read-only time, including DB maintenance

Yes, it was only that long because we had scheduled a restart to reuse the read-only window for other important pending maintenance; otherwise it would have been much shorter. Thank you.

Mentioned in SAL (#wikimedia-operations) [2018-05-29T13:56:44Z] <XioNoX> rolling back ns0 and ping1001 redirects - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-29T14:19:46Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool all databases in row C - T187962 (duration: 01m 19s)

Change 435756 abandoned by Marostegui:
mariadb: Promote db1093 to master

Reason:
This is not needed anymore

https://gerrit.wikimedia.org/r/435756

Change 435968 abandoned by Alexandros Kosiaris:
Depool poolcounter1001

Reason:
Turns out we did not really need it after all. The sites survived the downtime

https://gerrit.wikimedia.org/r/435968

@Cmjohnson:
Please move lvs1001 from asw-c-eqiad:ge-2/0/45 to asw2-c-eqiad:ge-2/0/27

and (after I'm done de-pooling the host) lvs1002 from asw-c-eqiad:ge-2/0/46 to asw2-c-eqiad:ge-2/0/28

We met today to sync up on moving the remaining lab* servers. Hopefully these days/times all work for @Cmjohnson (I added him to the calendar invites to confirm).

Back on the 26th?  Rack moves, no new IP, recabling.

Chase & Brooke
28th? thursday - make sure chris is around

labstore1004.eqiad.wmnet -- very sensitive to any outage unf
labstore1005.eqiad.wmnet (can fail over from 1004)
labstore1002.eqiad.wmnet (mostly idle, safe to move)
labstore1001.eqiad.wmnet (mostly idle)
Andrew & Brooke
29th? - confirm w/ chris
labmon1001.eqiad.wmnet (needs silencing before move)
labcontrol1001.wikimedia.org (needs some silencing/disabling before the move, but don't bother with failover because it'll be quick)
labcontrol1002.wikimedia.org (assuming 1001 comes up fine from its move, moving this doesn't cause outage.  Mustn't happen until 1001 is stable though)


Brooke & Andrew(?) & Manuel/Jaime?
?? July 10th and 11th ? (tuesday - wed?)

labsdb1011.eqiad.wmnet (depool first -- no real impact, but needs slave stop)
labsdb1010.eqiad.wmnet (depool first -- no real impact, but needs slave stop)
labsdb1005.eqiad.wmnet (Some tables don't fail over (toolsdb does) -- requires DBAs)
labsdb1004.eqiad.wmnet (standby for toolsdb and SPoF for postgres/wikilabels so must be coordinated with them)
labsdb1007.eqiad.wmnet (postgres?) maps?
labsdb1006.eqiad.wmnet (postgres?) maps?

To clarify: databases that are depooled do not need to stop replication; if replication goes down, it retries connecting indefinitely with no possibility of corruption. What would be nice is not to do both 1011 and 1010 at the same time, because that is 2/3 of our total capacity (and 100% of the analytics nodes).

The dates work for me..I accepted the calendar invites

Change 442870 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] WIP labstore: switch labstore1005 to primary in pair

https://gerrit.wikimedia.org/r/442870

Change 442870 merged by Bstorm:
[operations/puppet@production] WIP labstore: switch labstore1005 to primary in pair

https://gerrit.wikimedia.org/r/442870

ayounsi updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-07-26T17:01:35Z] <XioNoX> moving row C vrrp master to cr1 - T187962

Mentioned in SAL (#wikimedia-operations) [2018-07-26T17:58:47Z] <XioNoX> moving row C vrrp master back to cr2 - T187962

ayounsi claimed this task.

All done here.
Opened T208734 for the decommissioning part.