
Rack/cable/configure asw2-c-eqiad switch stack
Closed, Resolved (Public)

Description

Similar to T148506.

This is about row C only:

  • Rack and cable the switches according to diagram (blocked on T187118) [Chris]
    rows-abc-eqiad-cabling.png (attached cabling diagram)
  • Connect mgmt/serial [Chris]
  • Check via serial that switches work, ports are configured as down [Arzhel]
  • Stack the switches, upgrade JunOS, initial switch configuration (see the verification sketch after this list) [Arzhel]
  • Add to DNS [Arzhel]
  • Add to LibreNMS & Rancid [Arzhel]
  • Uplinks ports configured [Arzhel]
  • Add to Icinga [Arzhel]
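
A minimal verification sketch for the stack/upgrade step above, assuming the usual Juniper Virtual Chassis workflow; the install package path is a placeholder, not taken from this task:

# Check Virtual Chassis membership and the running JunOS release
show virtual-chassis status
show version

# Confirm server-facing ports are administratively down for now
show interfaces terse | match "ge-|xe-"

# Upgrade JunOS (package name/path is a placeholder)
request system software add /var/tmp/junos-install-package.tgz reboot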

Thursday 22nd, noon Eastern (4pm UTC), 3h (for all 3 rows)

  • Disable interfaces from cr1-eqiad to asw-c
  • Move cr1 router uplinks from asw-c to asw2-c (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/44 -> cr1-eqiad:xe-3/0/2
xe-2/0/45 -> cr1-eqiad:xe-3/1/2
xe-7/0/44 -> cr1-eqiad:xe-4/0/2
xe-7/0/45 -> cr1-eqiad:xe-4/1/2
  • Connect asw2-c with asw-c with 2x10G (and document cable IDs if different) [Chris]
xe-2/0/43 -> asw-c-eqiad:xe-1/1/0
xe-7/0/43 -> asw-c-eqiad:xe-7/0/0
  • Verify traffic is properly flowing through asw2-c
  • Update interface descriptions on cr1 (see the hedged example after this list)
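
A hedged example of what the disable / re-describe steps could look like on cr1-eqiad, assuming the cr1-side ports keep the numbers listed above (xe-3/0/2, xe-3/1/2, xe-4/0/2, xe-4/1/2); if these are members of an aggregated-ethernet bundle, the bundle would be handled instead. The description string mirrors the mapping above.

# Before the physical move: administratively disable the asw-c facing links
set interfaces xe-3/0/2 disable
set interfaces xe-3/1/2 disable
set interfaces xe-4/0/2 disable
set interfaces xe-4/1/2 disable
commit

# After the move: re-enable and point the description at asw2-c (one port shown)
delete interfaces xe-3/0/2 disable
set interfaces xe-3/0/2 description "asw2-c-eqiad:xe-2/0/44"
commit

# Sanity check
show interfaces descriptions | match asw2-c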

___

  • Configure switch ports to match asw-c (+ login announcement; see the hedged config sketch after this checklist) [Arzhel]
  • Solve snowflakes [Chris/Arzhel]
  • Pre-populate FPC2, FPC4 and FPC7 (QFX) with copper SFPs matching the current production servers in racks 2, 4 and 7 [Chris]
ge-2/0/2                   kafka-jumbo1004
ge-2/0/3                   db1108
ge-2/0/4                   analytics1074
ge-2/0/6                   labstore1004 eth0
ge-2/0/7                   db1087
ge-2/0/8                   db1088
ge-2/0/9                   db1100
ge-2/0/10                  analytics1065 - no-bw-mon
ge-2/0/11                  analytics1066 - no-bw-mon
ge-2/0/12                  db1101
ge-2/0/13                  db1055
ge-2/0/15                  labstore1001
ge-2/0/17                  db1059
ge-2/0/18                  db1060
ge-2/0/19                  analytics1028 - no-bw-mon
ge-2/0/20                  analytics1029 - no-bw-mon
ge-2/0/21                  analytics1030 - no-bw-mon
ge-2/0/22                  analytics1031 - no-bw-mon
ge-2/0/23                  es1015
ge-2/0/24                  es1016
ge-2/0/26                  analytics1064
ge-2/0/45                  lvs1001:eth3
ge-2/0/46                  lvs1002:eth3
ge-2/0/47                  lvs1003:eth3
ge-4/0/0                   eventlog1001
ge-4/0/1                   mwlog1001
ge-4/0/2                   logstash1002
ge-4/0/3                   rdb1007
ge-4/0/4                   hafnium
ge-4/0/6                   graphite1001
ge-4/0/8                   neodymium
ge-4/0/9                   rdb1001
ge-4/0/13                  ores1006
ge-4/0/16                  analytics1001 - no-bw-mon
ge-4/0/17                  cobalt
ge-4/0/18                  analytics1003 - no-bw-mon
ge-4/0/22                  ganeti1001
ge-4/0/23                  ganeti1002
ge-4/0/25                  radon
ge-4/0/26                  labsdb1006
ge-4/0/27                  labsdb1007
ge-4/0/28                  restbase1012
ge-4/0/29                  restbase1013
ge-4/0/30                  snapshot1006
ge-4/0/31                  deploy1001
ge-4/0/32                  bast1002
ge-4/0/33                  labsdb1010
ge-4/0/34                  elastic1029  
ge-4/0/36                  elastic1022
ge-4/0/37                  kafka-jumbo1005
ge-7/0/0                   dbproxy1007
ge-7/0/1                   analytics1014 - no-bw-mon
ge-7/0/2                   elastic1051
ge-7/0/3                   dbproxy1008
ge-7/0/4                   dbproxy1009
ge-7/0/5                   analytics1075
ge-7/0/6                   notebook1001
ge-7/0/7                   ganeti1003
ge-7/0/8                   ganeti1004
ge-7/0/9                   ocg1001
ge-7/0/11                  analytics1022 - no-bw-mon
ge-7/0/14                  conf1002
ge-7/0/15                  notebook1004
ge-7/0/17                  pc1005
ge-7/0/18                  rdb1002
ge-7/0/22                  terbium
ge-7/0/26                  labcontrol1002
ge-7/0/31                  scb1003
ge-7/0/32                  polonium
ge-7/0/33                  francium
ge-7/0/34                  lithium
ge-7/0/35                  elastic1052
ge-7/0/36                  wtp1040
ge-7/0/37                  wtp1041
ge-7/0/38                  wtp1042
  • Move 10G servers from C8 to C2/4/7 [Filippo/Chris/Arzhel]
ms-be1036
ms-be1035
ms-be1034
ms-be1025
ms-be1024
ms-fe1008
ms-fe1007
  • "Regarding neodymium the only thing that needs to be done is to remember people to use sarin instead for long maintenance tasks"
  • Depool aluminium and poolcounter1001 [akosiaris]
  • Disable ping offload (ping1001) [Arzhel]
  • Redirect ns0 to baham [Arzhel]
  • Announce the read-only time on-wiki to users of frwiki, ruwiki and jawiki at least 1 week in advance - T194939 [Arzhel/Community-liaisons]
  • Depool DB hosts [jcrespo]
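
A hedged sketch of the port configuration and login announcement mentioned above, for a single example port taken from the list (ge-2/0/2 / kafka-jumbo1004). The VLAN name and announcement text are assumptions, not taken from this task; the real config should simply mirror what asw-c has for each port.

# Example access port, mirroring its asw-c counterpart (VLAN name is a placeholder)
set interfaces ge-2/0/2 description "kafka-jumbo1004"
set interfaces ge-2/0/2 unit 0 family ethernet-switching interface-mode access
set interfaces ge-2/0/2 unit 0 family ethernet-switching vlan members private1-c-eqiad

# Login announcement shown at login (text is an example only)
set system login announcement "asw2-c-eqiad - row C migration in progress, see T187962"
commit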

In maintenance window Tuesday May 29th

  • Downtime asw-c hosts in Icinga [Arzhel]
  • Move servers' uplinks from asw-c to asw2-c, round 1 [Chris]
Leave behind: labs*, cp*
es*
db*
analytics*
kafka*
ores*
osm-web*
rdb*
graphite*
mwlog*
eventlog*
ganeti* (see comment below about hosted VMs)
restbase*
snapshot*
bast*
deploy*
elastic*
wtp*
kubernetes*
mc*
relforge*
maps*
aqs*
druid*
radon
neodymium
hafnium
cobalt
  • Verify all servers are healthy, monitoring happy (see the verification sketch after this list)
  • Repool depooled servers
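
A hedged sketch of how a moved server could be verified from the asw2-c-eqiad side; the example port/host pair (ge-2/0/2 / kafka-jumbo1004) is taken from the port list earlier in this task.

# Port is up on the new switch
show interfaces ge-2/0/2 terse

# The expected host is the LLDP neighbor on that port
show lldp neighbors interface ge-2/0/2

# Its MAC address is being learned on the new switch
show ethernet-switching table interface ge-2/0/2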

At a later date

  • Move remaining servers [Chris]
lvs1001.wikimedia.org
lvs1002.wikimedia.org
labstore1005.eqiad.wmnet
labstore1004.eqiad.wmnet
labmon1001.eqiad.wmnet
labcontrol1002.wikimedia.org
labsdb1011.eqiad.wmnet
labsdb1010.eqiad.wmnet
labcontrol1001.wikimedia.org
labstore1002.eqiad.wmnet
labstore1001.eqiad.wmnet
labsdb1005.eqiad.wmnet
labsdb1004.eqiad.wmnet
labsdb1007.eqiad.wmnet
labsdb1006.eqiad.wmnet

Thursday 26th, 5pm UTC, 1h

  • Failover VRRP master to cr1-eqiad and verify status + traffic shift [Arzhel]
On cr2:
set interfaces ae3 unit 1003 family inet address 208.80.154.67/26 vrrp-group 3 priority 70
set interfaces ae3 unit 1019 family inet address 10.64.32.3/22 vrrp-group 19 priority 70
set interfaces ae3 unit 1022 family inet address 10.64.36.3/24 vrrp-group 22 priority 70
set interfaces ae3 unit 1119 family inet address 10.64.37.3/24 vrrp-group 119 priority 70
set interfaces ae3 unit 1003 family inet6 address 2620:0:861:3:fe00::2/64 vrrp-inet6-group 3 priority 70
set interfaces ae3 unit 1019 family inet6 address 2620:0:861:103:fe00::2/64 vrrp-inet6-group 19 priority 70
set interfaces ae3 unit 1022 family inet6 address 2620:0:861:106:fe00::2/64 vrrp-inet6-group 22 priority 70
set interfaces ae3 unit 1119 family inet6 address 2620:0:861:119:fe00::2/64 vrrp-inet6-group 119 priority 70
On cr1/2:
show vrrp summary -> master/backup
  • Disable cr2-eqiad:ae3 [Arzhel]
  • Move cr2 router uplinks from asw-c to asw2-c (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/46 -> cr2-eqiad:xe-3/0/2
xe-2/0/47 -> cr2-eqiad:xe-3/1/2
xe-7/0/46 -> cr2-eqiad:xe-4/0/2
xe-7/0/47 -> cr2-eqiad:xe-4/1/2
  • Verify connectivity (e.g. with cp1045)
  • Enable cr2-eqiad:ae3 [Arzhel]
  • Move VRRP master back to cr2-eqiad (see the sketch after this list) [Arzhel]
  • Update interfaces descriptions [Arzhel]
  • Verify all servers are healthy, monitoring happy
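
A hedged sketch of the cr2-eqiad side of this window, assuming the VRRP give-back is done by reverting the temporary "priority 70" statements set earlier:

# Before the cable move: take the row C aggregate down
set interfaces ae3 disable
commit

# After the move and the connectivity check: bring it back
delete interfaces ae3 disable
commit

# Then restore cr2's original VRRP priorities on the ae3 units (the reverse of the
# temporary "priority 70" changes above) and confirm mastership moved back
show vrrp summary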

After CP servers are decommed

  • Verify no more traffic on the asw-c<->asw2-c link [Arzhel]
  • Disable the asw-c<->asw2-c link (see the sketch after this list) [Arzhel]
  • Cleanup config, monitoring, DNS, etc.
  • Wipe & unrack asw-c
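
A hedged sketch of the traffic check and shutdown on the asw2-c side, using the cross-link ports listed earlier in this task (xe-2/0/43 and xe-7/0/43):

# Watch the cross-links until input/output rates are effectively zero (interactive)
monitor interface xe-2/0/43
monitor interface xe-7/0/43

# Then shut the cross-links down
set interfaces xe-2/0/43 disable
set interfaces xe-7/0/43 disable
commit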

Event Timeline


31st isn't ideal for me, let's aim for May 29th if that also works for @Cmjohnson.

Note that if we can't agree on a definitive date, we will pick a hard date to move most of the servers, and deal with the "left behind" servers progressively after.
This is less ideal though.

I'm flying on the 29th. If Chase wants to manage these things without me that's fine with me though :)

No I think if that's the deal then let's do more date finding or split it up. I am far more comfortable with both of us able to respond. I talked to @ayounsi and it seems like finding another date to handle cloud things and keeping the existing for others is the plan.

The following servers:

mc1012
mc1011
mc1010
mc1009
mc1008
mc1007

should all be decommissioned by now, and definitely don't need any special care.

From a DB point of view, these servers need special care:

db1061 - s6 primary master. We'd need the least downtime possible. Writes to frwiki, jawiki and ruwiki will fail during the downtime.

Also, most db hosts will need to be depooled (but that can be done for an extended time) due to MediaWiki bugs with timed-out requests (T180918), to avoid the same issues as in T156475.

Ganeti hosts house multiple VMs, which will experience an outage during the recabling. Listing them here:

aluminium.wikimedia.org      
argon.eqiad.wmnet            
bromine.eqiad.wmnet          
d-i-test.eqiad.wmnet         
darmstadtium.eqiad.wmnet     
dbmonitor1001.wikimedia.org  
debmonitor1001.eqiad.wmnet   
etcd1001.eqiad.wmnet         
etcd1006.eqiad.wmnet         
etherpad1001.eqiad.wmnet     
krypton.eqiad.wmnet          
kubestagetcd1001.eqiad.wmnet 
logstash1009.eqiad.wmnet     
mendelevium.eqiad.wmnet      
mwdebug1001.eqiad.wmnet      
mx1001.wikimedia.org         
ping1001.eqiad.wmnet         
poolcounter1001.eqiad.wmnet  
proton1002.eqiad.wmnet       
puppetboard1001.eqiad.wmnet  
puppetdb1001.eqiad.wmnet     
roentgenium.eqiad.wmnet      
sca1003.eqiad.wmnet          
seaborgium.wikimedia.org

The ones for which we would like to avoid even a few seconds of outage, if possible, are probably:

  • aluminium (urldownloader)
  • poolcounter1001 (poolcounter)
  • puppetdb1001 (puppetdb)

The first 2 are easy (we can depool both and I'll upload a change for it); for puppetdb there is not much we can do, but it will just be a small pain, not outage-inducing.

One interesting question is ping1001. @ayounsi, how would a few secs/mins of downtime impact things?

Yeah, puppetdb1001 will probably just generate some spam on IRC for failing puppet runs, transient.

Regarding neodymium, the only thing that needs to be done is to remind people, a few days before, to use sarin instead for long maintenance tasks, in particular for:

  • DB-related maintenance (alter tables, etc.)
  • reimages
  • long running cumin jobs

Change 433014 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool all row C databases (except s6 master)

https://gerrit.wikimedia.org/r/433014

We should be able to logically fail over dbproxy1007, 1008 and 1009 to their hot spares, too.

Change 433015 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/dns@master] mariadb: Failover dbproxy1007,8 and 9 and make them passive

https://gerrit.wikimedia.org/r/433015

With only the above patches, the only special requirement for us is to handle db1061 (s6 master) in its own separate window: provide a realistic downtime window and be prepared for service failover if it extends beyond the pre-agreed window.

I would suggest we do NOT disable/depool anything but the obvious outlier in the databases (we already know that timeouts on the databases would cause a serious outage, because of bugs in MediaWiki).

Let's see how resilient production is to this kind of network partition for a limited amount of time.

Specifically, poolcounter1001 not being reachable should *not* cause an outage, but maybe just some errors in the logs and longer latencies. The only things we really need to check beforehand are the distributed datastores like Cassandra and Elasticsearch: if we didn't do our homework correctly and they're not evenly distributed across rows, it's a problem we need to correct anyway.

Things to watch out for:

  • All lvs primaries for eqiad are in row C
  • row C includes 30 appservers
  • conf1002 is in row C (etcd connections will be interrupted, we know it can cause issues).

Change 433015 merged by Jcrespo:
[operations/dns@master] mariadb: Failover dbproxy1007,8 and 9 and make them passive

https://gerrit.wikimedia.org/r/433015

WRT the ms-fe servers (1007 and 1008), please move them to asw2 and reallocate them so they end up in two different physical racks.

Ditto for the ms-be machines: we'll move them one at a time to asw2 and spread them out across different physical racks as much as space allows.

Mentioned in SAL (#wikimedia-operations) [2018-05-16T14:49:50Z] <godog> pool ms-fe1007 for asw2 move - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T14:54:46Z] <godog> depool ms-fe1008 for asw2 move - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:10:01Z] <godog> pool ms-fe1008 for asw2 move - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:18:47Z] <godog> move ms-be1024 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:36:15Z] <godog> move ms-be1025 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T15:49:12Z] <godog> move ms-be1034 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:03:27Z] <godog> move ms-be1035 for asw2-c-eqiad - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-16T16:18:02Z] <godog> move ms-be1036 for asw2-c-eqiad - T187962

m1 master is on C, so the following services may go down:

  • bacula
  • etherpadlite
  • librenms
  • puppet
  • racktables
  • rt

Change 435755 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Enable read-only for s6

https://gerrit.wikimedia.org/r/435755

Change 435756 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1093 to master

https://gerrit.wikimedia.org/r/435756

Change 435757 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1061: Upgrade socket location

https://gerrit.wikimedia.org/r/435757

Change 435760 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1093 to master

https://gerrit.wikimedia.org/r/435760

Change 435967 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] url-downloader: Point to actinium

https://gerrit.wikimedia.org/r/435967

Change 435968 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/mediawiki-config@master] Depool poolcounter1001

https://gerrit.wikimedia.org/r/435968

Change 435967 merged by Alexandros Kosiaris:
[operations/dns@master] url-downloader: Point to actinium

https://gerrit.wikimedia.org/r/435967

m1 master is on C, so the following services may go down:

  • etherpadlite

Per @jcrespo's comment, etherpad.wikimedia.org WILL BE unavailable for periods of time during this work. Don't rely on it for coordination.

Change 435972 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] icinga: Populate additional hostgroups based on LLDP

https://gerrit.wikimedia.org/r/435972

Change 435972 merged by Alexandros Kosiaris:
[operations/puppet@production] icinga: Populate additional hostgroups based on LLDP

https://gerrit.wikimedia.org/r/435972

Change 433014 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool all row C databases (except s6 master)

https://gerrit.wikimedia.org/r/433014

Mentioned in SAL (#wikimedia-operations) [2018-05-29T09:07:30Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool all databases in row C - T187962 (duration: 01m 35s)

Mentioned in SAL (#wikimedia-operations) [2018-05-29T09:16:38Z] <XioNoX> disable ping1001 redirect - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-29T10:55:08Z] <XioNoX> Eqiad row C server move starting - T187962

Change 435755 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Enable read-only for s6

https://gerrit.wikimedia.org/r/435755

Mentioned in SAL (#wikimedia-operations) [2018-05-29T10:59:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Enable read only on s6 T194939 T187962 (duration: 01m 35s)

Change 435757 merged by Marostegui:
[operations/puppet@production] db1061: Upgrade socket location

https://gerrit.wikimedia.org/r/435757

Mentioned in SAL (#wikimedia-operations) [2018-05-29T11:08:39Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Disable read only on s6 T194939 T187962 (duration: 01m 37s)

s6 primary master maintenance completed:

Read only lasted from:

10:59:59 to 11:08:39 (times are in UTC)

Change 435760 abandoned by Marostegui:
db-eqiad.php: Promote db1093 to master

Reason:
Not needed

https://gerrit.wikimedia.org/r/435760

The only things we really need to check beforehand are the distributed datastores like Cassandra and Elasticsearch: if we didn't do our homework correctly and they're not evenly distributed across rows, it's a problem we need to correct anyway.

I'm late to the conversation... Elastic should not be a major issue: we should be able to lose the full row and not lose any data. With 1/4 of the capacity, we're more than likely to see rising latency though. Elasticsearch will detect even a very short network interruption and start moving shards around. We might have alerts if the number of unassigned shards climbs above our threshold, but the cluster should stay yellow.

First and main round of server move done. Went well overall, thanks to everybody who chipped in.

Some notes:

  • A faulty SFP-T for ganeti1004 caused a longer outage for the VMs hosted on that host
  • Most servers didn't alert during the move
  • s6 master move was smooth (8 min read-only time, including DB maintenance)
  • lvs1001 and lvs1002 were overlapping with the switch uplink ports; this was noticed in time and will be tackled later on
  • db1108's switch port was misconfigured, but this didn't cause much of an outage before it was fixed

Will sync up with the cloud team and find a good time to move their boxes. Task description updated with next steps.

8 min read-only time, including DB maintenance

Yes, it was only that long because we had scheduled a restart to reuse the read-only window for other important pending maintenance; otherwise it would have been much shorter. Thank you.

Mentioned in SAL (#wikimedia-operations) [2018-05-29T13:56:44Z] <XioNoX> rolling back ns0 and ping1001 redirects - T187962

Mentioned in SAL (#wikimedia-operations) [2018-05-29T14:19:46Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool all databases in row C - T187962 (duration: 01m 19s)

Change 435756 abandoned by Marostegui:
mariadb: Promote db1093 to master

Reason:
This is not needed anymore

https://gerrit.wikimedia.org/r/435756

Change 435968 abandoned by Alexandros Kosiaris:
Depool poolcounter1001

Reason:
Turns out we did not really need it after all. The sites survived the downtime

https://gerrit.wikimedia.org/r/435968

@Cmjohnson:
Please move lvs1001 from asw-c-eqiad:ge-2/0/45 to asw2-c-eqiad:ge-2/0/27

and (after I'm done de-pooling the host) lvs1002 from asw-c-eqiad:ge-2/0/46 to asw2-c-eqiad:ge-2/0/28

We met today to sync up on moving the remaining lab* servers. Hopefully these days/times all work for @Cmjohnson (I added him to the calendar invites to confirm).

Back on the 26th?  Rack moves, no new IP, recabling.

Chase & Brooke
28th? thursday - make sure chris is around

labstore1004.eqiad.wmnet -- very sensitive to any outage unf
labstore1005.eqiad.wmnet (can fail over from 1004)
labstore1002.eqiad.wmnet (mostly idle, safe to move)
labstore1001.eqiad.wmnet (mostly idle)
Andrew & Brooke
29th? - confirm w/ chris
labmon1001.eqiad.wmnet (needs silencing before move)
labcontrol1001.wikimedia.org (needs some silencing/disabling before the move, but don't bother with failover because it'll be quick)
labcontrol1002.wikimedia.org (assuming 1001 comes up fine from its move, moving this doesn't cause outage.  Mustn't happen until 1001 is stable though)


Brooke & Andrew(?) & Manuel/Jaime?
?? July 10th and 11th ? (tuesday - wed?)

labsdb1011.eqiad.wmnet (depool first -- no real impact, but needs slave stop)
labsdb1010.eqiad.wmnet (depool first -- no real impact, but needs slave stop)
labsdb1005.eqiad.wmnet (Some tables don't fail over (toolsdb does) -- requires DBAs)
labsdb1004.eqiad.wmnet (standby for toolsdb and SPoF for postgres/wikilabels so must be coordinated with them)
labsdb1007.eqiad.wmnet (postgres?) maps?
labsdb1006.eqiad.wmnet (postgres?) maps?

To clarify: databases that are depooled do not need to stop replication; if replication goes down, it retries connecting indefinitely with no possibility of corruption. What would be nice is not to do both 1011 and 1010 at the same time, because that is 2/3 of our total capacity (and 100% of the analytics nodes).

The dates work for me..I accepted the calendar invites

Change 442870 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] WIP labstore: switch labstore1005 to primary in pair

https://gerrit.wikimedia.org/r/442870

Change 442870 merged by Bstorm:
[operations/puppet@production] WIP labstore: switch labstore1005 to primary in pair

https://gerrit.wikimedia.org/r/442870

ayounsi updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-07-26T17:01:35Z] <XioNoX> moving row C vrrp master to cr1 - T187962

Mentioned in SAL (#wikimedia-operations) [2018-07-26T17:58:47Z] <XioNoX> moving row C vrrp master back to cr2 - T187962

ayounsi claimed this task.

All done here.
Opened T208734 for the decommissioning part.