
Network maintenance on row D (databases)
Closed, Resolved (Public)

Description

I have been talking to @ayounsi about the network maintenance that needs to be carried out on row D of eqiad (T148506).

There are two things that are probably going to happen while eqiad is on standby (although this still needs to be fully confirmed).

Servers in racks D2 and D8 need to be moved somewhere else, so they need to be powered off.
Rack D8 has no DBs, but D2 has the following db servers:

db1094
db1093
db1092
db1091
es1019

None of the above are masters, so they can be depooled and moved when needed.

The second task is to recable all the servers in row D, meaning that they will lose network connectivity for a while.
The database servers involved are the following (apart from the ones mentioned above):

es1018	
es1017
db1071
db1070	
db1069 (sanitarium)
db1068
db1067	
db1066
db1065
db1064 (sanitarium2 master)
dbstore1002
dbstore1001 (backups server)
db1063
db1062
db1061

The only critical servers are db1069 (sanitarium) and dbstore1001, and both are fine if they lose connectivity for a few minutes.
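The two groups of hosts described above can be summarised in a small sketch. The host names are copied from the lists in this description; the variable and function names are illustrative assumptions and do not correspond to any existing tooling.

```python
# Hosts in rack D2: they must be depooled, powered off and physically moved.
RACK_MOVE = ["db1094", "db1093", "db1092", "db1091", "es1019"]

# Remaining row D database hosts: recabled in place, so they lose
# network connectivity for a while but are not powered off.
RECABLE_ONLY = [
    "es1018", "es1017", "db1071", "db1070", "db1069", "db1068",
    "db1067", "db1066", "db1065", "db1064", "dbstore1002",
    "dbstore1001", "db1063", "db1062", "db1061",
]

# Hosts called out as critical in the description.
CRITICAL = {"db1069", "dbstore1001"}


def plan(host: str) -> str:
    """Return the action this maintenance implies for a given host."""
    if host in RACK_MOVE:
        return "depool, power off, move"
    if host in RECABLE_ONLY:
        return "expect brief connectivity loss"
    return "not affected"


for name in ("db1092", "db1069", "dbstore1001"):
    note = " (critical)" if name in CRITICAL else ""
    print(f"{name}: {plan(name)}{note}")
```

Note that neither critical host is in the rack-move group, which matches the statement that they only need to tolerate a short loss of connectivity.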

Details

Related Gerrit Patches:
[operations/mediawiki-config@master] db-eqiad.php: Repool hosts out for net maintenance
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts that need to be moved

Event Timeline

ayounsi renamed this task from Network maintenance on row D to Network maintenance on row D (databases). Apr 11 2017, 12:29 PM

We want to do some master switchovers while eqiad is on standby, so we'd need to coordinate that too: T162133

@ayounsi have you thought about when you want to do this? We are trying to organize ourselves around the days eqiad is going to be on standby.

Scheduled date is the 26th (T148506#3171998). I have drafted the communication to be sent to ops.

Cool, I will talk to Jaime tomorrow in our weekly meeting and we will try to see how to fit our stuff before/after it.
I will keep you posted - thanks!!

db1068 has been promoted to master on s4 (we had to do it somewhat unexpectedly today). It is affected by the recabling but not by the server move, so that is good.

Change 350372 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts that need to be moved

https://gerrit.wikimedia.org/r/350372

Change 350372 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts that need to be moved

https://gerrit.wikimedia.org/r/350372

Mentioned in SAL (#wikimedia-operations) [2017-04-26T08:27:45Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Depool hosts that need to be moved for the network maintenance - T162681 (duration: 02m 25s)

I have downtimed these hosts for 24 hours:

db1094
db1093
db1092
db1091
es1019

Mentioned in SAL (#wikimedia-operations) [2017-04-26T09:16:49Z] <marostegui> Shutdown es1019 for maintenance - T162681

Mentioned in SAL (#wikimedia-operations) [2017-04-26T09:24:31Z] <marostegui> Shutdown db1094, db1093, db1091 for maintenance - T162681

The following hosts are down and ready to be moved anytime (@ayounsi):

es1019
db1094
db1093
db1091

db1092 is still pending: it is going to be involved in a master switchover (T162133), so it will be shut down in a bit.

Mentioned in SAL (#wikimedia-operations) [2017-04-26T12:56:31Z] <marostegui> Shutdown db1092 for maintenance - https://phabricator.wikimedia.org/T162681

db1092 is finally down too.

@ayounsi all the hosts are back, is it all done then?

Rack moves to D7 and D8 are done.
Switch port configuration for row D is done.
What remains is to move the servers' uplinks from asw to asw2 in the other D racks.

Is that the operation that implies some small connectivity loss?

Change 350518 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool hosts out for net maintenance

https://gerrit.wikimedia.org/r/350518

Change 350518 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool hosts out for net maintenance

https://gerrit.wikimedia.org/r/350518

Mentioned in SAL (#wikimedia-operations) [2017-04-27T07:16:31Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Repool hosts that needed to be moved for the network maintenance - T162681 (duration: 02m 32s)

So I have been talking to @ayounsi and the servers in row D still need to be recabled, so they will be affected by the small outage.
As per his comment on: T148506#3215394

We are rescheduling it for 4pm UTC tomorrow (27th) and expecting it to last 1h max.
Racks D7 and D8 are all set and will not be impacted.

@jcrespo be aware of this for any long-running ALTER TABLE that might be affected.

I have downtimed the above hosts for 20 hours, plus the slaves of the masters involved, as broken replication will page:

db1095
db1053
db1056
db1059
db1064
db1081
db1084
db1091
db1040
db1026
db1045
db1049
db1070
db1071
db1082
db1087
db1092
db1063
db1028
db1033
db1034
db1039
db1041
db1079
db1086
db1094
db1062
db1022
db1023
db1030
db1037
db1085
db1088
db1093
db1050
db1061
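The reasoning behind the list above (downtime every affected host, plus every slave of an affected master, because broken replication pages) can be sketched as follows. The replication topology is not given in this task, so the mapping below uses placeholder host names; both the mapping and the helper name are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def hosts_to_downtime(
    affected: Iterable[str],
    replicas_of: Dict[str, List[str]],
) -> Set[str]:
    """Return the affected hosts plus the replicas of any affected master."""
    result = set(affected)
    for host in list(result):
        # replicas_of only has entries for masters; other hosts add nothing.
        result.update(replicas_of.get(host, []))
    return result


# Placeholder topology (hypothetical names, not real hosts).
topology = {
    "master-a": ["replica-a1", "replica-a2"],
    "master-b": ["replica-b1"],
}

# master-a is affected, so its replicas are included even though they
# are not affected themselves; replica-b1 is affected on its own.
print(sorted(hosts_to_downtime(["master-a", "replica-b1"], topology)))
```

Using a set means a host that is both directly affected and a replica of an affected master only gets one downtime entry.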
Marostegui moved this task from Next to In progress on the DBA board. Apr 27 2017, 12:30 PM
Marostegui closed this task as Resolved. Apr 28 2017, 5:03 AM
Marostegui claimed this task.

This was all done and nothing else is pending.

Marostegui removed Marostegui as the assignee of this task. Apr 28 2017, 5:03 AM