
DB meta task for next DC failover issues
Closed, Resolved · Public

Description

This is just a meta task to group and track all the DB stuff that might need to happen during the next DC failover.
Not sorted in any way, just grouping them here for now.

Switchover dates:

Mediawiki: Wednesday, September 12th 2018: 14:00 UTC

Switchback:

Mediawiki: Wednesday, October 10th 2018: 14:00 UTC

DB planning: https://wikitech.wikimedia.org/wiki/Switch_Datacenter/planned_db_maintenance#2018_Switch_Datacenter

Related Objects

Event Timeline

Marostegui renamed this task from Meta task for next DC failover issues to DB meta task for next DC failover issues.Mar 16 2018, 7:56 AM
Marostegui added a subtask: Restricted Task.
Marostegui added a subtask: Restricted Task.

Change 457847 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Fix DB configuration in preparation for dc switchover

https://gerrit.wikimedia.org/r/457847

Missing partitions on codfw:

db2085:3311:enwiki:logging
db2088:3311:enwiki:logging

db2088:3312:bgwiktionary:revision
db2088:3312:bgwiktionary:logging
db2088:3312:eowiki:revision
db2088:3312:eowiki:logging
db2088:3312:idwiki:revision
db2088:3312:idwiki:logging

db2091:3312:bgwiktionary:revision
db2091:3312:bgwiktionary:logging
db2091:3312:eowiki:revision
db2091:3312:eowiki:logging
db2091:3312:idwiki:revision
db2091:3312:idwiki:logging

db2086:3317:frwiktionary:revision
db2086:3317:frwiktionary:logging
db2087:3317:frwiktionary:revision
db2087:3317:frwiktionary:logging

The only worrying one is enwiki; the others may not exist in eqiad either (I haven't checked).

I have checked bgwiktionary, eowiki, idwiki and frwiktionary, and they do not exist in eqiad either.
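For reference, a quick way to spot these gaps is to ask information_schema on each replica. A minimal sketch, run via the same mysql.py wrapper used elsewhere in this task (assuming the wrapper accepts host:port notation; schema and host names taken from the list above):

# Sketch: list partitions of logging/revision on one replica; a NULL
# PARTITION_NAME means the table is not partitioned on that host.
mysql.py -hdb2085:3311 -e "
  SELECT TABLE_SCHEMA, TABLE_NAME, PARTITION_NAME
  FROM information_schema.PARTITIONS
  WHERE TABLE_SCHEMA = 'enwiki'
    AND TABLE_NAME IN ('logging', 'revision');"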

Mentioned in SAL (#wikimedia-operations) [2018-09-04T11:11:50Z] <jynus> stopping replication and running partitioning on logging on db1085:3311 T189107

I have checked that no codfw hosts have notifications disabled on puppet or on icinga itself.

Mentioned in SAL (#wikimedia-operations) [2018-09-04T14:01:59Z] <jynus> stopping replication and running partitioning on logging on db1088:3311 T189107

Change 458125 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Set s5 section in read-write, codfw should be still in ro

https://gerrit.wikimedia.org/r/458125

Change 458125 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Set s5 section in read-write, codfw should be still in ro

https://gerrit.wikimedia.org/r/458125

Change 458128 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Set all codfw sections as read-only, codfw is still in ro

https://gerrit.wikimedia.org/r/458128

Change 458128 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Set all codfw sections as read-only, codfw is still in ro

https://gerrit.wikimedia.org/r/458128
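These two patches flip the MediaWiki-side read-only flag in mediawiki-config. Independently of that, the MariaDB-side read_only flag on the codfw masters can be spot-checked with a loop like the one used later in this task (a sketch; host list as in this task):

# Sketch: confirm the server-side read_only flag on every codfw master.
for i in db2048 db2035 db2043 db2051 db2052 db2039 db2040 db2045 db2034 es2016 es2017; do
  mysql.py -h$i -e "SELECT @@hostname, @@read_only;"
done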

Change 457847 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Fix DB configuration in preparation for dc switchover

https://gerrit.wikimedia.org/r/457847

Mentioned in SAL (#wikimedia-operations) [2018-09-06T08:06:16Z] <marostegui> Enable replication codfw -> eqiad on s5,s6,s2 - T189107

Mentioned in SAL (#wikimedia-operations) [2018-09-06T08:17:06Z] <marostegui> Enable replication codfw -> eqiad on s1,s3,s4 - T189107

Mentioned in SAL (#wikimedia-operations) [2018-09-06T08:25:07Z] <marostegui> Enable replication codfw -> eqiad on s7,s8,x1 - T189107

Mentioned in SAL (#wikimedia-operations) [2018-09-06T08:32:05Z] <marostegui> Enable replication codfw -> eqiad on es2,es3 - T189107

Replication codfw -> eqiad has been enabled on s1-s8,x1,es2,es3
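For the record, a sketch of what each of those steps amounts to on the eqiad master of a section. The host names, replication user and binlog coordinates below are placeholders, not the real values:

# Sketch: make an eqiad master replicate from its codfw counterpart.
# All hosts, credentials and coordinates here are placeholders.
mysql.py -hdb1067 -e "
  CHANGE MASTER TO
    MASTER_HOST = 'db2048.codfw.wmnet',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '********',
    MASTER_LOG_FILE = 'db2048-bin.001234',
    MASTER_LOG_POS  = 5678;
  START SLAVE;"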

Mentioned in SAL (#wikimedia-operations) [2018-09-11T10:37:41Z] <marostegui> Disable GTID on all codfw masters (sX, x1, esX) (not in db2040 as it is not enabled there) T189107

The following masters had GTID enabled - I have disabled it:

db2048
                   Using_Gtid: Slave_Pos
db2035
                   Using_Gtid: Slave_Pos
db2043
                   Using_Gtid: Slave_Pos
db2051
                   Using_Gtid: Slave_Pos
db2052
                   Using_Gtid: Slave_Pos
db2039
                   Using_Gtid: Slave_Pos
db2040
                   Using_Gtid: No
db2045
                   Using_Gtid: Slave_Pos
db2034
                   Using_Gtid: Slave_Pos
es2016
                   Using_Gtid: Slave_Pos
es2017
                   Using_Gtid: Slave_Pos

So we are now running without it:

db2048
                   Using_Gtid: No
db2035
                   Using_Gtid: No
db2043
                   Using_Gtid: No
db2051
                   Using_Gtid: No
db2052
                   Using_Gtid: No
db2039
                   Using_Gtid: No
db2040
                   Using_Gtid: No
db2045
                   Using_Gtid: No
db2034
                   Using_Gtid: No
es2016
                   Using_Gtid: No
es2017
                   Using_Gtid: No
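The per-host change behind this is small; a sketch of the statements run on each of the masters above (MariaDB syntax, using the same mysql.py wrapper as elsewhere in this task):

# Sketch: switch one codfw master back to binlog-position replication,
# then verify; repeated for every host listed above.
mysql.py -hdb2048 -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = no; START SLAVE;"
mysql.py -hdb2048 -e "SHOW SLAVE STATUS\G" | grep Using_Gtid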

Mentioned in SAL (#wikimedia-operations) [2018-09-13T08:01:06Z] <marostegui> Disconnect replication eqiad -> codfw on s1-s8, x1, es2, es3 - T189107

Replication has been disconnected from eqiad to codfw:

root@neodymium:/home/marostegui# for i in db2048 db2035 db2043 db2051 db2052 db2039 db2040 db2045 db2034 es2016 es2017; do mysql.py -h$i -e "show slave status\G" ; done
root@neodymium:/home/marostegui#
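The empty output above is what is expected after something like the following on each codfw master (a sketch; RESET SLAVE ALL also discards the stored connection settings, so show slave status returns nothing):

# Sketch: stop and fully deconfigure replication on one codfw master.
mysql.py -hdb2048 -e "STOP SLAVE; RESET SLAVE ALL;"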

Mentioned in SAL (#wikimedia-operations) [2018-09-13T09:44:43Z] <marostegui> Enable GTID on eqiad masters - T189107

GTID enabled on all eqiad masters except db1071 (s8) and db1068 (s4), as they are currently running a big ALTER.
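This is the counterpart of the earlier codfw disable step; a sketch per eqiad master (host name is a placeholder), using slave_pos as in the earlier output:

# Sketch: re-enable GTID-based replication on one eqiad master.
mysql.py -hdb1067 -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = slave_pos; START SLAVE;"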

Marostegui closed subtask Restricted Task as Resolved.Sep 14 2018, 8:10 AM

T203565 and T186188 no longer block on or depend on the DC failover - removing them as subtasks.

An update on this.
We have pretty much completed all the tasks we had scheduled for the failover, and we are now advancing on other tasks to complete them sooner.
Still pending: T148507, which is only blocked on one host and will presumably be done during the network maintenance, so we only take downtime once.

Regarding the other subtask, T184267: we have upgraded all the "difficult" eqiad hosts, meaning the core masters. The rest of the hosts will be done slowly but steadily :-)

Marostegui closed subtask Restricted Task as Resolved.Oct 4 2018, 4:09 PM
Marostegui claimed this task.

All the tasks we scheduled to do whilst eqiad was passive were done!
We also included T184805 as a last-minute addition, which was also completed; only cleanup is left. So I am going to close this meta task.

Thanks everyone for getting so much work done in eqiad!

MoritzMuehlenhoff closed subtask Restricted Task as Resolved.Jun 24 2019, 4:05 PM