Switchover s1 master db1057 -> db1052
Closed, Resolved, Public

Description

We need to switch the s1 master to mitigate T155875.
Let's use this ticket as a meta ticket to discuss and note what needs to be done for the switch.
The following tasks need to be completed before we can start preparing it:

T156004
T156005
T156006

Change 333970 had a related patch set uploaded (by Jcrespo):
mariadb: Set binlog_format to STATEMENT for db1052

https://gerrit.wikimedia.org/r/333970

Change 333970 merged by Jcrespo:
mariadb: Set binlog_format to STATEMENT for db1052

https://gerrit.wikimedia.org/r/333970

I have upgraded all packages except wmf-mariadb10 and restarted the server for kernel update.

Change 334008 had a related patch set uploaded (by Jcrespo):
mariadb: Repool db1052 after maintenance

https://gerrit.wikimedia.org/r/334008

Change 334008 merged by jenkins-bot:
mariadb: Repool db1052 after maintenance

https://gerrit.wikimedia.org/r/334008

Change 334030 had a related patch set uploaded (by Marostegui):
site.pp: Change active master for enwiki

https://gerrit.wikimedia.org/r/334030

This will be happening on Thursday 26th at 07:00 UTC.

Change 334242 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Change s1 master

https://gerrit.wikimedia.org/r/334242

Change 334243 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1052

https://gerrit.wikimedia.org/r/334243

Change 334243 merged by jenkins-bot:
db-eqiad.php: Depool db1052

https://gerrit.wikimedia.org/r/334243

Mentioned in SAL (#wikimedia-operations) [2017-01-26T06:49:45Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1052 - T156008 (duration: 00m 31s)

Change 334242 merged by jenkins-bot:
db-eqiad.php: Change s1 master

https://gerrit.wikimedia.org/r/334242

Change 334030 merged by Marostegui:
site.pp: Change active master for enwiki

https://gerrit.wikimedia.org/r/334030

Mentioned in SAL (#wikimedia-operations) [2017-01-26T07:32:55Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Change s1 master to db1057 - T156008 (duration: 00m 20s)

This has happened already.

Times in UTC:

Preparation of all the code, topology changes etc: 06:30-07:30
read only on: 07:30:40
do all the necessary checks to make sure we were good
start to deploy mediawiki config change: 07:32:34
finished deploying mediawiki config change: 07:32:54

Total read only time: 02:24.
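
For reference, the intervals between the timestamps listed above can be recomputed with a few lines of Python. This is only a sketch over the three logged times; it assumes they all fall on the same day. The listed steps add up to 02:14, so about ten seconds of the reported 02:24 are not itemized in the timeline (possibly the time taken to lift read only after the deploy, which is a guess and not stated here).

```python
from datetime import datetime

FMT = "%H:%M:%S"

def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

def as_mm_ss(seconds: int) -> str:
    """Format a duration in seconds as MM:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"

# Timestamps copied from the timeline in this ticket (UTC).
read_only_on = "07:30:40"
deploy_started = "07:32:34"
deploy_finished = "07:32:54"

checks = elapsed_seconds(read_only_on, deploy_started)     # read only on -> deploy start
deploy = elapsed_seconds(deploy_started, deploy_finished)  # config deploy duration

print("checks:", as_mm_ss(checks))
print("deploy:", as_mm_ss(deploy))
print("read only on -> deploy finished:", as_mm_ss(checks + deploy))
```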

We are now going to start all the clean up work.

Thanks @Joe and @Volans for helping out!
If anyone sees anything wrong, please let us know.

Mentioned in SAL (#wikimedia-operations) [2017-01-26T08:48:57Z] <marostegui> Change db1069 to replicate from the new s1 master db1052 - T156008

Mentioned in SAL (#wikimedia-operations) [2017-01-26T08:57:41Z] <marostegui> Change db1047 to replicate from the new s1 master db1052 - T156008

Mentioned in SAL (#wikimedia-operations) [2017-01-26T09:04:31Z] <marostegui> Change dbstore1002 to replicate from the new s1 master db1052 - T156008

recap of the cleanup work:

dns changed for s1-master.eqiad.wmnet
multisource slaves changed (only pending dbstore1001): db1047, db1069, dbstore1002
replication db1057 -> db1052 cleaned up
gtid enabled on db1057
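
A rough sketch of what repointing one multisource slave involves, written as a Python helper that only builds the SQL text. The `CHANGE MASTER 'name' TO ...` form is MariaDB's named-connection syntax for multi-source replication. The connection name `s1`, the fully qualified host name, and the binlog coordinates are placeholders assumed for illustration; they are not the values used during this switchover.

```python
from typing import List

def repoint_statements(connection: str, new_master: str,
                       log_file: str, log_pos: int) -> List[str]:
    """Build the MariaDB statements to move one named replication
    connection to a new master. Nothing is executed here."""
    return [
        f"STOP SLAVE '{connection}';",
        (f"CHANGE MASTER '{connection}' TO "
         f"MASTER_HOST='{new_master}', "
         f"MASTER_LOG_FILE='{log_file}', "
         f"MASTER_LOG_POS={log_pos};"),
        f"START SLAVE '{connection}';",
    ]

# Hypothetical values: connection name, FQDN and coordinates are assumptions.
statements = repoint_statements("s1", "db1052.eqiad.wmnet",
                                "db1052-bin.000001", 4)
for statement in statements:
    print(statement)
```

The real coordinates have to be taken from the new master at a point where the slave has applied exactly the same events, which is why the replicas were moved one at a time after the switch.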

Pending:
change dbstore1001 to replicate from db1052 once it has caught up
enable semisync on db1052
disable db1057 as true on site.pp?

Anything else you can see, @jcrespo?

Change 334254 had a related patch set uploaded (by Marostegui):
s1.hosts: db1052 is the new master

https://gerrit.wikimedia.org/r/334254

Mentioned in SAL (#wikimedia-operations) [2017-01-26T09:39:33Z] <marostegui> Enable semi-sync replication on db1052 (s1 master) - T156008

Change 334256 had a related patch set uploaded (by Jcrespo):
mariadb: Move db1057 to be a regular slave on config after switch

https://gerrit.wikimedia.org/r/334256

Change 334256 merged by Jcrespo:
mariadb: Move db1057 to be a regular slave on config after switch

https://gerrit.wikimedia.org/r/334256

Mentioned in SAL (#wikimedia-operations) [2017-01-26T09:54:59Z] <marostegui> Disable semi-sync on db1057 old s1 master - https://phabricator.wikimedia.org/T156008
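
As an illustration of the two semi-sync changes logged above, here is a minimal sketch that builds the master-side toggle for each host. `rpl_semi_sync_master_enabled` is the variable exposed by the MariaDB semi-sync master plugin; whether exactly this variable (and only this one) was changed on each host is an assumption, since the log entries do not give the commands.

```python
def master_semisync_statement(enabled: bool) -> str:
    """Build the statement that turns master-side semi-sync on or off."""
    return f"SET GLOBAL rpl_semi_sync_master_enabled = {1 if enabled else 0};"

# New master gets semi-sync enabled, old master gets it disabled.
plan = {
    "db1052": master_semisync_statement(True),
    "db1057": master_semisync_statement(False),
}

for host, statement in plan.items():
    print(f"{host}: {statement}")
```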

Change 334259 had a related patch set uploaded (by Jcrespo):
prometheus-mysql-exporter: Change db1052 to be s1-master

https://gerrit.wikimedia.org/r/334259

Change 334254 merged by jenkins-bot:
s1.hosts: db1052 is the new master

https://gerrit.wikimedia.org/r/334254

Change 334259 merged by Jcrespo:
prometheus-mysql-exporter: Change db1052 to be s1-master

https://gerrit.wikimedia.org/r/334259

only pending:

  • change dbstore1001 to replicate from db1052
jcrespo claimed this task. Jan 31 2017, 3:39 PM
jcrespo closed this task as "Resolved". Jan 31 2017, 4:52 PM
jcrespo reassigned this task from jcrespo to Marostegui.

I changed the master of dbstore1001. Resolving now, but let's monitor dbstore1001 to make sure nothing broke (because of its delayed replication, it may not alert immediately).

Mentioned in SAL (#wikimedia-operations) [2017-02-01T14:25:28Z] <jynus> dropping and replacing events on db1057 - db1052 T156008

So far no replication errors! Your archeology work is a success :-).