Failover s4 primary master: db1068 to db1081
Closed, Resolved (Public)

Assigned To

Authored By

	• Marostegui
	Jun 3 2019, 5:40 AM

Description

db1068 needs to be failed over to db1081 due to:

db1068 having memory issues T213664
db1068 needs to be decommissioned T217396

Scheduled read-only day:
Date: Wed 19th June
Time: 05:00 AM UTC - 05:30 AM UTC (we expect not to use the full 30 minutes window)

Details

Subject	Repo	Branch	Lines +/-
wmnet: Update s4-master CNAME	operations/dns	master	+1 -1
db-eqiad.php: Promote db1081 to s4 master	operations/mediawiki-config	master	+2 -2
db-eqiad.php: Set s4 in read only	operations/mediawiki-config	master	+1 -1
mariadb: Promote db1081 to s4 master	operations/puppet	production	+5 -5
db-eqiad.php: Depool db1081	operations/mediawiki-config	master	+1 -1
db-eqiad.php: Depool db1121	operations/mediawiki-config	master	+4 -4
db-eqiad.php: Depool db1084	operations/mediawiki-config	master	+2 -2
db-eqiad.php: Depool db1097	operations/mediawiki-config	master	+12 -12
db-eqiad.php: Remove db1081 from API	operations/mediawiki-config	master	+1 -2
db-eqiad.php: Add db1138 to API in s4	operations/mediawiki-config	master	+2 -1
db-eqiad.php: Depool db1103	operations/mediawiki-config	master	+12 -12
db-eqiad.php: Depool db1081	operations/mediawiki-config	master	+3 -3

Related Objects

Status	Assigned	Task
Resolved	RLazarus	T243314 FY2020-2021 Q1 DC switchover and switchback
Resolved	RLazarus	T243316 FY2020-2021 Q1 eqiad -> codfw switchover
Resolved	• Marostegui	T186188 Failover DB masters in row D
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	• Marostegui	T220170 Address Database hardware infrastructure blockers on datacenter switchover & multi-dc deployment
Resolved	• Marostegui	T217396 Decommission db1061-db1073
Resolved	• Marostegui	T224852 Failover s4 primary master: db1068 to db1081
Resolved	Johan	T224516 Database primary master failover on s4 (commonswiki)

Event Timeline

• Marostegui triaged this task as Medium priority. Jun 3 2019, 5:40 AM

• Marostegui created this task.

• Marostegui added a subtask: T224516: Database primary master failover on s4 (commonswiki).

• Marostegui updated the task description.

codfw hosts have been all upgraded to 10.1.39

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:45:26Z] <marostegui> Upgrade mariadb on dbstore1004 - T224852

• Marostegui added a parent task: T220170: Address Database hardware infrastructure blockers on datacenter switchover & multi-dc deployment. Jun 3 2019, 5:50 AM

Change 513947 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/513947

gerritbot added a project: Patch-For-Review. Jun 3 2019, 5:55 AM

Change 513947 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/513947

Mentioned in SAL (#wikimedia-operations) [2019-06-03T06:04:52Z] <marostegui> Stop MySQL on db1081 for upgrade - T224852

Maintenance_bot removed a project: Patch-For-Review. Jun 3 2019, 6:10 AM

• Marostegui added a subscriber: elukey. Jun 3 2019, 6:20 AM

Change 513956 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103

https://gerrit.wikimedia.org/r/513956

gerritbot added a project: Patch-For-Review. Jun 3 2019, 7:26 AM

Change 513956 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103

https://gerrit.wikimedia.org/r/513956

Mentioned in SAL (#wikimedia-operations) [2019-06-03T07:29:55Z] <marostegui> Stop MySQL on db1103 (s2 and s4) for upgrade T224852

• jcrespo added a parent task: T186188: Failover DB masters in row D. Jun 3 2019, 7:30 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-03T07:48:50Z] <marostegui> Repool db1103 after upgrade T224852

Maintenance_bot removed a project: Patch-For-Review. Jun 3 2019, 8:10 AM

Change 514000 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add db1138 to API in s4

https://gerrit.wikimedia.org/r/514000

gerritbot added a project: Patch-For-Review. Jun 3 2019, 1:30 PM

Change 514000 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add db1138 to API in s4

https://gerrit.wikimedia.org/r/514000

Maintenance_bot removed a project: Patch-For-Review. Jun 3 2019, 2:11 PM

Change 514208 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Remove db1081 from API

https://gerrit.wikimedia.org/r/514208

gerritbot added a project: Patch-For-Review. Jun 4 2019, 4:55 AM

Change 514208 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Remove db1081 from API

https://gerrit.wikimedia.org/r/514208

Maintenance_bot removed a project: Patch-For-Review. Jun 4 2019, 5:10 AM

Change 514210 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097

https://gerrit.wikimedia.org/r/514210

Change 514210 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097

https://gerrit.wikimedia.org/r/514210

Mentioned in SAL (#wikimedia-operations) [2019-06-04T05:40:47Z] <marostegui> Stop MySQL on db1091 for MySQL upgrade T224852

Maintenance_bot removed a project: Patch-For-Review. Jun 4 2019, 6:10 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:07:19Z] <marostegui> Upgrade Mysql on labsdb1012 - T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:31:42Z] <marostegui> Stop MySQL on db1125 (sanitarium) s2,s4,s6,s7 to upgrade mysql - T224852

Change 514419 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1084

https://gerrit.wikimedia.org/r/514419

gerritbot added a project: Patch-For-Review. Jun 5 2019, 5:40 AM

Change 514419 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1084

https://gerrit.wikimedia.org/r/514419

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:49:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1084 for upgrade T224852 (duration: 01m 06s)

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:49:22Z] <marostegui> Upgrade MySQL on db1084 T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:57:26Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1084 after upgrade T224852 (duration: 00m 55s)

Maintenance_bot removed a project: Patch-For-Review. Jun 5 2019, 6:10 AM

Change 514652 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121

https://gerrit.wikimedia.org/r/514652

gerritbot added a project: Patch-For-Review. Jun 6 2019, 6:20 AM

Change 514652 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121

https://gerrit.wikimedia.org/r/514652

Mentioned in SAL (#wikimedia-operations) [2019-06-06T06:23:50Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1121 for upgrade T224852 (duration: 00m 55s)

Maintenance_bot removed a project: Patch-For-Review. Jun 6 2019, 7:10 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-06T07:20:36Z] <marostegui> Stop MySQL on db1121 for upgrade, this will generate lag on labs hosts for s6 - T224852

All hosts in codfw are now running 10.1.39 so we are ready for the failover from that front.

Mentioned in SAL (#wikimedia-operations) [2019-06-06T07:35:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1121 after upgrade T224852 (duration: 00m 53s)

• jcrespo mentioned this in T224805: db1062 (s7 db primary master) disk with predictive failure. Jun 6 2019, 3:57 PM

• Marostegui mentioned this in T222978: Compress and defragment tables on labsdb hosts. Jun 12 2019, 10:12 AM

Change 517360 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s4-master CNAME

https://gerrit.wikimedia.org/r/517360

gerritbot added a project: Patch-For-Review. Jun 17 2019, 5:38 AM

Change 517361 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517361

Change 517362 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set s4 in read only

https://gerrit.wikimedia.org/r/517362

Change 517363 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517363

Mentioned in SAL (#wikimedia-operations) [2019-06-18T17:38:01Z] <jynus> testing switchover automation on es2001/es2002 T224852

Testing went as expected:

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
[ERROR]: Initial read_only status check failed: original master read_only: 1 / original slave read_only: 1

It errored out because both hosts were in read only.
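For readers unfamiliar with this preflight check, here is a minimal sketch of the condition it appears to enforce, inferred only from the output above: the current master must have read_only=0 and the replica read_only=1. The function name `check_read_only` is hypothetical and this is not the actual switchover.py code.

```python
def check_read_only(master_read_only: int, replica_read_only: int) -> None:
    """Raise if the pair is not in the state expected before a switchover.

    Hypothetical helper: it only mirrors the condition visible in the
    transcript (master writable, replica read-only).
    """
    if master_read_only != 0 or replica_read_only != 1:
        raise RuntimeError(
            "Initial read_only status check failed: "
            f"original master read_only: {master_read_only} / "
            f"original slave read_only: {replica_read_only}"
        )


# The expected state passes silently.
check_read_only(master_read_only=0, replica_read_only=1)
```

With both values at 1, as in the first run, this sketch raises before any change is made to either host.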

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
13923 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.33hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 13923 at es2001.codfw.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 13923' -----                                                                                 
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.36hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 13923'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.2979440689086914 seconds
Servers sync at master: es2001-bin.000016:76967 slave: es2002-bin.000002:62473
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
[ERROR]: We could not start replicating towards the original master

It errored out because the port was not open in the es2001 -> es2002 direction (the reverse of the replication direction).
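A quick way to catch this kind of problem before a switchover is to test TCP reachability in the reverse direction ahead of time. The sketch below is an assumption about how one might do that, not part of switchover.py: `port_is_open` is a hypothetical helper, and it would have to be run from the host that initiates the connection (here es2001), since firewall rules are directional. Port 3306 is the one shown in the transcript.

```python
import socket


def port_is_open(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `port_is_open("es2002.codfw.wmnet")` run on es2001 would only confirm that the port accepts connections; it says nothing about MySQL grants or replication credentials.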

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
================                                                                                                        
PASS:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
FAIL:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.33hosts/s]     
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2001.codfw.wmnet
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2001.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[WARNING]: Could not find a pt-heartbeat process to kill, using heartbeat table to determine the section
Setting up original master as read-only
Slave caught up to the master after waiting 0.2981760501861572 seconds
Servers sync at master: es2001-bin.000016:76967 slave: es2002-bin.000002:62473
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es4 at es2002.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                             
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.35hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
 1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.34hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2002.codfw.wmnet running with PID 1777
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

It was successful, but note the warning: because no pt-heartbeat process was running, it used the heartbeat table to determine the section, which is not as reliable.
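To illustrate why a running pt-heartbeat process is the more direct source, the section is visible in its command line as `--shard=...`. The following is a rough sketch, assuming `ps` output in the `pid,args` format seen above; `parse_heartbeat` is a hypothetical name and not the function used by switchover.py.

```python
from typing import Optional, Tuple


def parse_heartbeat(ps_output: str) -> Optional[Tuple[int, str]]:
    """Return (pid, section) of the first pt-heartbeat line, or None."""
    for line in ps_output.splitlines():
        if "pt-heartbeat" not in line:
            continue
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # not a "pid args" line
        for field in fields[1:]:
            if field.startswith("--shard="):
                return int(fields[0]), field.split("=", 1)[1]
    return None


# Abbreviated version of a line from the transcript above.
sample = (
    " 1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia "
    "--shard=es4 --datacenter=codfw --update --replace"
)
print(parse_heartbeat(sample))  # (1777, 'es4')
```

When this returns None, a tool has nothing to parse and must fall back to another source such as the heartbeat table, which is the situation the warning describes.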

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2002 es2001
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
 1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.37hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 1777 at es2002.codfw.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 1777' -----                                                                                  
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.41hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 1777'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.2967565059661865 seconds
Servers sync at master: es2002-bin.000002:133806 slave: es2001-bin.000016:134389
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es4 at es2001.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                             
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.36hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
14728 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.37hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2001.codfw.wmnet running with PID 14728
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

The last run was fully successful.

The thing to stress (even if it is only a reminder) is that --skip-slave-move must be used (I actually "fixed" it, but the fix is not properly tested yet). Also, HEAD has a limitation: if it is run from a local directory, RemoteExecution.py must have the 'wmfmariadbpy.' namespace removed. (I have it like that in my home dir on both datacenters.)
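As a generic illustration only (the comment above describes editing the file by hand, and nothing here reflects the actual contents of RemoteExecution.py), a namespace limitation like this is often worked around by trying the package-qualified import first and falling back to the bare module name. `import_with_fallback` is a hypothetical helper, and the module names in the demonstration are stand-ins.

```python
import importlib
from types import ModuleType


def import_with_fallback(qualified_name: str, bare_name: str) -> ModuleType:
    """Import qualified_name, falling back to bare_name if it is missing."""
    try:
        return importlib.import_module(qualified_name)
    except ImportError:
        # The package-qualified path is not importable (for example when
        # running from a local checkout), so try the bare module name.
        return importlib.import_module(bare_name)


# Stand-in names: the first package does not exist, so the standard
# library json module is returned instead.
module = import_with_fallback("no_such_package_example.json", "json")
print(module.__name__)  # json
```

One trade-off of this pattern is that catching ImportError can also mask a genuine import failure inside the qualified module, so it is better suited to a temporary workaround than to a permanent fix.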

Reviewed all patches, only commented on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/517363 (but +1ed too).

Running a last compare.py just to be super-safe. Will also check we have a fresh s4 snapshot in a few hours.

Mentioned in SAL (#wikimedia-operations) [2019-06-18T18:20:41Z] <jynus> running data compare on s4 (commons) databases T224852

Thanks for all the checks!
I will depool db1081 early in the morning, good idea :)

Change 517787 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/517787

Change 517787 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/517787

Mentioned in SAL (#wikimedia-operations) [2019-06-19T04:19:53Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1081 T224852 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-06-19T04:28:22Z] <marostegui> Starting pre-steps for the s4 failover that will happen at 05:00 UTC - T224852

Change 517361 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517361

Change 517362 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s4 in read only

https://gerrit.wikimedia.org/r/517362

Change 517363 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517363

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:00:16Z] <marostegui> Starting s4 failover from db1068 to db1081 - T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:01:02Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s4 on read-only T224852 (duration: 00m 34s)

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:02:25Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Switchover s4 master eqiad from db1068 to db1081 T224852 (duration: 00m 33s)

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:03:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove s4 ready only T224852 (duration: 00m 33s)