
Failover s4 primary master: db1068 to db1081
Closed, Resolved, Public

Description

db1068 needs to be failed over to db1081 due to:

  1. db1068 having memory issues T213664
  2. db1068 needs to be decommissioned T217396

Scheduled read-only day:
Date: Wed 19th June
Time: 05:00 AM UTC - 05:30 AM UTC (we do not expect to use the full 30-minute window)

Details

Related Gerrit Patches:
[operations/dns@master] wmnet: Update s4-master CNAME
[operations/mediawiki-config@master] db-eqiad.php: Promote db1081 to s4 master
[operations/mediawiki-config@master] db-eqiad.php: Set s4 in read only
[operations/puppet@production] mariadb: Promote db1081 to s4 master
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121
[operations/mediawiki-config@master] db-eqiad.php: Depool db1084
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097
[operations/mediawiki-config@master] db-eqiad.php: Remove db1081 from API
[operations/mediawiki-config@master] db-eqiad.php: Add db1138 to API in s4
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

Event Timeline

Marostegui triaged this task as Medium priority. Jun 3 2019, 5:40 AM
Marostegui created this task.
Marostegui updated the task description.

All codfw hosts have been upgraded to 10.1.39.

Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:45:26Z] <marostegui> Upgrade mariadb on dbstore1004 - T224852

Marostegui moved this task from Triage to In progress on the DBA board. Edited Jun 3 2019, 5:46 AM

eqiad hosts that need upgrade

  • db1081 candidate master
  • db1084
  • db1091
  • db1097
  • db1103
  • db1121 sanitarium master
  • db1125 sanitarium
  • labsdb1012 (coordinating with @elukey)
  • dbstore1004

Change 513947 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/513947

Change 513947 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/513947

Mentioned in SAL (#wikimedia-operations) [2019-06-03T06:04:52Z] <marostegui> Stop MySQL on db1081 for upgrade - T224852

Change 513956 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103

https://gerrit.wikimedia.org/r/513956

Change 513956 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103

https://gerrit.wikimedia.org/r/513956

Mentioned in SAL (#wikimedia-operations) [2019-06-03T07:29:55Z] <marostegui> Stop MySQL on db1103 (s2 and s4) for upgrade T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-03T07:48:50Z] <marostegui> Repool db1103 after upgrade T224852

Change 514000 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add db1138 to API in s4

https://gerrit.wikimedia.org/r/514000

Change 514000 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add db1138 to API in s4

https://gerrit.wikimedia.org/r/514000

Change 514208 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Remove db1081 from API

https://gerrit.wikimedia.org/r/514208

Change 514208 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Remove db1081 from API

https://gerrit.wikimedia.org/r/514208

Change 514210 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097

https://gerrit.wikimedia.org/r/514210

Change 514210 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097

https://gerrit.wikimedia.org/r/514210

Mentioned in SAL (#wikimedia-operations) [2019-06-04T05:40:47Z] <marostegui> Stop MySQL on db1091 for MySQL upgrade T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:07:19Z] <marostegui> Upgrade Mysql on labsdb1012 - T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:31:42Z] <marostegui> Stop MySQL on db1125 (sanitarium) s2,s4,s6,s7 to upgrade mysql - T224852

Change 514419 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1084

https://gerrit.wikimedia.org/r/514419

Change 514419 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1084

https://gerrit.wikimedia.org/r/514419

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:49:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1084 for upgrade T224852 (duration: 01m 06s)

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:49:22Z] <marostegui> Upgrade MySQL on db1084 T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:57:26Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1084 after upgrade T224852 (duration: 00m 55s)

Change 514652 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121

https://gerrit.wikimedia.org/r/514652

Change 514652 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121

https://gerrit.wikimedia.org/r/514652

Mentioned in SAL (#wikimedia-operations) [2019-06-06T06:23:50Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1121 for upgrade T224852 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2019-06-06T07:20:36Z] <marostegui> Stop MySQL on db1121 for upgrade, this will generate lag on labs hosts for s6 - T224852

All hosts in codfw are now running 10.1.39, so we are ready for the failover on that front.

Mentioned in SAL (#wikimedia-operations) [2019-06-06T07:35:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1121 after upgrade T224852 (duration: 00m 53s)

Change 517360 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s4-master CNAME

https://gerrit.wikimedia.org/r/517360

Change 517361 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517361

Change 517362 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set s4 in read only

https://gerrit.wikimedia.org/r/517362

Change 517363 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517363

Mentioned in SAL (#wikimedia-operations) [2019-06-18T17:38:01Z] <jynus> testing switchover automation on es2001/es2002 T224852

Testing went as expected:

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
[ERROR]: Initial read_only status check failed: original master read_only: 1 / original slave read_only: 1

It errored out because both hosts were in read-only mode.
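The precondition that this first run tripped over can be sketched as follows. This is only an illustration of the rule visible in the output (master read_only=0, replica read_only=1), not the code in switchover.py; the function name `check_read_only` and its integer arguments are hypothetical, and a real check would read the value from each server.

```python
def check_read_only(master_read_only: int, replica_read_only: int) -> bool:
    """Return True only if the master is writable (0) and the replica is read-only (1)."""
    return master_read_only == 0 and replica_read_only == 1


# First test run: both hosts were read-only, so the check fails.
print(check_read_only(1, 1))  # False
# Later runs: master read_only=0, replica read_only=1, so the check passes.
print(check_read_only(0, 1))  # True
```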

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
13923 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.33hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 13923 at es2001.codfw.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 13923' -----                                                                                 
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.36hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 13923'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.2979440689086914 seconds
Servers sync at master: es2001-bin.000016:76967 slave: es2002-bin.000002:62473
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
[ERROR]: We could not start replicating towards the original master

It errored out because there was no open port in the es2001 -> es2002 direction (the reverse of the original replication direction).
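A reachability problem like this one could be checked ahead of time with a plain TCP connection attempt. The sketch below is an assumption about how one might do that, not something the task describes; `can_connect` is a hypothetical helper, and port 3306 is taken from the host:port shown in the output above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical usage, run on es2001 before inverting replication:
#   can_connect("es2002.codfw.wmnet", 3306)
```

This only shows that the port accepts connections; it says nothing about replication grants or credentials.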

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
================                                                                                                        
PASS:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
FAIL:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.33hosts/s]     
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2001.codfw.wmnet
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2001.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[WARNING]: Could not find a pt-heartbeat process to kill, using heartbeat table to determine the section
Setting up original master as read-only
Slave caught up to the master after waiting 0.2981760501861572 seconds
Servers sync at master: es2001-bin.000016:76967 slave: es2002-bin.000002:62473
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es4 at es2002.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                             
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.35hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
 1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.34hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2002.codfw.wmnet running with PID 1777
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

It was successful, but note the warning: because there was no pt-heartbeat, it used the table to determine the section, which is not as reliable.
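When a pt-heartbeat process is running, the section is visible in its command line as the `--shard=` argument (es4 in the output above). A minimal sketch of extracting it from a `ps` line is shown below; `section_from_heartbeat_args` is a hypothetical name, and this is not necessarily how switchover.py does it.

```python
import re
from typing import Optional


def section_from_heartbeat_args(ps_line: str) -> Optional[str]:
    """Extract the value of --shard=... from a pt-heartbeat command line, if present."""
    match = re.search(r"--shard=(\S+)", ps_line)
    return match.group(1) if match else None


# Shortened copy of a ps line from the output above.
line = (
    "1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia "
    "--defaults-file=/dev/null --user=root --host=localhost -D heartbeat "
    "--shard=es4 --datacenter=codfw --update --replace --interval=1"
)
print(section_from_heartbeat_args(line))  # es4
print(section_from_heartbeat_args(""))    # None
```

With no process to inspect, as in the run above, this returns None and the section has to come from somewhere else, which is where the heartbeat table fallback comes in.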

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2002 es2001
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
 1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.37hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 1777 at es2002.codfw.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 1777' -----                                                                                  
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.41hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 1777'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.2967565059661865 seconds
Servers sync at master: es2002-bin.000002:133806 slave: es2001-bin.000016:134389
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es4 at es2001.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                             
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.36hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                             
14728 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                        
PASS:  |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.37hosts/s]     
FAIL:  |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2001.codfw.wmnet running with PID 14728
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

The last run was fully successful.

The thing to stress (even if it is just a reminder) is that --skip-slave-move must be used (I actually "fixed" it, but the fix is not properly tested yet). HEAD also has a limitation: if it is run from a local directory, the 'wmfmariadbpy.' namespace prefix must be removed from RemoteExecution.py. (I have it set up like that in my home directory in both datacenters.)

Reviewed all patches, only commented on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/517363 (but +1ed too).

Running a last compare.py just to be super-safe. Will also check we have a fresh s4 snapshot in a few hours.

Mentioned in SAL (#wikimedia-operations) [2019-06-18T18:20:41Z] <jynus> running data compare on s4 (commons) databases T224852

Thanks for all the checks!
I will depool db1081 early in the morning, good idea :)

Change 517787 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/517787

Change 517787 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/517787

Mentioned in SAL (#wikimedia-operations) [2019-06-19T04:19:53Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1081 T224852 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2019-06-19T04:28:22Z] <marostegui> Starting pre-steps for the s4 failover that will happen at 05:00 UTC - T224852

Change 517361 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517361

Change 517362 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s4 in read only

https://gerrit.wikimedia.org/r/517362

Change 517363 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1081 to s4 master

https://gerrit.wikimedia.org/r/517363

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:00:16Z] <marostegui> Starting s4 failover from db1068 to db1081 - T224852

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:01:02Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s4 on read-only T224852 (duration: 00m 34s)

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:02:25Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Switchover s4 master eqiad from db1068 to db1081 T224852 (duration: 00m 33s)

Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:03:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove s4 ready only T224852 (duration: 00m 33s)

Change 517360 merged by Marostegui:
[operations/dns@master] wmnet: Update s4-master CNAME

https://gerrit.wikimedia.org/r/517360

The failover completed successfully.
Read-only times (UTC):

Start: 05:01:02
Stop: 05:03:20
Total read-only time: 2 minutes 18 seconds
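The total can be double-checked from the two SAL timestamps above. This is just the arithmetic, using the times as logged:

```python
from datetime import datetime

# Read-only start and stop times (UTC) taken from the SAL entries.
start = datetime.strptime("05:01:02", "%H:%M:%S")
stop = datetime.strptime("05:03:20", "%H:%M:%S")

elapsed = stop - start
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"{minutes} min {seconds} s")  # 2 min 18 s
```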

Marostegui closed this task as Resolved. Jun 19 2019, 5:23 AM

So far everything looks good, so closing this.