Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | RLazarus | T243314 FY2020-2021 Q1 DC switchover and switchback
Resolved | | RLazarus | T243316 FY2020-2021 Q1 eqiad -> codfw switchover
Resolved | | Marostegui | T186188 Failover DB masters in row D
Resolved | | aaron | T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved | | Marostegui | T220170 Address Database hardware infrastructure blockers on datacenter switchover & multi-dc deployment
Resolved | | Marostegui | T217396 Decommission db1061-db1073
Resolved | | Marostegui | T224852 Failover s4 primary master: db1068 to db1081
Resolved | | Johan | T224516 Database primary master failover on s4 (commonswiki)
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2019-06-03T05:45:26Z] <marostegui> Upgrade mariadb on dbstore1004 - T224852
eqiad hosts that need the upgrade:
- db1081 candidate master
- db1084
- db1091
- db1097
- db1103
- db1121 sanitarium master
- db1125 sanitarium
- labsdb1012 (coordinating with @elukey)
- dbstore1004
Change 513947 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081
Change 513947 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081
Mentioned in SAL (#wikimedia-operations) [2019-06-03T06:04:52Z] <marostegui> Stop MySQL on db1081 for upgrade - T224852
Change 513956 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103
Change 513956 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103
Mentioned in SAL (#wikimedia-operations) [2019-06-03T07:29:55Z] <marostegui> Stop MySQL on db1103 (s2 and s4) for upgrade T224852
Mentioned in SAL (#wikimedia-operations) [2019-06-03T07:48:50Z] <marostegui> Repool db1103 after upgrade T224852
Change 514000 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Add db1138 to API in s4
Change 514000 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Add db1138 to API in s4
Change 514208 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Remove db1081 from API
Change 514208 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Remove db1081 from API
Change 514210 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097
Change 514210 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097
Mentioned in SAL (#wikimedia-operations) [2019-06-04T05:40:47Z] <marostegui> Stop MySQL on db1091 for MySQL upgrade T224852
Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:07:19Z] <marostegui> Upgrade Mysql on labsdb1012 - T224852
Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:31:42Z] <marostegui> Stop MySQL on db1125 (sanitarium) s2,s4,s6,s7 to upgrade mysql - T224852
Change 514419 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1084
Change 514419 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1084
Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:49:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1084 for upgrade T224852 (duration: 01m 06s)
Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:49:22Z] <marostegui> Upgrade MySQL on db1084 T224852
Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:57:26Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1084 after upgrade T224852 (duration: 00m 55s)
Change 514652 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121
Change 514652 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1121
Mentioned in SAL (#wikimedia-operations) [2019-06-06T06:23:50Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1121 for upgrade T224852 (duration: 00m 55s)
Mentioned in SAL (#wikimedia-operations) [2019-06-06T07:20:36Z] <marostegui> Stop MySQL on db1121 for upgrade, this will generate lag on labs hosts for s6 - T224852
All hosts in codfw are now running 10.1.39, so we are ready for the failover on that front.
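Confirming a section is ready boils down to checking that every host reports the expected server version. A minimal sketch of that check (the host list and the reported strings are illustrative; in practice the versions come from running SELECT VERSION() on each host via cumin):

```python
# Sketch: verify every host in a section reports the expected MariaDB version.
# Host names and reported strings below are examples, not live query results.

EXPECTED = "10.1.39"

def version_ok(version_string: str, expected: str = EXPECTED) -> bool:
    """Return True if a VERSION() string like '10.1.39-MariaDB' matches expected."""
    return version_string.split("-", 1)[0] == expected

def hosts_needing_upgrade(versions: dict) -> list:
    """Return the hosts whose reported version does not match EXPECTED."""
    return [host for host, v in sorted(versions.items()) if not version_ok(v)]

reported = {
    "es2001.codfw.wmnet": "10.1.39-MariaDB",
    "es2002.codfw.wmnet": "10.1.39-MariaDB",
}
print(hosts_needing_upgrade(reported))  # [] means the section is ready
```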
Mentioned in SAL (#wikimedia-operations) [2019-06-06T07:35:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1121 after upgrade T224852 (duration: 00m 53s)
Change 517360 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s4-master CNAME
Change 517361 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1081 to s4 master
Change 517362 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set s4 in read only
Change 517363 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1081 to s4 master
Mentioned in SAL (#wikimedia-operations) [2019-06-18T17:38:01Z] <jynus> testing switchover automation on es2001/es2002 T224852
Testing went as expected:
```
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
[ERROR]: Initial read_only status check failed: original master read_only: 1 / original slave read_only: 1
```
It errored out because both hosts were in read-only mode.
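The failed preflight check above enforces a simple invariant: before the switchover, the current master must have read_only=0 and the replica read_only=1. A small sketch of that logic (the function name is illustrative; the real script reads @@GLOBAL.read_only from each server):

```python
# Sketch of the pre-switchover read_only invariant that failed in the run
# above. preflight_read_only() is a hypothetical helper, not switchover.py's
# actual API; values would come from @@GLOBAL.read_only on each host.

def preflight_read_only(master_ro: int, slave_ro: int) -> None:
    """Raise if the pair is not in the expected pre-switchover state."""
    if (master_ro, slave_ro) != (0, 1):
        raise RuntimeError(
            "Initial read_only status check failed: "
            f"original master read_only: {master_ro} / "
            f"original slave read_only: {slave_ro}"
        )

# Both hosts read-only, as in the failed run above:
try:
    preflight_read_only(1, 1)
except RuntimeError as e:
    print(e)
```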
```
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----
13923 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.33hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 13923 at es2001.codfw.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 13923' -----
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.36hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 13923'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.2979440689086914 seconds
Servers sync at master: es2001-bin.000016:76967 slave: es2002-bin.000002:62473
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
[ERROR]: We could not start replicating towards the original master
```
It errored out because of the lack of an open port in the es2001 -> es2002 direction (the reverse of the original replication flow).
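Before retrying, it is worth probing whether the old master can actually reach the new master's MySQL port. A generic TCP probe would look like this (the helper is a sketch, not part of wmfmariadbpy; port 3306 and the host names are from the runs above):

```python
# Sketch: a plain TCP connectivity probe, useful to verify the reverse
# replication path (es2001 -> es2002:3306) is open before inverting
# replication. can_connect() is an illustrative helper.
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From es2001, can_connect("es2002.codfw.wmnet", 3306) should return True
# before the switchover is retried.
```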
```
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2001 es2002
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----
================
PASS: | | 0% (0/1) [00:00<?, ?hosts/s]
FAIL: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.33hosts/s]
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2001.codfw.wmnet
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2001.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[WARNING]: Could not find a pt-heartbeat process to kill, using heartbeat table to determine the section
Setting up original master as read-only
Slave caught up to the master after waiting 0.2981760501861572 seconds
Servers sync at master: es2001-bin.000016:76967 slave: es2002-bin.000002:62473
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es4 at es2002.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.35hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----
1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.34hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2002.codfw.wmnet running with PID 1777
Verifying everything went as expected...
SUCCESS: Master switch completed successfully
```
It was successful, but note the warning: because no running pt-heartbeat process was found, the script used the heartbeat table to determine the section, which is less reliable.
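The two ways of determining the section can be sketched as follows: parse --shard= from the running pt-heartbeat command line, or fall back to the heartbeat table. The fallback is less reliable because stale rows from other sections may still be present. Both helpers and the row format below are illustrative, not wmfmariadbpy's actual code:

```python
# Sketch of the two section-detection strategies mentioned above.
# section_from_process() and section_from_table() are hypothetical helpers.
import shlex

def section_from_process(cmdline: str):
    """Extract the --shard=... value from a pt-heartbeat command line."""
    for token in shlex.split(cmdline):
        if token.startswith("--shard="):
            return token.split("=", 1)[1]
    return None

def section_from_table(rows):
    """Fallback: take the shard of the most recently updated heartbeat row.
    Less reliable -- stale rows from other sections may win."""
    if not rows:
        return None
    return max(rows, key=lambda r: r["ts"])["shard"]

cmd = ("/usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia "
       "--shard=es4 --datacenter=codfw --update")
print(section_from_process(cmd))  # es4
```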
```
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2002 es2001
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----
1777 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.37hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 1777 at es2002.codfw.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 1777' -----
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.41hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 1777'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.2967565059661865 seconds
Servers sync at master: es2002-bin.000002:133806 slave: es2001-bin.000016:134389
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es4 at es2001.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.36hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----
14728 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es4 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================
PASS: |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.37hosts/s]
FAIL: | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2001.codfw.wmnet running with PID 14728
Verifying everything went as expected...
SUCCESS: Master switch completed successfully
```
The last run was fully successful.
Two things to stress (even if they are reminders): --skip-slave-move must be used (I actually "fixed" it, but the fix is not properly tested yet), and HEAD has a limitation: when run from a local directory, RemoteExecution.py must have the 'wmfmariadbpy.' namespace removed. (I have it set up that way in my home directory in both datacenters.)
Reviewed all patches, only commented on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/517363 (but +1ed too).
Running a last compare.py just to be super-safe. Will also check we have a fresh s4 snapshot in a few hours.
Mentioned in SAL (#wikimedia-operations) [2019-06-18T18:20:41Z] <jynus> running data compare on s4 (commons) databases T224852
Change 517787 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081
Change 517787 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081
Mentioned in SAL (#wikimedia-operations) [2019-06-19T04:19:53Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1081 T224852 (duration: 00m 57s)
Mentioned in SAL (#wikimedia-operations) [2019-06-19T04:28:22Z] <marostegui> Starting pre-steps for the s4 failover that will happen at 05:00 UTC - T224852
Change 517361 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1081 to s4 master
Change 517362 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s4 in read only
Change 517363 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1081 to s4 master
Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:00:16Z] <marostegui> Starting s4 failover from db1068 to db1081 - T224852
Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:01:02Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s4 on read-only T224852 (duration: 00m 34s)
Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:02:25Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Switchover s4 master eqiad from db1068 to db1081 T224852 (duration: 00m 33s)
Mentioned in SAL (#wikimedia-operations) [2019-06-19T05:03:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove s4 ready only T224852 (duration: 00m 33s)
Change 517360 merged by Marostegui:
[operations/dns@master] wmnet: Update s4-master CNAME
The failover completed successfully.
Read only times (UTC):
Start: 05:01:02
Stop: 05:03:20
Total read-only time: 2 minutes 18 seconds
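The read-only window above can be verified with simple datetime arithmetic on the two SAL timestamps:

```python
# Compute the s4 read-only window from the start/stop timestamps above.
from datetime import datetime

start = datetime.fromisoformat("2019-06-19T05:01:02")
stop = datetime.fromisoformat("2019-06-19T05:03:20")
window = stop - start
print(window)  # 0:02:18, i.e. 2 minutes 18 seconds of read-only time
```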