Failover db1052 (s1) db primary master
Closed, Resolved (Public)

Description

The s1 primary master, db1052, needs to be decommissioned (T186320), and it is also blocking the row B switch upgrade (T183585).

The candidate master is db1067.

This is a checklist of what needs to be done before the failover:

  • Verify db1067 has STATEMENT as binlog format
  • Upgrade micro-code and reboot on db1067

or

  • Reimage db1067 to stretch/10.1 and check the latest micro-code is installed and active (?)
  • Pick a date for the failover: July 18th (Wednesday) at 06:00AM UTC
  • Communicate with liaisons to handle the read-only time (T197134)
  • Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)
  • Change this to db1083 once db1067 is the master and the network maintenance has been done (T197069#4418823). This step will be followed up at T199861
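
The first checklist item (confirming that the candidate uses STATEMENT as its binlog format) can be sketched as a small check. This is only an illustration: the helper name `binlog_format_ok` and the sample rows are hypothetical, and it assumes the rows were fetched separately with `SHOW GLOBAL VARIABLES LIKE 'binlog_format'` on the candidate host.

```python
from typing import Iterable, Tuple


def binlog_format_ok(rows: Iterable[Tuple[str, str]], expected: str = "STATEMENT") -> bool:
    """Return True if the (variable_name, value) rows report the expected binlog format.

    `rows` is assumed to be the result of
    SHOW GLOBAL VARIABLES LIKE 'binlog_format' (hypothetical input here).
    """
    values = {name.lower(): value.upper() for name, value in rows}
    return values.get("binlog_format") == expected.upper()


# Hypothetical result rows, not taken from the real hosts.
print(binlog_format_ok([("binlog_format", "STATEMENT")]))
print(binlog_format_ok([("binlog_format", "ROW")]))
```

A host reporting ROW or MIXED would fail the check and would need its format changed and MySQL restarted before it could act as a candidate master.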
Restricted Application added a subscriber: Aklapper. Jun 13 2018, 8:22 AM
Marostegui moved this task from Triage to Backlog on the DBA board. Jun 13 2018, 8:23 AM

I would like to suggest July 18th (Wednesday) at 06:00AM UTC as a failover date

Marostegui moved this task from Backlog to Next on the DBA board.
Marostegui added a subscriber: ayounsi.

Seems ok to me at first. I would also like to check for blockers for the parent task, even if they are not blockers for this subtask.

Marostegui updated the task description. Jun 13 2018, 8:28 AM
jcrespo updated the task description. Jun 13 2018, 8:28 AM
jcrespo updated the task description.
Marostegui updated the task description. Jun 13 2018, 8:44 AM
Marostegui updated the task description. Jun 13 2018, 9:20 AM
Marostegui updated the task description. Jun 13 2018, 3:43 PM

Change 442250 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1067 for reimage

https://gerrit.wikimedia.org/r/442250

Change 442250 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1067 for reimage

https://gerrit.wikimedia.org/r/442250

Change 442252 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: db1067: disable notifications and reinstall as stretch

https://gerrit.wikimedia.org/r/442252

Change 442252 merged by Jcrespo:
[operations/puppet@production] mariadb: db1067: disable notifications and reinstall as stretch

https://gerrit.wikimedia.org/r/442252

Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['db1067.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806271016_jynus_25771.log.

Completed auto-reimage of hosts:

['db1067.eqiad.wmnet']

and were ALL successful.

jcrespo updated the task description. Jun 27 2018, 11:21 AM

Change 442279 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications on db1067

https://gerrit.wikimedia.org/r/442279

Change 442279 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications on db1067

https://gerrit.wikimedia.org/r/442279

Vvjjkkii renamed this task from Failover db1052 (s1) db primary master to 64aaaaaaaa. Jul 1 2018, 1:04 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 64aaaaaaaa to Failover db1052 (s1) db primary master. Jul 1 2018, 8:12 PM
Marostegui lowered the priority of this task from High to Normal.
Marostegui updated the task description.
Marostegui updated the task description. Jul 4 2018, 12:32 PM

Change 443825 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/443825

Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)

Maybe it is just me, but is it a good idea to have as candidate a host that will be on the same row as the master? Maybe it is a temporary candidate only?

Change 443825 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/443825

Mentioned in SAL (#wikimedia-operations) [2018-07-04T12:46:23Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance - T197069 (duration: 02m 57s)

Change 443826 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1089.yaml: Change binlog format

https://gerrit.wikimedia.org/r/443826

Change 443826 merged by Marostegui:
[operations/puppet@production] db1089.yaml: Change binlog format

https://gerrit.wikimedia.org/r/443826

Mentioned in SAL (#wikimedia-operations) [2018-07-04T12:56:18Z] <marostegui> Stop MySQL and reboot db1089 to upgrade+change it to statement - T197069

Marostegui updated the task description. Jul 4 2018, 1:17 PM

Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)

Maybe it is just me, but is it a good idea to have as candidate a host that will be on the same row as the master? Maybe it is a temporary candidate only?

Sorry - I missed this.
I misread where db1067 is. I will replace db1089 with db1083 as the candidate once db1067 is the master.

Marostegui updated the task description. Jul 4 2018, 1:18 PM

I am starting with the checklist preparation in the etherpad - I will also start with the patches soon.

Marostegui updated the task description. Jul 12 2018, 7:06 AM

Change 445349 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1083.yaml: Change binlog format

https://gerrit.wikimedia.org/r/445349

Change 445349 merged by Marostegui:
[operations/puppet@production] db1083.yaml: Change binlog format

https://gerrit.wikimedia.org/r/445349

Change 445350 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1083

https://gerrit.wikimedia.org/r/445350

Change 445350 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1083

https://gerrit.wikimedia.org/r/445350

So, I have restarted db1083 with binlog format = STATEMENT.
This host is ready to be the candidate master once db1067 is the new master.

Right now in s1 we have two candidate masters:

db1089 -> row C (same row as db1067)
db1083 -> row B (a different row from db1067, but this row still needs the switch maintenance, so let's keep db1089 as candidate until the switch maintenance is done. Once it is done, we can revert db1089 to ROW format so that the master and the candidate are in different rows).
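
The row reasoning above can be sketched in a few lines. This is a minimal illustration only: the function name and the `ROWS` mapping are hypothetical, the mapping is filled in from the hosts named in this comment, and it ignores the pending switch maintenance, which is the reason db1089 stays a candidate for now.

```python
from typing import Dict, List


def candidates_in_other_rows(master: str, rows: Dict[str, str], candidates: List[str]) -> List[str]:
    """Return the candidates that sit in a different rack row than the master."""
    master_row = rows[master]
    return [host for host in candidates if rows[host] != master_row]


# Host -> row mapping taken from the comment above (db1067 will be the new master).
ROWS = {"db1067": "C", "db1089": "C", "db1083": "B"}

print(candidates_in_other_rows("db1067", ROWS, ["db1089", "db1083"]))
```

With db1067 as master, only db1083 is left after the filter, which matches the plan to make it the long-term candidate.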

Marostegui updated the task description. Jul 12 2018, 7:27 AM

Change 445352 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1083 with low weight

https://gerrit.wikimedia.org/r/445352

Change 445352 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1083 with low weight

https://gerrit.wikimedia.org/r/445352

Change 445354 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1067 to s1 masters

https://gerrit.wikimedia.org/r/445354

Change 445363 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/445363

Change 445369 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set up s1 on read only

https://gerrit.wikimedia.org/r/445369

Change 445371 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1067 to master

https://gerrit.wikimedia.org/r/445371

Mentioned in SAL (#wikimedia-operations) [2018-07-16T14:52:31Z] <marostegui> Change expire_log_days on db1067 - https://phabricator.wikimedia.org/T197069

Mentioned in SAL (#wikimedia-operations) [2018-07-18T04:56:32Z] <marostegui> Starting s1 failover pre steps - https://phabricator.wikimedia.org/T197069

Change 445354 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1067 to s1 master

https://gerrit.wikimedia.org/r/445354

Change 445369 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set up s1 on read only

https://gerrit.wikimedia.org/r/445369

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:00:20Z] <marostegui> Starting s1 failover from db1052 to db1067 - T197069

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:01:31Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s1 on ready only for maintenance T197069 (duration: 01m 08s)

Change 445371 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1067 to master

https://gerrit.wikimedia.org/r/445371

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:04:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set db1067 as master T197069 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:07:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: read only OFF after failover T197069 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:08:39Z] <marostegui> s1 failover finished T197069

This went smoothly.
Read-only times (UTC):

Start: 06:01:31
Finish: 06:07:17
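
For reference, the length of the read-only window follows directly from the two timestamps above. A minimal sketch of the arithmetic (the helper name `read_only_seconds` is hypothetical, and both times are assumed to fall on the same day):

```python
from datetime import datetime


def read_only_seconds(start: str, finish: str, fmt: str = "%H:%M:%S") -> int:
    """Return the number of seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


seconds = read_only_seconds("06:01:31", "06:07:17")
print(f"{seconds // 60}m {seconds % 60}s")  # prints "5m 46s"
```

So s1 was read-only for 346 seconds, a little under six minutes.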

Change 445363 merged by Marostegui:
[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/445363

Marostegui updated the task description. Jul 18 2018, 6:41 AM

Resolving this as it has all been done, including the clean-up tasks.

Marostegui closed this task as Resolved. Jul 18 2018, 6:43 AM