Failover db1052 (s1) db primary master
Open, Normal, Public

Description

The s1 primary master, db1052, needs to be decommissioned (T186320), and it is also blocking the row B switch upgrade (T183585).

The candidate master is db1067.

This is a checklist of what needs to be done before the failover:

  • Verify db1067 has STATEMENT as binlog format
  • Upgrade micro-code and reboot on db1067

or

  • Reimage db1067 to stretch/10.1 and check the latest micro-code is installed and active (?)
  • Pick a date for the failover: July 18th (Wednesday) at 06:00 AM UTC
  • Communicate with the liaisons to handle the read-only time (T197134)
  • Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)
      • Change this to db1083 once db1067 is the master and the network maintenance has been done (T197069#4418823)
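
A minimal sketch of the first checklist item, assuming each host's global variables have already been collected (for example with SHOW GLOBAL VARIABLES LIKE 'binlog_format') into a dictionary. The function names and the host names host-a and host-b are hypothetical and used only for illustration; this is not the tooling used for this failover.

```python
from typing import List, Mapping

# The checklist requires STATEMENT on the candidate master.
REQUIRED_BINLOG_FORMAT = "STATEMENT"


def binlog_format_ok(global_variables: Mapping[str, str]) -> bool:
    """Return True if the reported binlog_format is STATEMENT."""
    value = global_variables.get("binlog_format", "")
    return value.strip().upper() == REQUIRED_BINLOG_FORMAT


def hosts_not_ready(reported: Mapping[str, Mapping[str, str]]) -> List[str]:
    """Return the sorted names of hosts that fail the binlog format check."""
    return sorted(
        host for host, variables in reported.items()
        if not binlog_format_ok(variables)
    )


# Hypothetical hosts and values, not real output from any server.
example = {
    "host-a": {"binlog_format": "STATEMENT"},
    "host-b": {"binlog_format": "ROW"},
}
print(hosts_not_ready(example))
```

A host missing the variable altogether is reported as not ready, which seems the safer default before a failover.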
Marostegui triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Jun 13 2018, 8:22 AM
Marostegui moved this task from Triage to Backlog on the DBA board. Jun 13 2018, 8:23 AM

I would like to suggest July 18th (Wednesday) at 06:00 AM UTC as the failover date.

Marostegui moved this task from Backlog to Next on the DBA board.
Marostegui added a subscriber: ayounsi.

Seems ok to me at first. I would also like to check for blockers for the parent task, even if they are not blockers for this subtask.

Marostegui updated the task description. Jun 13 2018, 8:28 AM
jcrespo updated the task description. Jun 13 2018, 8:28 AM
jcrespo updated the task description.
Marostegui updated the task description. Jun 13 2018, 8:44 AM
Marostegui updated the task description. Jun 13 2018, 9:20 AM
Marostegui updated the task description. Jun 13 2018, 3:43 PM

Change 442250 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1067 for reimage

https://gerrit.wikimedia.org/r/442250

Change 442250 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1067 for reimage

https://gerrit.wikimedia.org/r/442250

Change 442252 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: db1067: disable notifications and reinstall as stretch

https://gerrit.wikimedia.org/r/442252

Change 442252 merged by Jcrespo:
[operations/puppet@production] mariadb: db1067: disable notifications and reinstall as stretch

https://gerrit.wikimedia.org/r/442252

Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['db1067.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806271016_jynus_25771.log.

Completed auto-reimage of hosts:

['db1067.eqiad.wmnet']

and were ALL successful.

jcrespo updated the task description. Wed, Jun 27, 11:21 AM

Change 442279 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications on db1067

https://gerrit.wikimedia.org/r/442279

Change 442279 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications on db1067

https://gerrit.wikimedia.org/r/442279

Vvjjkkii renamed this task from Failover db1052 (s1) db primary master to 64aaaaaaaa. Sun, Jul 1, 1:04 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 64aaaaaaaa to Failover db1052 (s1) db primary master. Sun, Jul 1, 8:12 PM
Marostegui lowered the priority of this task from High to Normal.
Marostegui updated the task description.
Marostegui updated the task description. Wed, Jul 4, 12:32 PM

Change 443825 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/443825

Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)

Maybe it is just me, but is it a good idea to have as candidate a host that will be in the same row as the master? Maybe it is only a temporary candidate?

Change 443825 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/443825

Mentioned in SAL (#wikimedia-operations) [2018-07-04T12:46:23Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance - T197069 (duration: 02m 57s)

Change 443826 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1089.yaml: Change binlog format

https://gerrit.wikimedia.org/r/443826

Change 443826 merged by Marostegui:
[operations/puppet@production] db1089.yaml: Change binlog format

https://gerrit.wikimedia.org/r/443826

Mentioned in SAL (#wikimedia-operations) [2018-07-04T12:56:18Z] <marostegui> Stop MySQL and reboot db1089 to upgrade+change it to statement - T197069

Marostegui updated the task description. Wed, Jul 4, 1:17 PM

Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)

Maybe it is just me, but is it a good idea to have as candidate a host that will be in the same row as the master? Maybe it is only a temporary candidate?

Sorry - I missed this.
I misread where db1067 is. I will replace it with db1083 once db1067 is the master.

Marostegui updated the task description. Wed, Jul 4, 1:18 PM

I am starting with the checklist preparation in the etherpad - I will also start with the patches soon.

Marostegui updated the task description. Thu, Jul 12, 7:06 AM

Change 445349 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1083.yaml: Change binlog format

https://gerrit.wikimedia.org/r/445349

Change 445349 merged by Marostegui:
[operations/puppet@production] db1083.yaml: Change binlog format

https://gerrit.wikimedia.org/r/445349

Change 445350 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1083

https://gerrit.wikimedia.org/r/445350

Change 445350 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1083

https://gerrit.wikimedia.org/r/445350

So, I have restarted db1083 with binlog format = STATEMENT.
This host is ready to be the candidate master once db1067 is the new master.

Right now in s1 we have two candidate masters:

db1089 -> row C (same row as db1067)
db1083 -> row B (a different row from db1067, but this row still needs the switch maintenance, so let's keep db1089 as candidate until the switch maintenance is done. Once it is done, we can revert db1089 to ROW format so that the master and the candidate are in different rows).
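
The reasoning above can be sketched as a small selection rule: only consider replicas running STATEMENT, skip rows that still have switch maintenance pending, and prefer a row different from the master's. This is only an illustration under those assumptions. The Host class and pick_candidate function are hypothetical names, db1067 is assumed to run STATEMENT (the checklist item to verify), and the rows are taken from the comment above.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Host:
    name: str
    row: str
    binlog_format: str


def pick_candidate(master: Host, replicas: List[Host],
                   rows_pending_maintenance: Set[str]) -> Optional[Host]:
    """Pick a candidate master, preferring a row different from the master's."""
    eligible = [
        h for h in replicas
        if h.binlog_format == "STATEMENT"
        and h.row not in rows_pending_maintenance
    ]
    # False sorts before True, so hosts in a different row come first.
    eligible.sort(key=lambda h: (h.row == master.row, h.name))
    return eligible[0] if eligible else None


master = Host("db1067", "C", "STATEMENT")
replicas = [
    Host("db1089", "C", "STATEMENT"),
    Host("db1083", "B", "STATEMENT"),
]

# While row B still needs the switch maintenance, db1089 is the only option.
print(pick_candidate(master, replicas, {"B"}).name)
# Once the maintenance is done, db1083 is preferred (different row).
print(pick_candidate(master, replicas, set()).name)
```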

Marostegui updated the task description. Thu, Jul 12, 7:27 AM

Change 445352 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1083 with low weight

https://gerrit.wikimedia.org/r/445352

Change 445352 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1083 with low weight

https://gerrit.wikimedia.org/r/445352

Change 445354 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1067 to s1 masters

https://gerrit.wikimedia.org/r/445354

Change 445363 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/445363

Change 445369 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set up s1 on read only

https://gerrit.wikimedia.org/r/445369

Change 445371 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1067 to master

https://gerrit.wikimedia.org/r/445371

Mentioned in SAL (#wikimedia-operations) [2018-07-16T14:52:31Z] <marostegui> Change expire_log_days on db1067 - https://phabricator.wikimedia.org/T197069