
Failover db1052 (s1) db primary master
Closed, ResolvedPublic

Description

s1 primary master db1052 needs to be decommissioned (T186320) and it is also blocking row B switch upgrade (T183585)

The candidate master is db1067.

This is a checklist of what needs to be done before the failover:

  • Verify db1067 has STATEMENT as binlog format
  • Upgrade micro-code and reboot on db1067

or

  • Reimage db1067 to stretch/10.1 and check the latest micro-code is installed and active (?)
  • Pick a date for the failover: July 18th (Wednesday) at 06:00AM UTC
  • Communicate with liaisons to handle the read only time (T197134)
  • Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)
  • Change this to db1083 once db1067 is the master and the network maintenance has been done (T197069#4418823). This step will be followed up at T199861
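
The first checklist item can be checked from the server variable itself. Below is a minimal sketch, assuming the rows have already been fetched with `SHOW GLOBAL VARIABLES LIKE 'binlog_format'` by whatever client is at hand. The function name and the sample rows are hypothetical and are not taken from the real hosts.

```python
def binlog_format_is_statement(rows):
    """Return True if the fetched rows report binlog_format = STATEMENT.

    `rows` is an iterable of (variable_name, value) tuples, the shape
    returned by SHOW GLOBAL VARIABLES LIKE 'binlog_format'.
    """
    for name, value in rows:
        if name.lower() == "binlog_format":
            return value.upper() == "STATEMENT"
    # Variable missing from the result set: treat the host as not verified.
    return False


# Hypothetical sample rows for illustration only.
print(binlog_format_is_statement([("binlog_format", "STATEMENT")]))
print(binlog_format_is_statement([("binlog_format", "ROW")]))
```

A missing variable is treated as a failed check rather than a pass, so a host is only ticked off when it positively reports STATEMENT.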

Event Timeline


I would like to suggest July 18th (Wednesday) at 06:00AM UTC as a failover date

Seems ok to me at first. I would also like to check for blockers for the parent task, even if they are not blockers for this subtask.

jcrespo updated the task description.

Change 442250 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1067 for reimage

https://gerrit.wikimedia.org/r/442250

Change 442250 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1067 for reimage

https://gerrit.wikimedia.org/r/442250

Change 442252 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: db1067: disable notifications and reinstall as stretch

https://gerrit.wikimedia.org/r/442252

Change 442252 merged by Jcrespo:
[operations/puppet@production] mariadb: db1067: disable notifications and reinstall as stretch

https://gerrit.wikimedia.org/r/442252

Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['db1067.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201806271016_jynus_25771.log.

Completed auto-reimage of hosts:

['db1067.eqiad.wmnet']

and were ALL successful.

Change 442279 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications on db1067

https://gerrit.wikimedia.org/r/442279

Change 442279 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications on db1067

https://gerrit.wikimedia.org/r/442279

Vvjjkkii renamed this task from Failover db1052 (s1) db primary master to 64aaaaaaaa. Jul 1 2018, 1:04 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 64aaaaaaaa to Failover db1052 (s1) db primary master. Jul 1 2018, 8:12 PM
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.

Change 443825 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/443825

Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)

Maybe it is just me, but having as candidate a host that will be on the same row, is it a good idea? Maybe it is a temporary candidate only?

Change 443825 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1089

https://gerrit.wikimedia.org/r/443825

Mentioned in SAL (#wikimedia-operations) [2018-07-04T12:46:23Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance - T197069 (duration: 02m 57s)

Change 443826 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1089.yaml: Change binlog format

https://gerrit.wikimedia.org/r/443826

Change 443826 merged by Marostegui:
[operations/puppet@production] db1089.yaml: Change binlog format

https://gerrit.wikimedia.org/r/443826

Mentioned in SAL (#wikimedia-operations) [2018-07-04T12:56:18Z] <marostegui> Stop MySQL and reboot db1089 to upgrade+change it to statement - T197069

Pick and prepare a new candidate master (db1089 - row C, which has already been migrated to the new switch)

Maybe it is just me, but having as candidate a host that will be on the same row, is it a good idea? Maybe it is a temporary candidate only?

Sorry - I missed this.
I misread where db1067 is. I will replace db1089 with db1083 as the candidate once db1067 is the master.

I am starting with the checklist preparation in the etherpad - I will also start with the patches soon.

Change 445349 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1083.yaml: Change binlog format

https://gerrit.wikimedia.org/r/445349

Change 445349 merged by Marostegui:
[operations/puppet@production] db1083.yaml: Change binlog format

https://gerrit.wikimedia.org/r/445349

Change 445350 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1083

https://gerrit.wikimedia.org/r/445350

Change 445350 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1083

https://gerrit.wikimedia.org/r/445350

So, I have restarted db1083 with binlog format = STATEMENT.
This host is ready to be the candidate master once db1067 is the new master.

Right now in s1 we have two candidate masters:

db1089 -> row C (same row as db1067)
db1083 -> row B (a different row from db1067, but this row still needs the switch maintenance, so let's keep db1089 as candidate until that is done - afterwards we can revert db1089 to ROW format so that master and candidate are in different rows).
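
The reasoning above can be sketched as a small selection rule. This is only an illustration under stated assumptions: the `Replica` type, the `switch_done` flag and `pick_candidate` are hypothetical names, and the rule (STATEMENT binlog format, row already migrated, preferably a different row from the master) is a paraphrase of the comment rather than an existing tool.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    row: str
    binlog_format: str
    switch_done: bool  # hypothetical flag: row switch maintenance finished


def pick_candidate(master_row, replicas):
    """Pick a candidate master name, or None if no replica qualifies."""
    # Only STATEMENT replicas on an already migrated row are eligible.
    eligible = [r for r in replicas
                if r.binlog_format == "STATEMENT" and r.switch_done]
    # Prefer a replica in a different row from the master.
    for replica in eligible:
        if replica.row != master_row:
            return replica.name
    return eligible[0].name if eligible else None


# db1067 (the new master) is in row C; row B maintenance is still pending.
before = [Replica("db1089", "C", "STATEMENT", True),
          Replica("db1083", "B", "STATEMENT", False)]
# Same hosts once the row B switch maintenance has been done.
after = [Replica("db1089", "C", "STATEMENT", True),
         Replica("db1083", "B", "STATEMENT", True)]

print(pick_candidate("C", before))  # db1089
print(pick_candidate("C", after))   # db1083
```

With the row B maintenance pending, the rule falls back to db1089 even though it shares a row with db1067. Once the maintenance is done, db1083 wins because it sits in a different row.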

Change 445352 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1083 with low weight

https://gerrit.wikimedia.org/r/445352

Change 445352 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1083 with low weight

https://gerrit.wikimedia.org/r/445352

Change 445354 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1067 to s1 masters

https://gerrit.wikimedia.org/r/445354

Change 445363 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/445363

Change 445369 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set up s1 on read only

https://gerrit.wikimedia.org/r/445369

Change 445371 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1067 to master

https://gerrit.wikimedia.org/r/445371

Change 445354 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1067 to s1 master

https://gerrit.wikimedia.org/r/445354

Change 445369 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set up s1 on read only

https://gerrit.wikimedia.org/r/445369

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:00:20Z] <marostegui> Starting s1 failover from db1052 to db1067 - T197069

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:01:31Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s1 on ready only for maintenance T197069 (duration: 01m 08s)

Change 445371 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1067 to master

https://gerrit.wikimedia.org/r/445371

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:04:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set db1067 as master T197069 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2018-07-18T06:07:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: read only OFF after failover T197069 (duration: 00m 53s)

This was smoothly done.
Read only times:

Start: 06:01:31
Finish: 06:07:17
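
For reference, the length of the read only window follows directly from these two timestamps. A minimal sketch, assuming both times fall on the same day (the helper name is hypothetical):

```python
from datetime import datetime


def read_only_seconds(start, finish, fmt="%H:%M:%S"):
    """Return the whole seconds between two same-day clock times."""
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


seconds = read_only_seconds("06:01:31", "06:07:17")
minutes, rest = divmod(seconds, 60)
print(f"{minutes}m {rest}s")  # 5m 46s
```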

Change 445363 merged by Marostegui:
[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/445363

Resolving this as it has all been done - including the clean up tasks.