
Failover s2 primary master
Closed, Resolved, Public

Description

s2 primary master db1054 has BBU issues (T194867) and it also needs to be decommissioned (T186320).

The candidate master is db1066.

This is a checklist of what needs to be done before the failover:

  • Move db1066 to a different rack (T193847)
  • Manually fail disk #6 on db1066 and get it replaced (T194955)
  • Finish reimaging the s2 replicas to stretch (only db2035 and db1074 are missing)
  • Pick a date for the failover: 13th June - 06:00AM UTC - 06:30AM UTC
  • Communicate with liaisons to handle the read-only time (T195487)
  • Prepare and do the actual failover

Event Timeline

Marostegui created this task.
Marostegui moved this task from Triage to Pending comment on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2018-05-18T13:59:56Z] <marostegui> Manually fail disk #6 on db1066 - T194870

Change 434339 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1074 for maintenance

https://gerrit.wikimedia.org/r/434339

Change 434339 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1074 for maintenance

https://gerrit.wikimedia.org/r/434339

Marostegui updated the task description.

We have agreed this will be done on the 13th of June 2018, from 06:00 UTC to 06:30 UTC.

Change 438200 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1066

https://gerrit.wikimedia.org/r/438200

Change 438200 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1066

https://gerrit.wikimedia.org/r/438200

Mentioned in SAL (#wikimedia-operations) [2018-06-08T08:18:44Z] <marostegui> Stop MySQL and reboot db1066 for intel-microcode install - T194870

The candidate master db1066 has been rebooted to pick up the intel-microcode update before the failover.

Change 439530 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1066 to master

https://gerrit.wikimedia.org/r/439530

Change 439531 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set s2 as read only

https://gerrit.wikimedia.org/r/439531

Change 439532 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1066 to master and remove read only

https://gerrit.wikimedia.org/r/439532

Change 439533 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s2-master CNAME

https://gerrit.wikimedia.org/r/439533

Change 439534 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s2.hosts: db1066 is now s2 primary master

https://gerrit.wikimedia.org/r/439534

The draft with the steps and the patches is now done.
@jcrespo please review them! Thanks
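For reviewers, the five prepared patches can be summarised as a small structure. This is only a sketch: the `Change` type and the `by_repo` helper are hypothetical names introduced here for illustration, and the list order is simply the upload order shown in the timeline, not a statement about the order in which the patches must be merged.

```python
from typing import NamedTuple


class Change(NamedTuple):
    """One Gerrit change prepared for the s2 failover (illustrative type)."""
    number: int
    repo: str
    subject: str


# Change numbers, repositories and subjects are copied from the timeline above.
PREPARED_CHANGES = [
    Change(439530, "operations/puppet", "mariadb: Promote db1066 to master"),
    Change(439531, "operations/mediawiki-config", "db-eqiad.php: Set s2 as read only"),
    Change(439532, "operations/mediawiki-config",
           "db-eqiad.php: Promote db1066 to master and remove read only"),
    Change(439533, "operations/dns", "wmnet: Update s2-master CNAME"),
    Change(439534, "operations/software", "s2.hosts: db1066 is now s2 primary master"),
]


def by_repo(changes):
    """Group change numbers by repository, preserving list order."""
    grouped = {}
    for change in changes:
        grouped.setdefault(change.repo, []).append(change.number)
    return grouped


for change in PREPARED_CHANGES:
    print(f"https://gerrit.wikimedia.org/r/{change.number}  [{change.repo}] {change.subject}")
```

Grouping by repository makes it easy to see that two of the patches touch the same file (`db-eqiad.php`), so they have to be merged and synced one after the other.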

Mentioned in SAL (#wikimedia-operations) [2018-06-13T05:11:05Z] <marostegui> Starting topology changes in order to get ready for s2 failover - T194870

Change 439530 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1066 to master

https://gerrit.wikimedia.org/r/439530

Change 439531 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s2 as read only

https://gerrit.wikimedia.org/r/439531

Mentioned in SAL (#wikimedia-operations) [2018-06-13T06:00:16Z] <marostegui> Starting s2 failover from db1054 to db1066 - T194870

Mentioned in SAL (#wikimedia-operations) [2018-06-13T06:01:35Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s2 on read-only for primary db master maintnance - T194870 (duration: 01m 08s)

Change 439532 merged by Marostegui:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1066 to master and remove read only

https://gerrit.wikimedia.org/r/439532

Mentioned in SAL (#wikimedia-operations) [2018-06-13T06:05:31Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove read only from s2 - T194870 (duration: 00m 34s)

Mentioned in SAL (#wikimedia-operations) [2018-06-13T06:08:41Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove read only from s2 - T194870 (duration: 00m 33s)

Change 439534 merged by jenkins-bot:
[operations/software@master] s2.hosts: db1066 is now s2 primary master

https://gerrit.wikimedia.org/r/439534

Change 439533 merged by Marostegui:
[operations/dns@master] wmnet: Update s2-master CNAME

https://gerrit.wikimedia.org/r/439533

This was completed.
Read-only time started at 06:01 UTC.
Read-only time finished at 06:08 UTC.
Total read-only time was around 7 minutes.
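The figure can be cross-checked against the SAL entries in the timeline. The sketch below assumes the read-only period runs from the "Set s2 on read-only" sync entry (06:01:35Z) to the last "Remove read only from s2" sync entry (06:08:41Z). SAL entries are logged when a sync completes, so the real boundaries may differ by the sync durations. The helper name `parse_sal` is made up for this example.

```python
from datetime import datetime, timezone


def parse_sal(timestamp: str) -> datetime:
    """Parse a SAL-style UTC timestamp such as 2018-06-13T06:01:35Z."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Timestamps copied from the SAL entries in the timeline.
read_only_synced = parse_sal("2018-06-13T06:01:35Z")   # "Set s2 on read-only" sync logged
read_only_removed = parse_sal("2018-06-13T06:08:41Z")  # last "Remove read only" sync logged

# Agreed maintenance window: 13th June 2018, 06:00 to 06:30 UTC.
window_start = parse_sal("2018-06-13T06:00:00Z")
window_end = parse_sal("2018-06-13T06:30:00Z")

read_only_duration = read_only_removed - read_only_synced
within_window = window_start <= read_only_synced and read_only_removed <= window_end

print(f"read-only duration: {read_only_duration}")  # 0:07:06
print(f"within agreed window: {within_window}")     # True
```

Under that assumption the read-only period is 7 minutes and 6 seconds, which matches the "around 7 minutes" reported above and falls inside the agreed window.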

Marostegui renamed this task from 9tcaaaaaaa to Failover s2 primary master. (Jul 2 2018, 5:20 AM)
Marostegui closed this task as Resolved.
Marostegui claimed this task.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.