Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC
Closed, Resolved (Public)

Description

db1075 is on A2, which will be involved in the PDU maintenance T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC)
We need to fail over db1075 to db1123, which is on D8.

Date&Time: 24th September at 05:00 UTC

A read-only window will be required.

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.

Change 538003 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1078 to s3 master

https://gerrit.wikimedia.org/r/538003

Change 538004 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s3-master alias to point to db1078

https://gerrit.wikimedia.org/r/538004

db1075 (the current master) crashed yesterday with BBU issues T233534: db1075 (s3 master) crashed - BBU failure.
db1078 is also part of the same batch of hosts that have had BBU issues (T233569), so my idea is to fail over db1075 to db1078 as originally planned, get db1075's BBU replaced once the new BBU arrives {T231670}, and then fail back to db1075.

db1123 (an s3 slave, currently serving recentchanges, logpager etc.) is in D8, so it is not affected by the PDU maintenance. Maybe we should fail over to this host instead of db1078, as db1078 is listed in T233569: Batch db1074-db1079 hosts having BBU issues.

Marostegui renamed this task from Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC to Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC. (Sep 23 2019, 6:43 AM)
Marostegui updated the task description.

Change 538470 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1123: Change binlog format to STATEMENT

https://gerrit.wikimedia.org/r/538470

Change 538470 merged by Marostegui:
[operations/puppet@production] db1123: Change binlog format to STATEMENT

https://gerrit.wikimedia.org/r/538470

Mentioned in SAL (#wikimedia-operations) [2019-09-23T07:06:29Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1123 to change binlog format T230783', diff saved to https://phabricator.wikimedia.org/P9145 and previous config saved to /var/cache/conftool/dbconfig/20190923-070628-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-23T07:08:10Z] <marostegui> Stop MySQL on db1123 to reboot to change binlog format and kernel - T230783

Change 538003 abandoned by Marostegui:
mariadb: Promote db1078 to s3 master

Reason:
going to promote db1123 instead, will do it in a different patch

https://gerrit.wikimedia.org/r/538003

Change 538522 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1123 to s3 master

https://gerrit.wikimedia.org/r/538522

Mentioned in SAL (#wikimedia-operations) [2019-09-24T04:13:01Z] <marostegui> Start pre switchover steps - T230783

Mentioned in SAL (#wikimedia-operations) [2019-09-24T04:21:22Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set weight 0 to db1123 T230783', diff saved to https://phabricator.wikimedia.org/P9156 and previous config saved to /var/cache/conftool/dbconfig/20190924-042121-marostegui.json

Change 538522 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1123 to s3 master

https://gerrit.wikimedia.org/r/538522

Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:00:14Z] <marostegui> Starting s3 failover from db1075 to db1123 - T230783

Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:00:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s3 as read-only for maintenance T230783', diff saved to https://phabricator.wikimedia.org/P9157 and previous config saved to /var/cache/conftool/dbconfig/20190924-050034-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:10:14Z] <cdanis> T230783 mark DEFAULT not s3 as readonly in etcd etcd dbconfig data

Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:11:49Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1123 to s3 master and remove read-only from s3 T230783', diff saved to https://phabricator.wikimedia.org/P9158 and previous config saved to /var/cache/conftool/dbconfig/20190924-051147-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:13:08Z] <cdanis@cumin1001> dbctl commit (dc=all): 're-do T230783 master promotion and set read-write', diff saved to https://phabricator.wikimedia.org/P9159 and previous config saved to /var/cache/conftool/dbconfig/20190924-051307-cdanis.json

Change 538004 merged by Marostegui:
[operations/dns@master] wmnet: Update s3-master alias to point to db1123

https://gerrit.wikimedia.org/r/538004

This was done successfully.

Read-only start: 05:10:14 UTC
Read-only stop: 05:13:08 UTC

Total read-only time: 2 minutes 54 seconds.
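
For anyone who wants to double-check the read-only window, here is a minimal sketch of the arithmetic. It only assumes the two timestamps quoted above (taken from the 05:10:14 and 05:13:08 SAL entries); the helper name `read_only_duration` is made up for illustration and is not part of any Wikimedia tooling.

```python
from datetime import datetime, timezone


def read_only_duration(start: str, stop: str) -> str:
    """Return the time between two SAL-style UTC timestamps as 'M minutes S seconds'."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # same layout as the SAL entries, e.g. 2019-09-24T05:10:14Z
    t0 = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(stop, fmt).replace(tzinfo=timezone.utc)
    total_seconds = int((t1 - t0).total_seconds())
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} minutes {seconds} seconds"


# Read-only start and stop as reported in this task.
print(read_only_duration("2019-09-24T05:10:14Z", "2019-09-24T05:13:08Z"))
```

The same helper could be pointed at the timestamps of earlier switchovers to compare read-only windows, assuming those are recorded in the same format.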

We had a slightly longer read-only time than in all the previous switchovers, due to some issues with the way we set read-only. Those will be followed up at T233679.

Thanks everyone who helped out!