db1075 is on A2, which will be involved in the PDU maintenance T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC)
We need to fail over db1075 to db1123, which is on D8.
Date and time: 24th September at 05:00 UTC
A read-only window will be required.
Status | Assigned | Task
---|---|---
Resolved | Cmjohnson | T226778 Install new PDUs in rows A/B (Top level tracking task)
Resolved | Jclark-ctr | T227138 a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC)
Resolved | Marostegui | T230783 Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC
Resolved | Trizek-WMF | T230788 Community Relations support needed for several read-only windows (s2, s3, s4 and s8)
Resolved | Jclark-ctr | T233534 db1075 (s3 master) crashed - BBU failure
Unknown Object (Task) | |
Declined | None | T233569 Batch db1074-db1079 hosts having BBU issues
Resolved | Kormat | T233684 Make primary DB masters page on HOST DOWN alert
Resolved | Marostegui | T322987 db2173 crashed and didn't alert
Resolved | Papaul | T322988 db2173 HW errors
Reserved window on the deployments calendar: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1837750&oldid=1837737
Change 538003 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1078 to s3 master
Change 538004 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s3-master alias to point to db1078
db1075 (the current master) crashed yesterday with BBU issues T233534: db1075 (s3 master) crashed - BBU failure.
db1078 is also part of the same batch of hosts that have had BBU issues (T233569), so my idea is to fail over db1075 to db1078 as originally planned, get db1075's BBU replaced once the new BBU arrives {T231670}, and then fail back to db1075.
db1123 (the current s3 slave for recentchanges, logpager, etc.) is in D8, so it is not affected by the PDU maintenance. Maybe we should fail over to this host instead of db1078, as db1078 is part of T233569: Batch db1074-db1079 hosts having BBU issues.
Change 538470 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1123: Change binlog format to STATEMENT
Change 538470 merged by Marostegui:
[operations/puppet@production] db1123: Change binlog format to STATEMENT
Mentioned in SAL (#wikimedia-operations) [2019-09-23T07:06:29Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1123 to change binlog format T230783', diff saved to https://phabricator.wikimedia.org/P9145 and previous config saved to /var/cache/conftool/dbconfig/20190923-070628-marostegui.json
Mentioned in SAL (#wikimedia-operations) [2019-09-23T07:08:10Z] <marostegui> Stop MySQL on db1123 to reboot to change binlog format and kernel - T230783
Change 538003 abandoned by Marostegui:
mariadb: Promote db1078 to s3 master
Reason:
going to promote db1123 instead, will do it in a different patch
Change 538522 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1123 to s3 master
Mentioned in SAL (#wikimedia-operations) [2019-09-24T04:13:01Z] <marostegui> Start pre switchover steps - T230783
Mentioned in SAL (#wikimedia-operations) [2019-09-24T04:21:22Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set weight 0 to db1123 T230783', diff saved to https://phabricator.wikimedia.org/P9156 and previous config saved to /var/cache/conftool/dbconfig/20190924-042121-marostegui.json
Change 538522 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1123 to s3 master
Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:00:14Z] <marostegui> Starting s3 failover from db1075 to db1123 - T230783
Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:00:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s3 as read-only for maintenance T230783', diff saved to https://phabricator.wikimedia.org/P9157 and previous config saved to /var/cache/conftool/dbconfig/20190924-050034-marostegui.json
Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:10:14Z] <cdanis> T230783 mark DEFAULT not s3 as readonly in etcd etcd dbconfig data
Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:11:49Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1123 to s3 master and remove read-only from s3 T230783', diff saved to https://phabricator.wikimedia.org/P9158 and previous config saved to /var/cache/conftool/dbconfig/20190924-051147-marostegui.json
Mentioned in SAL (#wikimedia-operations) [2019-09-24T05:13:08Z] <cdanis@cumin1001> dbctl commit (dc=all): 're-do T230783 master promotion and set read-write', diff saved to https://phabricator.wikimedia.org/P9159 and previous config saved to /var/cache/conftool/dbconfig/20190924-051307-cdanis.json
Change 538004 merged by Marostegui:
[operations/dns@master] wmnet: Update s3-master alias to point to db1123
This was done successfully.
Read-only start: 05:10:14 UTC
Read-only stop: 05:13:08 UTC
Total read-only time: 2 minutes 54 seconds.
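The read-only duration above can be double-checked from the two timestamps. This is only a minimal sketch: the helper name `read_only_duration` and the ISO-style timestamp format are assumptions made for illustration, not part of any tooling used during the switchover.

```python
from datetime import datetime, timezone

# Hypothetical helper (not part of dbctl or any WMF tooling): computes the
# length of a read-only window from two UTC timestamps.
def read_only_duration(start: str, stop: str) -> str:
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # assumed format, matching the SAL timestamps
    t0 = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(stop, fmt).replace(tzinfo=timezone.utc)
    total_seconds = int((t1 - t0).total_seconds())
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} minutes {seconds} seconds"

# Start and stop times as reported in this task.
print(read_only_duration("2019-09-24T05:10:14Z", "2019-09-24T05:13:08Z"))
```

With the start and stop times reported here, the difference is 174 seconds, which matches the 2 minutes 54 seconds stated above.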
We had a slightly longer read-only time than in all the previous switchovers, due to some issues with the way we set read-only. Those will be followed up at T233679.
Thanks everyone who helped out!