
Switchover s4 (commonswiki) from db1081 to db1138
Closed, Resolved, Public

Description

db1081, currently the s4 (commonswiki) primary master, is on the list of hosts that might suffer a BBU crash at any time (T258386).
We need to promote db1138 to primary master in its place.

When: Tue 26th January, 07:00 AM UTC - 07:15 AM UTC

Checklist:

  • Restart db1138 to pick up report_host T271106
  • Create a task to communicate the chosen date and send an announcement to the community

NEW master: db1138
OLD master: db1081

  • Check configuration differences between new and old master

pt-config-diff h=db1081.eqiad.wmnet,F=/root/.my.cnf h=db1138.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Set NEW master with weight 0 in s4

dbctl instance db1138 edit
dbctl config commit -m "Set db1138 with weight 0 T271427"

  • Topology changes, connect everything to db1138

db-switchover --timeout=15 --only-slave-move db1081.eqiad.wmnet db1138.eqiad.wmnet

Failover:

  • Start the failover

!log Starting s4 eqiad failover from db1081 to db1138 - T271427

  • Read only on s4

dbctl --scope eqiad section s4 ro "Maintenance till 07:15AM UTC T271427" && dbctl config commit -m "Set s4 as read-only for maintenance T271427"

  • Check s4 is indeed on read only
  • Run the switchover script from cumin1001:

db-switchover --skip-slave-move db1081 db1138 ; echo db1081; mysql.py -hdb1081 -e "show slave status\G" ; echo db1138 ; mysql.py -hdb1138 -e "show slave status\G"

  • Promote db1138 as new master and remove read-only

dbctl --scope eqiad section s4 set-master db1138 && dbctl --scope eqiad section s4 rw && dbctl config commit -m "Promote db1138 to s4 master and remove read-only from s4 T271427"

  • Restart puppet on old and new masters (for heartbeat): db1138 and db1081

run-puppet-agent -e "switchover to db1138"

  • Give weight to db1081 in s4

dbctl instance db1081 edit

  • Not done: db1081 was left depooled
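
The two `show slave status\G` calls in the switchover step are there to confirm the new topology: the new master should report no replication configuration, and the old master should be replicating from it. A minimal sketch of how that output could be summarised automatically follows. The helper name `replication_state` is hypothetical and not part of the WMF tooling, and the sample input is illustrative rather than captured from the real hosts.

```shell
# Hypothetical helper: summarise `show slave status\G` output read from stdin.
# Prints "not-a-replica" when no replication threads are reported (expected on
# the new master), "replicating from <host>" when both threads say Yes, and
# "broken (...)" otherwise.
replication_state() {
  awk -F': ' '
    /Master_Host:/       { host = $2 }
    /Slave_IO_Running:/  { io = $2 }
    /Slave_SQL_Running:/ { sql = $2 }
    END {
      if (io == "" && sql == "") print "not-a-replica"
      else if (io == "Yes" && sql == "Yes") print "replicating from " host
      else print "broken (IO=" io ", SQL=" sql ")"
    }'
}

# Illustrative input only, shaped like the \G output for the old master.
printf '%s\n' \
  '               Slave_IO_State: Waiting for master to send event' \
  '                  Master_Host: db1138.eqiad.wmnet' \
  '             Slave_IO_Running: Yes' \
  '            Slave_SQL_Running: Yes' | replication_state
```

Against live hosts this would be used as `mysql.py -hdb1081 -e "show slave status\G" | replication_state`, reusing the invocation from the checklist.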

Clean up tasks:

  • change events for query killer:
events_coredb_master.sql on the new master db1138
events_coredb_slave.sql on the new slave db1081
dbctl instance db1138 set-candidate-master --section s4 false
dbctl instance db1081 set-candidate-master --section s4 true

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-01-12T10:13:19Z] <marostegui> Restart mysql on db1138 to pick up new config T271427 T271106

Upgraded the kernel on db1138 as part of: T272255

Change 658211 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1138 to s4 master

https://gerrit.wikimedia.org/r/658211

Change 658213 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s4-master CNAME

https://gerrit.wikimedia.org/r/658213

Mentioned in SAL (#wikimedia-operations) [2021-01-26T05:43:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set candidate master to weight 0 before the failover T271427', diff saved to https://phabricator.wikimedia.org/P13952 and previous config saved to /var/cache/conftool/dbconfig/20210126-054337-marostegui.json

Change 658211 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1138 to s4 master

https://gerrit.wikimedia.org/r/658211

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:00:22Z] <marostegui> Starting s4 eqiad failover from db1081 to db1138 - T271427

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:00:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s4 as read-only for maintenance T271427', diff saved to https://phabricator.wikimedia.org/P13953 and previous config saved to /var/cache/conftool/dbconfig/20210126-070037-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:01:53Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1138 to s4 master and remove read-only from s4 T271427', diff saved to https://phabricator.wikimedia.org/P13954 and previous config saved to /var/cache/conftool/dbconfig/20210126-070152-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:04:43Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1081 (s4 old master) - T271427', diff saved to https://phabricator.wikimedia.org/P13955 and previous config saved to /var/cache/conftool/dbconfig/20210126-070443-marostegui.json

Change 658213 merged by Marostegui:
[operations/dns@master] wmnet: Update s4-master CNAME

https://gerrit.wikimedia.org/r/658213

This was done successfully
read only on: 07:00:37
read only off: 07:01:52

total read only time: 1 minute 15 seconds
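
For reference, the read-only duration can be recomputed from the two timestamps above. This is a small sketch; `to_seconds` is a hypothetical helper, and it assumes both timestamps fall on the same UTC day.

```shell
# Hypothetical helper: convert HH:MM:SS to seconds since midnight.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

ro_on=$(to_seconds "07:00:37")   # read only on
ro_off=$(to_seconds "07:01:52")  # read only off
elapsed=$(( ro_off - ro_on ))

# Prints: read-only for 1 min 15 s (75 s total)
printf 'read-only for %d min %d s (%d s total)\n' \
  $(( elapsed / 60 )) $(( elapsed % 60 )) "$elapsed"
```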

The automatic update to the zarcillo database failed, so it was done manually. We are investigating and following up on IRC about it. Not a big deal.

Marostegui updated the task description.
Marostegui added subscribers: Kormat, jcrespo.

Thanks @jcrespo and @Kormat for the support!