
Switchover s4 (commonswiki) from db1081 to db1138
Closed, Resolved (Public)

Description

db1081, currently the s4 (commonswiki) primary master, is on the list of hosts that might suffer a BBU crash at any time (T258386).
We need to promote db1138 to primary master in its place.

When: Tue 26th January 07:00 AM UTC - 07:15 AM UTC

Checklist:

  • Restart db1138 to pick up report_host T271106
  • Create a task to communicate the chosen date and send an announcement to the community

NEW master: db1138
OLD master: db1081

  • Check configuration differences between new and old master

pt-config-diff h=db1081.eqiad.wmnet,F=/root/.my.cnf h=db1138.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Set NEW master with weight 0 in s4

dbctl instance db1138 edit
dbctl config commit -m "Set db1138 with weight 0 T271427"

  • Topology changes, connect everything to db1138

db-switchover --timeout=15 --only-slave-move db1081.eqiad.wmnet db1138.eqiad.wmnet

Failover:

  • Start the failover

!log Starting s4 eqiad failover from db1081 to db1138 - T271427

  • Read only on s4

dbctl --scope eqiad section s4 ro "Maintenance till 07:15 AM UTC T271427" && dbctl config commit -m "Set s4 as read-only for maintenance T271427"

  • Check s4 is indeed on read only
  • Run switchover script from cumin1001:

db-switchover --skip-slave-move db1081 db1138 ; echo db1081; mysql.py -hdb1081 -e "show slave status\G" ; echo db1138 ; mysql.py -hdb1138 -e "show slave status\G"

  • Promote db1138 as new master and remove read-only

dbctl --scope eqiad section s4 set-master db1138 && dbctl --scope eqiad section s4 rw && dbctl config commit -m "Promote db1138 to s4 master and remove read-only from s4 T271427"

  • Restart puppet on old and new masters (for heartbeat): db1138 and db1081

run-puppet-agent -e "switchover to db1138"

  • Give weight to db1081 in s4

dbctl instance db1081 edit

  • db1081 left depooled
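
The two `show slave status\G` checks in the switchover step can be turned into a pass/fail test. A minimal sketch, assuming the standard MariaDB fields Slave_IO_Running and Slave_SQL_Running; the helper name replication_running and the sample excerpt are made up for illustration and are not output from this failover:

```shell
# Hypothetical helper: reads "show slave status\G" output on stdin and
# succeeds only when both replication threads report "Yes".
replication_running() {
    awk -F': *' '
        /Slave_IO_Running:/  { io = $2 }
        /Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Made-up excerpt, for illustration only (not output from db1081).
sample='Slave_IO_Running: Yes
Slave_SQL_Running: Yes'

if printf '%s\n' "$sample" | replication_running; then
    echo "replication threads running"
else
    echo "replication NOT running"
fi
```

On cumin1001 the input would presumably be piped from mysql.py -hdb1081 -e "show slave status\G". Empty input makes the helper fail, so it is only meaningful for a host that is expected to replicate, such as the new slave db1081.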

Clean up tasks:

  • Change events for query killer:
      events_coredb_master.sql on the new master db1138
      events_coredb_slave.sql on the new slave db1081

  • Update candidate master flags:

dbctl instance db1138 set-candidate-master --section s4 false
dbctl instance db1081 set-candidate-master --section s4 true

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-01-12T10:13:19Z] <marostegui> Restart mysql on db1138 to pick up new config T271427 T271106

Upgraded the kernel on db1138 as part of: T272255

Added to the deployments calendar

Change 658211 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1138 to s4 master

https://gerrit.wikimedia.org/r/658211

Change 658213 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s4-master CNAME

https://gerrit.wikimedia.org/r/658213

Mentioned in SAL (#wikimedia-operations) [2021-01-26T05:43:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set candidate master to weight 0 before the failover T271427', diff saved to https://phabricator.wikimedia.org/P13952 and previous config saved to /var/cache/conftool/dbconfig/20210126-054337-marostegui.json

Change 658211 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1138 to s4 master

https://gerrit.wikimedia.org/r/658211

Pre-failover steps are done.

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:00:22Z] <marostegui> Starting s4 eqiad failover from db1081 to db1138 - T271427

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:00:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s4 as read-only for maintenance T271427', diff saved to https://phabricator.wikimedia.org/P13953 and previous config saved to /var/cache/conftool/dbconfig/20210126-070037-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:01:53Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1138 to s4 master and remove read-only from s4 T271427', diff saved to https://phabricator.wikimedia.org/P13954 and previous config saved to /var/cache/conftool/dbconfig/20210126-070152-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-01-26T07:04:43Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1081 (s4 old master) - T271427', diff saved to https://phabricator.wikimedia.org/P13955 and previous config saved to /var/cache/conftool/dbconfig/20210126-070443-marostegui.json

Change 658213 merged by Marostegui:
[operations/dns@master] wmnet: Update s4-master CNAME

https://gerrit.wikimedia.org/r/658213

This was done successfully
read only on: 07:00:37
read only off: 07:01:52

total read only time: 1 minute 15 seconds
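
The read-only window follows directly from the two timestamps above. A minimal sketch in POSIX shell; the function names to_seconds and read_only_seconds are hypothetical helpers, and both timestamps are assumed to fall on the same UTC day:

```shell
# Convert HH:MM:SS to seconds since midnight.
# Runs in a subshell so the IFS change does not leak to the caller.
to_seconds() (
    IFS=:
    set -- $1
    # Strip one leading zero so "08" and "09" are not parsed as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
)

# Seconds elapsed between two same-day timestamps (start, end).
read_only_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

ro_on="07:00:37"
ro_off="07:01:52"
total=$(read_only_seconds "$ro_on" "$ro_off")

printf 'total read only time: %d:%02d (%d seconds)\n' \
    "$((total / 60))" "$((total % 60))" "$total"
```

The difference is 75 seconds, which matches the 1:15 reported above.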

The update to the zarcillo database failed and was done manually; we are investigating and following up on IRC about it. Not a big deal.

Marostegui updated the task description.
Marostegui added subscribers: Kormat, jcrespo.

Thanks @jcrespo and @Kormat for the support!