Switchover s7 from db1086 to db1136
Closed, Resolved (Public)

Description

db1086, currently the s7 primary master, is on the list of hosts that might suffer a BBU crash at any time (T258386). We also need to update its kernel (T273280).
We need to promote db1136 to primary master in its place.

When: 23rd March, 06:00 AM UTC

Checklist:

  • Double check db1136 has report_host enabled T271106
  • Create a task to communicate the chosen date and send an announcement to the community: T276899

NEW master: db1136
OLD master: db1086

  • Check configuration differences between new and old master

pt-config-diff h=db1086.eqiad.wmnet,F=/root/.my.cnf h=db1136.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Set NEW master to weight 0 in s7

dbctl instance db1136 edit
dbctl config commit -m "Set db1136 with weight 0 T274336"

  • Topology changes, connect everything to db1136

db-switchover --timeout=15 --only-slave-move db1086.eqiad.wmnet db1136.eqiad.wmnet

Failover:

  • Start the failover

!log Starting s7 eqiad failover from db1086 to db1136 - T274336

  • Read only on s7

dbctl --scope eqiad section s7 ro "Maintenance till 07:15 AM UTC" && dbctl config commit -m "Set s7 as read-only for maintenance T274336"

  • Check that s7 is indeed read-only
  • Run the switchover script from cumin1001:

db-switchover --skip-slave-move db1086 db1136 ; echo db1086; mysql.py -hdb1086 -e "show slave status\G" ; echo db1136 ; mysql.py -hdb1136 -e "show slave status\G"

  • Promote db1136 as new master and remove read-only

dbctl --scope eqiad section s7 set-master db1136 && dbctl --scope eqiad section s7 rw && dbctl config commit -m "Promote db1136 to s7 master and remove read-only from s7 T274336"

  • Restart puppet on old and new masters (for heartbeat): db1136 and db1086

run-puppet-agent -e "switchover to db1136"

  • Give weight to db1086 in s7

dbctl instance db1086 edit
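
The switchover step above prints `show slave status\G` for both hosts and leaves the reading to the operator. A minimal sketch of how that check could be automated follows. It assumes the expected end state is that db1086 replicates from db1136 with both replication threads running, and that `Master_Host` is reported as a fully qualified name. The function name `check_replicates_from` is hypothetical and is not part of db-switchover or mysql.py.

```shell
#!/bin/sh
# Read "show slave status\G" output on stdin and succeed only if the host
# replicates from the expected master with both replication threads running.
# $1 = expected master hostname (assumed to match Master_Host exactly).
check_replicates_from() {
    awk -v want="$1" '
        # Vertical output has one "Field: value" pair per line.
        $1 == "Master_Host:"       { host = $2 }
        $1 == "Slave_IO_Running:"  { io = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END {
            if (host == want && io == "Yes" && sql == "Yes") {
                print "OK: replicating from " host
                exit 0
            }
            printf "NOT OK: Master_Host=%s IO=%s SQL=%s\n", host, io, sql
            exit 1
        }'
}

# Possible usage from cumin1001 (mysql.py invocation copied from the checklist):
#   mysql.py -hdb1086 -e "show slave status\G" | check_replicates_from db1136.eqiad.wmnet
```

The exact match on the field name keeps `Slave_SQL_Running_State` from being mistaken for `Slave_SQL_Running`. Empty output, as on a host that is not replicating at all, is reported as NOT OK.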

Clean up tasks:

  • Change events for query killer:
    events_coredb_master.sql on the new master db1136
    events_coredb_slave.sql on the new slave db1086

dbctl instance db1136 set-candidate-master --section s7 false
dbctl instance db1086 set-candidate-master --section s7 true

Event Timeline

Marostegui added a subscriber: MoritzMuehlenhoff.

Waiting for the new kernel to be released for Stretch @MoritzMuehlenhoff

I have scheduled this for 23rd March at 06:00 AM UTC

Reserved maintenance window on the Deployments' calendar

~# mysql.py -hdb1136 -e "select @@report_host"
+--------------------+
| @@report_host      |
+--------------------+
| db1136.eqiad.wmnet |
+--------------------+
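
The check above was read by eye. A small sketch of how the same comparison could be scripted follows. The helper name `check_report_host` is hypothetical. It accepts either the boxed table shown above or plain tab-separated output, since the client may switch to the latter when its output is piped.

```shell
#!/bin/sh
# Compare the value returned by "select @@report_host" with an expected FQDN.
# $1 = expected FQDN; the query output is read from stdin.
check_report_host() {
    expected="$1"
    # Skip border lines, strip pipes and whitespace, keep the last non-empty
    # line: that is the value row in both boxed and tab-separated output.
    actual="$(awk '!/^\+/ { gsub(/[| \t]/, ""); if ($0 != "") v = $0 } END { print v }')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: report_host is $actual"
    else
        echo "MISMATCH: report_host is '${actual:-<empty>}', expected '$expected'" >&2
        return 1
    fi
}

# Possible usage (mysql.py invocation copied from the comment above):
#   mysql.py -hdb1136 -e "select @@report_host" | check_report_host db1136.eqiad.wmnet
```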

Change 673195 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1136 to s7 master

https://gerrit.wikimedia.org/r/673195

Change 673196 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s7-master cname

https://gerrit.wikimedia.org/r/673196

Mentioned in SAL (#wikimedia-operations) [2021-03-23T05:12:10Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set weight 0 to db1136 before failover T274336', diff saved to https://phabricator.wikimedia.org/P14992 and previous config saved to /var/cache/conftool/dbconfig/20210323-051210-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-03-23T05:13:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Add db1174 to api T274336', diff saved to https://phabricator.wikimedia.org/P14993 and previous config saved to /var/cache/conftool/dbconfig/20210323-051346-marostegui.json

Change 673195 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1136 to s7 master

https://gerrit.wikimedia.org/r/673195

Mentioned in SAL (#wikimedia-operations) [2021-03-23T06:00:38Z] <marostegui> Starting s7 eqiad failover from db1086 to db1136 - T274336

Mentioned in SAL (#wikimedia-operations) [2021-03-23T06:01:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s7 as read-only for maintenance T274336', diff saved to https://phabricator.wikimedia.org/P14994 and previous config saved to /var/cache/conftool/dbconfig/20210323-060104-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-03-23T06:02:17Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1136 to s7 master and remove read-only from s7 T274336', diff saved to https://phabricator.wikimedia.org/P14995 and previous config saved to /var/cache/conftool/dbconfig/20210323-060216-marostegui.json

Change 673196 merged by Marostegui:
[operations/dns@master] wmnet: Update s7-master cname

https://gerrit.wikimedia.org/r/673196

This was done:
RO starts: 06:01:05
RO stops: 06:02:17

Total read-only time: 1 minute 12 seconds
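
For reference, the read-only duration can be recomputed from the two SAL timestamps of the dbctl commits. This is a small sketch, assuming a `date` implementation that accepts `-d` (GNU coreutils, for example). The function name `ro_duration` is made up for illustration.

```shell
#!/bin/sh
# Print the elapsed time between two UTC timestamps ("YYYY-MM-DD hh:mm:ss").
# $1 = start, $2 = stop. Relies on `date -d`, which is not POSIX.
ro_duration() {
    start_epoch="$(date -u -d "$1" +%s)" || return 1
    stop_epoch="$(date -u -d "$2" +%s)" || return 1
    elapsed=$((stop_epoch - start_epoch))
    printf '%ds (%dm%02ds)\n' "$elapsed" $((elapsed / 60)) $((elapsed % 60))
}

# Timestamps taken from the "read-only" and "remove read-only" SAL entries.
ro_duration "2021-03-23 06:01:05" "2021-03-23 06:02:17"
# prints: 72s (1m12s)
```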

Marostegui updated the task description.

Thanks everyone for the support!