
Failover s6 master, db1093 to db1131
Closed, Resolved (Public)

Description

db1131 was moved to row C (T262901).

Let's fail over db1093, which will also be decommissioned next quarter (T258361).
Doing this also balances the number of masters per row in eqiad a bit better, since row D currently holds four of them. The current status is:

+----------+------+---------+
| instance | rack | section |
+----------+------+---------+
| db1081   | A2   | s4      |
| db1083   | B1   | s1      |
| db1086   | B3   | s7      |
| db1100   | C2   | s5      |
| db1122   | D6   | s2      |
| db1093   | D8   | s6      |
| db1109   | D8   | s8      |
| db1123   | D8   | s3      |
+----------+------+---------+
8 rows in set (0.002 sec)

By moving the s6 master to row C and also moving the s8 master (T239238), we would end up with:

2 masters in row A
3 masters in row B
2 masters in row C
2 masters in row D
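
The per-row tally can be reproduced from the table above. This is a minimal sketch, assuming the row is the leading letter of the rack name; `count_masters_per_row` is a hypothetical helper and not part of any existing tooling. It only models the s6 move, because the destination of the s8 master is not stated in this task, and db1131's exact rack within row C is not given either.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally masters per row from "instance rack section"
# lines on stdin.
# Assumption: the row is the first character of the rack name (D8 -> row D).
count_masters_per_row() {
  awk 'NF >= 3 { rows[substr($2, 1, 1)]++ }
       END { for (r in rows) print r, rows[r] }' | sort
}

# Masters exactly as listed in the table above.
current_masters="db1081 A2 s4
db1083 B1 s1
db1086 B3 s7
db1100 C2 s5
db1122 D6 s2
db1093 D8 s6
db1109 D8 s8
db1123 D8 s3"

echo "Current masters per row:"
printf '%s\n' "$current_masters" | count_masters_per_row

# After the s6 failover the s6 master lives in row C. The rack of db1131
# is not stated in this task, so "C?" is only a placeholder.
echo "After moving s6 to db1131 (row C):"
printf '%s\n' "$current_masters" \
  | sed 's/^db1093 D8 s6$/db1131 C? s6/' \
  | count_masters_per_row
```

With the table as given, the s6 move alone takes row D from four masters to three and row C from one to two.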

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Pending comment on the DBA board.

Steps and checklist:

Preparation

NEW master: db1131
OLD master: db1093

  • Check configuration differences between new and old master

pt-config-diff h=db1093.eqiad.wmnet,F=/root/.my.cnf h=db1131.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Set NEW master with weight 0 in s6

dbctl instance db1131 edit
dbctl config commit -m "Set db1131 with weight 0 T263227"

  • Topology changes, connect everything to db1131

db-switchover --timeout=15 --read-only-master --replicating-master --only-slave-move db1093.eqiad.wmnet db1131.eqiad.wmnet
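
Once the topology change has finished, each replica should be replicating from db1131 with both replication threads running. This is a minimal sketch of such a check, assuming it is fed the output of `SHOW SLAVE STATUS\G`. The `replication_ok` helper is hypothetical and not part of `db-switchover`, and the assumption that `Master_Host` holds the fully qualified name may not match how the replicas are actually configured.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `SHOW SLAVE STATUS\G` output on stdin and succeed
# only if the replica points at the expected master and both replication
# threads are running.
replication_ok() {
  local expected_master="$1"
  awk -v want="$expected_master" '
    $1 == "Master_Host:"       { host = $2 }
    $1 == "Slave_IO_Running:"  { io = $2 }
    $1 == "Slave_SQL_Running:" { sql = $2 }
    END { exit !(host == want && io == "Yes" && sql == "Yes") }
  '
}

# Usage sketch. Assumptions: a mysql client with access to the replica;
# <replica> is a placeholder, not a host named in this task.
#   mysql -h <replica> -e 'SHOW SLAVE STATUS\G' \
#     | replication_ok db1131.eqiad.wmnet && echo "replication OK"
```

The function returns a non-zero status when the master host differs or either thread is not `Yes`, so it can gate the next step of the checklist.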

Failover:

  • Start the failover

!log Starting s6 eqiad failover from db1093 to db1131 - T263227

  • Run switchover script from cumin1001:

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --replicating-master --read-only-master db1093 db1131

  • Give weight to db1093 in s6

dbctl instance db1093 edit

  • Promote db1131 as new master

dbctl --scope eqiad section s6 set-master db1131 && dbctl config commit -m "Promote db1131 on s6 eqiad master T263227"

  • Restart puppet on old and new masters (for heartbeat): db1131 and db1093

run-puppet-agent -e "switchover to db1131"

Clean up tasks:

  • Change events for query killer:

events_coredb_master.sql on the new master db1131
events_coredb_slave.sql on the new slave db1093

  • Update candidate master status

dbctl instance db1131 set-candidate-master false
dbctl instance db1093 set-candidate-master true
dbctl config commit -m "Update candidate master status T263227"

  • Check tendril was updated
  • Check zarcillo was updated
  • Update/resolve phabricator ticket about failover

Change 630773 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1131 to s6 master

https://gerrit.wikimedia.org/r/630773

Change 630774 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/630774

Mentioned in SAL (#wikimedia-operations) [2020-09-30T07:18:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1131 with weight 0 T263227', diff saved to https://phabricator.wikimedia.org/P12851 and previous config saved to /var/cache/conftool/dbconfig/20200930-071841-marostegui.json

Change 630773 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1131 to s6 master

https://gerrit.wikimedia.org/r/630773

Mentioned in SAL (#wikimedia-operations) [2020-09-30T07:41:37Z] <marostegui> Starting s6 eqiad failover from db1093 to db1131 - T263227

Mentioned in SAL (#wikimedia-operations) [2020-09-30T07:44:17Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1131 on s6 eqiad master T263227, also give weight to db1093 as new API host', diff saved to https://phabricator.wikimedia.org/P12852 and previous config saved to /var/cache/conftool/dbconfig/20200930-074417-marostegui.json

Change 630774 merged by Marostegui:
[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/630774

This is all done - db1131 is the new s6 eqiad master