Switchover s1 from db1083 to db1163
Closed, Resolved, Public

Description

db1083, currently the s1 primary master, is on the list of hosts that might suffer a BBU crash at any time (T258386). We also need to update its kernel (T273280).
We need to promote db1163 to primary master instead.

When: 28th April at 05:00 AM UTC

Checklist:

  • Double check db1163 has report_host enabled T271106
  • Create a task to communicate the chosen date and send an announcement to the community: T279505

NEW master: db1163
OLD master: db1083

  • Check configuration differences between new and old master

pt-config-diff h=db1083.eqiad.wmnet,F=/root/.my.cnf h=db1163.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Set NEW master with weight 0 in s1

dbctl instance db1163 edit
dbctl config commit -m "Set db1163 with weight 0 T278214"

  • Topology changes, connect everything to db1163

db-switchover --timeout=15 --only-slave-move db1083.eqiad.wmnet db1163.eqiad.wmnet

Failover:

  • Start the failover

!log Starting s1 eqiad failover from db1083 to db1163 - T278214

  • Read only on s1

dbctl --scope eqiad section s1 ro "Maintenance till 06:15 AM UTC" && dbctl config commit -m "Set s1 as read-only for maintenance T278214"

  • Check s1 is indeed on read only
  • Run the switchover script from cumin1001:

db-switchover --skip-slave-move db1083 db1163 ; echo db1083; mysql.py -hdb1083 -e "show slave status\G" ; echo db1163 ; mysql.py -hdb1163 -e "show slave status\G"

  • Promote db1163 as new master and remove read-only

dbctl --scope eqiad section s1 set-master db1163 && dbctl --scope eqiad section s1 rw && dbctl config commit -m "Promote db1163 to s1 master and remove read-only from s1 T278214"

  • Restart puppet on old and new masters (for heartbeat): db1163 and db1083

run-puppet-agent -e "switchover to db1163"
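
The `show slave status\G` output printed after the switchover script can be checked mechanically instead of by eye. The following is a minimal sketch, not part of the official procedure: `replication_ok` is a hypothetical helper name, and it assumes that `Master_Host` is reported as the fully qualified host name.

```shell
# Hypothetical helper: reads `show slave status\G` output on stdin and
# succeeds only if the replica points at the expected master and both
# replication threads are running.
replication_ok() {
  awk -v want="$1" '
    $1 == "Master_Host:"       { host = $2 }
    $1 == "Slave_IO_Running:"  { io = $2 }
    $1 == "Slave_SQL_Running:" { sql = $2 }
    END { exit !(host == want && io == "Yes" && sql == "Yes") }
  '
}
```

Assumed usage, reusing the `mysql.py` invocation from the checklist: `mysql.py -hdb1083 -e "show slave status\G" | replication_ok db1163.eqiad.wmnet && echo "db1083 replicates from db1163"`.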

Clean up tasks:

  • Change events for query killer:
events_coredb_master.sql on the new master db1163
events_coredb_slave.sql on the new slave db1083
  • Update candidate master flags in dbctl:

dbctl instance db1163 set-candidate-master --section s1 false
dbctl instance db1083 set-candidate-master --section s1 true

dbctl instance db1083 edit

  • Update/resolve phabricator ticket about failover

Event Timeline

Let's wait for T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY to be completed on s1, so we can promote a new master with the new schema.

I have started the schema change on s1 today. Hopefully it will be fully done sometime next week, so I can schedule the switchover for late April.

I think T276150 will be done by next week, so I am going to schedule this switchover for 28th April at 05:00 AM UTC.

Change 682798 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1118 as stretch

https://gerrit.wikimedia.org/r/682798

Change 682798 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1118 as stretch

https://gerrit.wikimedia.org/r/682798

Change 682881 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/682881

Change 682882 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s1-master to the right master

https://gerrit.wikimedia.org/r/682882

Mentioned in SAL (#wikimedia-operations) [2021-04-28T04:07:18Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1163 with weight 0 before the switchover T278214', diff saved to https://phabricator.wikimedia.org/P15598 and previous config saved to /var/cache/conftool/dbconfig/20210428-040718-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-04-28T04:08:13Z] <marostegui> Start replication changes, connect everything to db1163 T278214

Change 682881 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/682881

Mentioned in SAL (#wikimedia-operations) [2021-04-28T05:00:21Z] <marostegui> Starting s1 eqiad failover from db1083 to db1163 - T278214

Mentioned in SAL (#wikimedia-operations) [2021-04-28T05:00:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s1 as read-only for maintenance T278214', diff saved to https://phabricator.wikimedia.org/P15599 and previous config saved to /var/cache/conftool/dbconfig/20210428-050041-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-04-28T05:01:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1163 to s1 master and remove read-only from s1 T278214', diff saved to https://phabricator.wikimedia.org/P15600 and previous config saved to /var/cache/conftool/dbconfig/20210428-050138-marostegui.json

Change 682882 merged by Marostegui:

[operations/dns@master] wmnet: Update s1-master to the right master

https://gerrit.wikimedia.org/r/682882

The switchover was done.
RO starts: 05:00:41
RO stops: 05:01:38

Total RO time: 57 seconds
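
The read-only window can be recomputed from the two timestamps above. This is a small sketch assuming GNU `date` (for the `-d` option); `ro_seconds` is a hypothetical helper name, and the date is taken from the SAL entries for this switchover.

```shell
# Hypothetical helper: seconds elapsed between two UTC timestamps.
# Requires GNU date, which parses the timestamp strings via -d.
ro_seconds() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( end - start ))
}

ro_seconds "2021-04-28 05:00:41 UTC" "2021-04-28 05:01:38 UTC"   # prints 57
```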

All the switchover steps are done. I will continue with the follow-up items once 24h have passed, to ensure the new master is working fine.

Change 683473 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/683473

Change 683473 merged by Marostegui:

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/683473

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1118.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104290429_marostegui_7429.log.

Completed auto-reimage of hosts:

['db1118.eqiad.wmnet']

and were ALL successful.

Transfer from db1083 to db1118 is ongoing.

Change 683476 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1083: Disable notifications

https://gerrit.wikimedia.org/r/683476

Change 683476 merged by Marostegui:

[operations/puppet@production] db1083: Disable notifications

https://gerrit.wikimedia.org/r/683476

db1118 cloned from db1083, checking its tables now.

Marostegui updated the task description.

db1118 has been cloned from db1083. Once its table checks have finished, it will be pooled.
Once it's been working fine for a few days, db1083 will be sent to decommissioning. Closing this.

db1118 tables are good; I will pool this host next week.