
Switchover s4 (commonswiki) from db1081 to db1138
Open, Medium, Public

Description

db1081, the current s4 (commonswiki) primary master, is on the list of hosts that might suffer a BBU crash at any time (T258386).
We need to promote db1138 to primary master in its place.

When: Tue 26th January, 07:00-07:15 UTC

Checklist:

  • Restart db1138 to pick up the report_host setting (T271106)
  • Create a task to communicate the chosen date and send an announcement to the community

NEW master: db1138
OLD master: db1081

  • Check configuration differences between new and old master

pt-config-diff h=db1081.eqiad.wmnet,F=/root/.my.cnf h=db1138.eqiad.wmnet,F=/root/.my.cnf
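
A minimal sketch of acting on that diff, assuming pt-config-diff follows its documented convention of exiting 0 when no differences are found and non-zero otherwise; `config_diff_ok` is a hypothetical helper, not part of any existing tooling:

```shell
# Hypothetical helper: run a config-diff command and decide whether it is
# safe to continue based only on its exit status.
# Assumption: the command exits 0 when the two configurations match.
config_diff_ok() {
  if "$@"; then
    echo "no configuration differences"
    return 0
  else
    echo "configuration differences found (or the tool failed); review before continuing" >&2
    return 1
  fi
}

# Usage on cumin1001:
#   config_diff_ok pt-config-diff h=db1081.eqiad.wmnet,F=/root/.my.cnf h=db1138.eqiad.wmnet,F=/root/.my.cnf
```

Any non-zero exit is treated as "stop and read the output", since the exit status alone does not distinguish real differences from a connection error.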

  • Silence alerts on all hosts
  • Set NEW master with weight 0 in s4

dbctl instance db1138 edit
dbctl config commit -m "Set db1138 with weight 0 T271427"

  • Topology changes: connect all replicas to db1138

db-switchover --timeout=15 --read-only-master --replicating-master --only-slave-move db1081.eqiad.wmnet db1138.eqiad.wmnet
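
After the replicas have been moved, each one can be checked for its new source. A sketch, assuming MariaDB's `SHOW SLAVE STATUS\G` field names (`Master_Host`, `Slave_IO_Running`, `Slave_SQL_Running`) and that replicas were repointed using the FQDN; `replica_points_to` is a hypothetical helper:

```shell
# Hypothetical helper: read `SHOW SLAVE STATUS\G` output on stdin and succeed
# only if the replica points at the expected source host and both
# replication threads are running.
replica_points_to() {
  local expected="$1"
  awk -v want="$expected" '
    # Lines look like "   Master_Host: db1138.eqiad.wmnet"
    $1 == "Master_Host:"       { host = $2 }
    $1 == "Slave_IO_Running:"  { io = $2 }
    $1 == "Slave_SQL_Running:" { sql = $2 }
    END {
      if (host == want && io == "Yes" && sql == "Yes") exit 0
      exit 1
    }
  '
}

# Usage on each s4 replica (assumes root credentials are picked up by mysql):
#   sudo mysql -e "SHOW SLAVE STATUS\G" | replica_points_to db1138.eqiad.wmnet && echo OK
```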

  • Disable puppet on db1138 and db1081:

puppet agent --disable "switchover to db1138"

  • Merge the Gerrit puppet change to promote db1138: TODO

Failover:

  • Start the failover

!log Starting s4 eqiad failover from db1081 to db1138 - T271427

  • Run the switchover script from cumin1001:

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --replicating-master --read-only-master db1081 db1138

  • Promote db1138 as new master

dbctl --scope eqiad section s4 set-master db1138 && dbctl config commit -m "Promote db1138 to s4 eqiad master T271427"
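
The same promotion step as a parameterised sketch, mainly to show the commit message passed as a single quoted argument. `promote_master` and the `DBCTL` override are hypothetical conveniences; with `DBCTL=echo` the commands are printed instead of run:

```shell
# Hypothetical wrapper around the promotion step.
# DBCTL defaults to the real dbctl binary; set DBCTL=echo for a dry run.
promote_master() {
  local dc="$1" section="$2" host="$3" task="$4"
  local dbctl="${DBCTL:-dbctl}"
  # `&&` ensures the commit only happens if set-master succeeded.
  "$dbctl" --scope "$dc" section "$section" set-master "$host" &&
    "$dbctl" config commit -m "Promote $host to $section $dc master $task"
}

# Dry run:
#   DBCTL=echo promote_master eqiad s4 db1138 T271427
```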

  • Re-enable and run puppet on the old and new masters (for heartbeat): db1138 and db1081

run-puppet-agent -e "switchover to db1138"

  • Give weight to db1081 in s4:

dbctl instance db1081 edit

Clean up tasks:

  • Change the query killer events:
events_coredb_master.sql on the new master db1138
events_coredb_slave.sql on the new slave db1081
  • Update DNS: TODO
  • Update the candidate master flags in dbctl; the new candidate master is db1081:
dbctl instance db1138 set-candidate-master false
dbctl instance db1081 set-candidate-master true
dbctl config commit -m "Update candidate master status T271427"
  • Check that tendril was updated
  • Check that zarcillo was updated
  • Update/resolve the Phabricator ticket about the failover
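
For the query killer step in the clean-up list, a sketch that maps a host's new role to the events file it should receive. `events_file_for_role` is hypothetical, and the directory holding the .sql files is not given in this task, so the path in the usage comment is a placeholder:

```shell
# Hypothetical helper: print the query-killer events file for a host's role.
events_file_for_role() {
  case "$1" in
    master)        echo "events_coredb_master.sql" ;;
    replica|slave) echo "events_coredb_slave.sql" ;;
    *)             echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Usage (placeholder path):
#   on db1138: sudo mysql < "/path/to/$(events_file_for_role master)"
#   on db1081: sudo mysql < "/path/to/$(events_file_for_role slave)"
# Then confirm what was loaded:
#   sudo mysql -e "SELECT EVENT_SCHEMA, EVENT_NAME, STATUS FROM information_schema.EVENTS"
```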

Event Timeline

Restricted Application added a subscriber: Aklapper. Thu, Jan 7, 3:03 PM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-01-12T10:13:19Z] <marostegui> Restart mysql on db1138 to pick up new config T271427 T271106

Marostegui updated the task description. Tue, Jan 12, 10:17 AM
Marostegui updated the task description. Tue, Jan 12, 10:22 AM
Marostegui updated the task description. Tue, Jan 12, 11:20 AM

Upgraded the kernel on db1138 as part of T272255.

Added to the deployments calendar