
Switchover s8 master (db2165 -> db2161)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s8.dblist

Checklist:

NEW primary: db2161
OLD primary: db2165

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2165.codfw.wmnet h=db2161.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s8 T330056" 'A:db-section-s8'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2161 set-weight 0
sudo dbctl config commit -m "Set db2161 with weight 0 T330056"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2165 db2161
  • Disable puppet on both nodes
sudo cumin 'db2165* or db2161*' 'disable-puppet "primary switchover T330056"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
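
The prep commands above differ only in the two host names, the section, and the task ID. As a minimal sketch, assuming you want to review the exact commands before running anything, the hypothetical helper `render_prep` below (not part of dbctl or db-switchover) prints them for a given pair of hosts without executing them:

```shell
#!/bin/sh
# Dry-run sketch: prints the failover-prep commands from the checklist
# for a given pair of hosts. Nothing is executed against any host.
# render_prep is a hypothetical helper, not part of dbctl or db-switchover.
render_prep() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Quotes inside an unquoted here-document are literal, so the
    # printed commands keep the quoting used in the checklist.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

render_prep db2165 db2161 s8 T330056
```

The argument order (OLD, NEW, section, task) is a choice made for this sketch; the puppet change in the last prep step still has to be merged by hand.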

Failover:

  • Log the failover:
!log Starting s8 codfw failover from db2165 to db2161 - T330056
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2165 db2161
echo "===== db2165 (OLD)"; sudo db-mysql db2165 -e 'show slave status\G'
echo "===== db2161 (NEW)"; sudo db-mysql db2161 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s8 set-master db2161
sudo dbctl config commit -m "Promote db2161 to s8 primary T330056"
  • Restart puppet on both hosts:
sudo cumin 'db2165* or db2161*' 'run-puppet-agent -e "primary switchover T330056"'
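
The two `show slave status` calls in the "Switch primaries" step are read by eye. A minimal sketch of an automated check follows, assuming the output contains the standard `Slave_IO_Running` and `Slave_SQL_Running` fields. `replication_running` is a hypothetical helper, and the sample text is illustrative, not output captured from db2165 or db2161:

```shell
#!/bin/sh
# replication_running is a hypothetical helper: it reads the text of a
# vertical-format "show slave status" result on stdin and succeeds only
# if both replication threads report "Yes".
replication_running() {
    awk -F': *' '
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Illustrative input only, not real output from either host.
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'

if printf '%s\n' "$sample" | replication_running; then
    echo "replication threads running"
else
    echo "replication NOT running"
fi
```

In practice the output of the `db-mysql` commands from the checklist would be piped into the function instead of the sample text. Which hosts are expected to be replicating after the switch depends on the topology, so a failure here is a prompt to look closer rather than a verdict.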

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2161 heartbeat -e "delete from heartbeat where file like 'db2165%';"
  • Change events for the query killer:
events_coredb_master.sql on the new primary db2161
events_coredb_slave.sql on the new slave db2165
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2165 set-candidate-master --section s8 true
sudo dbctl instance db2161 set-candidate-master --section s8 false
(dborch1001): sudo orchestrator-client -c untag -i db2161 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2165 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's8';"
  • (If needed): Depool db2165 for maintenance.
sudo dbctl instance db2165 depool
sudo dbctl config commit -m "Depool db2165 T330056"
  • Change the weight of db2165 to match the previous weight of db2161:
sudo dbctl instance db2165 edit
  • Apply outstanding schema changes to db2165 (if any)
  • Update/resolve this ticket.
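
The heartbeat cleanup in the first clean-up step is a `delete` with a `like` pattern, so it may be worth previewing the matching rows first. As a minimal sketch, the hypothetical helper `heartbeat_preview_sql` below builds a `select` with the same `where` clause as the checklist's `delete`; it only prints SQL and does not connect to any host:

```shell
#!/bin/sh
# heartbeat_preview_sql is a hypothetical helper: it prints a SELECT that
# matches the same rows as the DELETE in the checklist, so the rows can
# be reviewed before they are removed.
heartbeat_preview_sql() {
    old="$1"
    # %% in the printf format emits the literal % wildcard for LIKE.
    printf "select * from heartbeat where file like '%s%%';\n" "$old"
}

heartbeat_preview_sql db2165
```

The printed statement can be passed to `sudo db-mysql db2161 heartbeat -e "..."` in the same way as the `delete` in the checklist.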

Event Timeline

Change 890346 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2161 to s8 master

https://gerrit.wikimedia.org/r/890346

The old master is on C5 and the new master is on B6. This means I need to do a switchover, make the externallinks change, and then switch back before tomorrow.

Mentioned in SAL (#wikimedia-operations) [2023-02-20T09:26:50Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330056

Mentioned in SAL (#wikimedia-operations) [2023-02-20T09:27:14Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330056

Mentioned in SAL (#wikimedia-operations) [2023-02-20T09:27:27Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db2161 with weight 0 T330056', diff saved to https://phabricator.wikimedia.org/P44686 and previous config saved to /var/cache/conftool/dbconfig/20230220-092727-ladsgroup.json

Ladsgroup triaged this task as Medium priority.
Ladsgroup moved this task from Triage to In progress on the DBA board.

Change 890346 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db2161 to s8 master

https://gerrit.wikimedia.org/r/890346

Mentioned in SAL (#wikimedia-operations) [2023-02-20T09:52:31Z] <Amir1> Starting s8 codfw failover from db2165 to db2161 - T330056

Mentioned in SAL (#wikimedia-operations) [2023-02-20T09:53:08Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db2161 to s8 primary T330056', diff saved to https://phabricator.wikimedia.org/P44687 and previous config saved to /var/cache/conftool/dbconfig/20230220-095308-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-02-20T09:55:26Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db2165 T330056', diff saved to https://phabricator.wikimedia.org/P44688 and previous config saved to /var/cache/conftool/dbconfig/20230220-095526-ladsgroup.json

Ladsgroup updated the task description.
Ladsgroup changed the edit policy from "Custom Policy" to "All Users".
Ladsgroup moved this task from In progress to Done on the DBA board.