
Switchover s8 master (db1109 -> db1126)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s8.dblist

Checklist:

NEW primary: db1126
OLD primary: db1109

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1109.eqiad.wmnet h=db1126.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s8 T330991" 'A:db-section-s8'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1126 set-weight 0
sudo dbctl config commit -m "Set db1126 with weight 0 T330991"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db1109 db1126
  • Disable puppet on both nodes
sudo cumin 'db1109* or db1126*' 'disable-puppet "primary switchover T330991"'
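
The prep steps above differ between switchovers only in the host names, the section and the task ID, so they can be sketched as a small template. This is a minimal sketch, not the script that generates this checklist: the function name `print_prep_commands` is hypothetical, the flags are copied verbatim from the steps above, and the assumption that the `A:db-section-<section>` alias exists for other sections is not confirmed by this task. The function only prints the commands; it does not run them.

```shell
# Hypothetical helper: print (do not run) the failover-prep commands.
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    old=$1 new=$2 section=$3 task=$4
    # Unquoted heredoc: variables expand, quotes and '*' stay literal.
    # Assumption: the cumin alias follows the pattern A:db-section-<section>.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

Calling `print_prep_commands db1109 db1126 s8 T330991` reproduces the five commands listed above, which makes it easy to review them before pasting.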

Failover:

  • Log the failover:
!log Starting s8 eqiad failover from db1109 to db1126 - T330991
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db1109 db1126
echo "===== db1109 (OLD)"; sudo db-mysql db1109 -e 'show slave status\G'
echo "===== db1126 (NEW)"; sudo db-mysql db1126 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope eqiad section s8 set-master db1126
sudo dbctl config commit -m "Promote db1126 to s8 primary T330991"
  • Restart puppet on both hosts:
sudo cumin 'db1109* or db1126*' 'run-puppet-agent -e "primary switchover T330991"'
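
The two `show slave status\G` checks above produce long output in which only a few fields matter right after a switchover. A minimal sketch of a filter that keeps those fields follows; the function names `summarize_slave_status` and `replication_running` are hypothetical, and which `Master_Host` is correct for each host depends on the topology, so the sketch reports values rather than judging them.

```shell
# Hypothetical helper: reduce `show slave status\G` output (read from stdin)
# to the replication source, thread states and lag.
summarize_slave_status() {
    awk -F': ' '
        /^ *(Master_Host|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master):/ {
            key = $1
            gsub(/^ +/, "", key)      # strip the right-alignment padding
            printf "%s=%s\n", key, $2
        }
    '
}

# Hypothetical helper: succeed only if both replication threads report "Yes".
replication_running() {
    summary=$(summarize_slave_status)
    printf '%s\n' "$summary" | grep -qx 'Slave_IO_Running=Yes' &&
        printf '%s\n' "$summary" | grep -qx 'Slave_SQL_Running=Yes'
}
```

It could be used as `sudo db-mysql db1109 -e 'show slave status\G' | summarize_slave_status`. The trailing colon in the pattern keeps `Slave_SQL_Running_State` out of the summary.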

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1126 heartbeat -e "delete from heartbeat where file like 'db1109%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1126
events_coredb_slave.sql on the new slave db1109
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1109 set-candidate-master --section s8 true
sudo dbctl instance db1126 set-candidate-master --section s8 false
(dborch1001): sudo orchestrator-client -c untag -i db1126 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1109 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's8';"
  • (If needed): Depool db1109 for maintenance.
sudo dbctl instance db1109 depool
sudo dbctl config commit -m "Depool db1109 T330991"
  • Change db1109 weight to mimic the previous weight of db1126:
sudo dbctl instance db1109 edit
  • Update/resolve this ticket.
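
The heartbeat cleanup above is a `delete`, so it can help to count the matching rows first. The following is a minimal sketch under that assumption: the function name `heartbeat_cleanup_sql` is hypothetical, the `delete` statement is the one from the checklist, and the `select count(*)` preview is an addition that is not part of the original procedure. The function only prints SQL and rejects host names containing anything other than lowercase letters, digits and hyphens.

```shell
# Hypothetical helper: print a preview query and the cleanup statement
# for heartbeat rows written by the OLD primary.
heartbeat_cleanup_sql() {
    old_primary=$1
    case $old_primary in
        ''|*[!a-z0-9-]*)
            echo "invalid host name: '$old_primary'" >&2
            return 1
            ;;
    esac
    # Preview (assumption: not part of the original checklist).
    printf "select count(*) from heartbeat where file like '%s%%';\n" "$old_primary"
    # Cleanup statement, as in the checklist.
    printf "delete from heartbeat where file like '%s%%';\n" "$old_primary"
}
```

For example, `sudo db-mysql db1126 heartbeat -e "$(heartbeat_cleanup_sql db1109 | head -n 1)"` would run only the preview query.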

Event Timeline

Change 893431 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1126 to s8 master

https://gerrit.wikimedia.org/r/893431

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to Blocked on the DBA board.

To be done after the eqiad row A switch maintenance is done (T329073) as the candidate master is on row A

Thanks <3 I will make the script figure out the primary dc properly instead of hard-coding it.

Mentioned in SAL (#wikimedia-operations) [2023-03-08T07:05:08Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T330991

Mentioned in SAL (#wikimedia-operations) [2023-03-08T07:05:31Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T330991

Mentioned in SAL (#wikimedia-operations) [2023-03-08T07:05:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1126 with weight 0 T330991', diff saved to https://phabricator.wikimedia.org/P45391 and previous config saved to /var/cache/conftool/dbconfig/20230308-070544-root.json

Change 893431 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1126 to s8 master

https://gerrit.wikimedia.org/r/893431

Mentioned in SAL (#wikimedia-operations) [2023-03-08T07:29:30Z] <marostegui> Starting s8 eqiad failover from db1109 to db1126 - T330991

Mentioned in SAL (#wikimedia-operations) [2023-03-08T07:30:06Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1126 to s8 primary T330991', diff saved to https://phabricator.wikimedia.org/P45394 and previous config saved to /var/cache/conftool/dbconfig/20230308-073005-root.json

Mentioned in SAL (#wikimedia-operations) [2023-03-08T07:31:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1109 T330991', diff saved to https://phabricator.wikimedia.org/P45395 and previous config saved to /var/cache/conftool/dbconfig/20230308-073110-root.json

Marostegui added a parent task: Restricted Task.

All done and db1109 rebooted for kernel upgrade.