Switchover s3 master (db1157 -> db1223)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist

Checklist:

NEW primary: db1223
OLD primary: db1157

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1157.eqiad.wmnet h=db1223.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s3 T370019" 'A:db-section-s3'
  • Set NEW primary with weight 0
sudo dbctl instance db1223 set-weight 0
sudo dbctl config commit -m "Set db1223 with weight 0 T370019"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1223 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1223 from API/vslow/dump T370019"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1157 db1223
  • Disable puppet on both nodes
sudo cumin 'db1157* or db1223*' 'disable-puppet "primary switchover T370019"'

Failover:

  • Log the failover:
!log Starting s3 eqiad failover from db1157 to db1223 - T370019
  • Set section read-only:
sudo dbctl --scope eqiad section s3 ro "Maintenance until 06:15 UTC - T370019"
sudo dbctl --scope codfw section s3 ro "Maintenance until 06:15 UTC - T370019"
sudo dbctl config commit -m "Set s3 eqiad as read-only for maintenance - T370019"
  • Check s3 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1157 db1223
echo "===== db1157 (OLD)"; sudo db-mysql db1157 -e 'show slave status\G'
echo "===== db1223 (NEW)"; sudo db-mysql db1223 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s3 set-master db1223
sudo dbctl --scope eqiad section s3 rw
sudo dbctl --scope codfw section s3 rw
sudo dbctl config commit -m "Promote db1223 to s3 primary and set section read-write T370019"
  • Restart puppet on both hosts:
sudo cumin 'db1157* or db1223*' 'run-puppet-agent -e "primary switchover T370019"'
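
The two `show slave status\G` commands in the "Switch primaries" step print long output, of which only a few fields matter for a quick sanity check. The following is a minimal sketch, not part of the original checklist: `replication_summary` is a hypothetical helper that reads that output on stdin and prints the master host and the two thread states. The expectation noted in the comments (OLD replicating from db1223, NEW showing no replication) is an assumption about how db-switchover leaves the topology and should be checked against the Prerequisites page.

```shell
#!/bin/bash
# Hypothetical helper (not part of the checklist tooling): summarise the
# output of `show slave status\G` read from stdin.
replication_summary() {
    # Fields look like "   Master_Host: db1223.eqiad.wmnet"; the colon in
    # each pattern keeps Slave_SQL_Running_State from matching.
    awk -F': ' '
        /^ *Master_Host:/       { host = $2 }
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END { printf "master=%s io=%s sql=%s\n", host, io, sql }
    '
}

# Usage (assumed expectation: OLD names db1223 as master with both threads
# "Yes"; NEW prints empty fields because it replicates from nothing):
#   sudo db-mysql db1157 -e 'show slave status\G' | replication_summary
#   sudo db-mysql db1223 -e 'show slave status\G' | replication_summary
```

Reading from stdin keeps the helper independent of how the status output is obtained, so it can also be run against saved output.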

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1223 heartbeat -e "delete from heartbeat where file like 'db1157%';"
  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db1223
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db1157
sudo dbctl instance db1157 set-candidate-master --section s3 true
sudo dbctl instance db1223 set-candidate-master --section s3 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db1223 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db1157 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's3';"
  • (If needed): Depool db1157 for maintenance.
sudo dbctl instance db1157 depool
sudo dbctl config commit -m "Depool db1157 T370019"
  • Change db1157's weight to match the previous weight of db1223:
sudo dbctl instance db1157 edit
  • Update/resolve this ticket.
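
The query-killer step above pipes `curl` through `base64 -d` because Gitiles serves `?format=TEXT` content base64-encoded. In a plain pipeline, a failed download would pass empty input on to `db-mysql` without an obvious error. The following is a minimal sketch of a more defensive variant, not part of the original checklist: `decode_events_sql` is a hypothetical helper, and the `-f` flag and `set -o pipefail` in the usage comment are additions to the commands shown in the checklist.

```shell
#!/bin/bash
# Hypothetical helper: decode base64 from stdin and refuse empty or
# undecodable input, so a failed download is never applied as SQL.
decode_events_sql() {
    local decoded
    decoded="$(base64 -d)" || { echo "input is not valid base64" >&2; return 1; }
    if [ -z "$decoded" ]; then
        echo "decoded SQL is empty, refusing to continue" >&2
        return 1
    fi
    printf '%s\n' "$decoded"
}

# Usage (URL taken from the checklist; -f makes curl fail on HTTP errors,
# pipefail makes the whole pipeline report that failure):
#   set -o pipefail
#   url='https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT'
#   curl -sSf "$url" | decode_events_sql | sudo db-mysql db1223
```

If the helper rejects its input, `db-mysql` receives nothing and the pipeline exits non-zero, which is easier to notice than a silent no-op.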

Event Timeline

Change #1054076 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1223 to s3 master

https://gerrit.wikimedia.org/r/1054076

Change #1054077 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/1054077

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-07-16T04:57:38Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s3 T370019

Mentioned in SAL (#wikimedia-operations) [2024-07-16T04:57:59Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T370019

Mentioned in SAL (#wikimedia-operations) [2024-07-16T04:58:40Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1223 with weight 0 T370019', diff saved to https://phabricator.wikimedia.org/P66578 and previous config saved to /var/cache/conftool/dbconfig/20240716-045839-root.json

Change #1054076 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1223 to s3 master

https://gerrit.wikimedia.org/r/1054076

Mentioned in SAL (#wikimedia-operations) [2024-07-16T05:15:00Z] <marostegui> Starting s3 eqiad failover from db1157 to db1223 - T370019

Mentioned in SAL (#wikimedia-operations) [2024-07-16T05:15:17Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T370019', diff saved to https://phabricator.wikimedia.org/P66579 and previous config saved to /var/cache/conftool/dbconfig/20240716-051516-root.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T05:15:39Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1223 to s3 primary and set section read-write T370019', diff saved to https://phabricator.wikimedia.org/P66580 and previous config saved to /var/cache/conftool/dbconfig/20240716-051538-root.json

Change #1054077 merged by Marostegui:

[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/1054077

Mentioned in SAL (#wikimedia-operations) [2024-07-16T05:17:19Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1157 T370019', diff saved to https://phabricator.wikimedia.org/P66581 and previous config saved to /var/cache/conftool/dbconfig/20240716-051718-root.json

Marostegui updated the task description.

This is done