
Switchover s3 master (db1223 -> db1157)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist

Checklist:

NEW primary: db1157
OLD primary: db1223

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1223.eqiad.wmnet h=db1157.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s3 T367140" 'A:db-section-s3'
  • Set NEW primary with weight 0
sudo dbctl instance db1157 set-weight 0
sudo dbctl config commit -m "Set db1157 with weight 0 T367140"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1157 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1157 from API/vslow/dump T367140"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1223 db1157
  • Disable puppet on both nodes
sudo cumin 'db1223* or db1157*' 'disable-puppet "primary switchover T367140"'
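
The prep commands above repeat the same host names, section, and task ID. What follows is a minimal sketch of a dry-run helper that prints them for review before anything is executed. The function name `print_prep_commands` and its argument order are hypothetical. The interactive `dbctl instance ... edit` step is left out because it cannot be templated. The commands themselves are copied from the checklist.

```shell
# Hypothetical helper (not part of any Wikimedia tooling):
# prints, but does not run, the failover-prep commands from the checklist.
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    old="$1"
    new="$2"
    section="$3"
    task="$4"
    # Unquoted heredoc: variables expand, quote characters are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

For this task, `print_prep_commands db1223 db1157 s3 T367140` prints the same five commands listed in the checklist.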

Failover:

  • Log the failover:
!log Starting s3 eqiad failover from db1223 to db1157 - T367140
  • Set section read-only:
sudo dbctl --scope eqiad section s3 ro "Maintenance until 06:15 UTC - T367140"
sudo dbctl --scope codfw section s3 ro "Maintenance until 06:15 UTC - T367140"
sudo dbctl config commit -m "Set s3 eqiad as read-only for maintenance - T367140"
  • Check s3 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1223 db1157
echo "===== db1223 (OLD)"; sudo db-mysql db1223 -e 'show slave status\G'
echo "===== db1157 (NEW)"; sudo db-mysql db1157 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s3 set-master db1157
sudo dbctl --scope eqiad section s3 rw
sudo dbctl --scope codfw section s3 rw
sudo dbctl config commit -m "Promote db1157 to s3 primary and set section read-write T367140"
  • Restart puppet on both hosts:
sudo cumin 'db1223* or db1157*' 'run-puppet-agent -e "primary switchover T367140"'
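
The two `show slave status\G` commands in the "Switch primaries" step produce long output that has to be read by eye. A minimal sketch of a filter that pulls out only the `Master_Host` field, assuming the standard vertical (`\G`) output format of the MariaDB client. The function name `master_host_of` is hypothetical.

```shell
# Hypothetical filter: reads 'show slave status\G' output on stdin and
# prints the value of the Master_Host field (prints nothing if absent).
master_host_of() {
    awk -F': ' '$1 ~ /^[[:space:]]*Master_Host$/ { print $2 }'
}

# Intended usage (not executed here):
#   sudo db-mysql db1223 -e 'show slave status\G' | master_host_of
```

After the switch, the old primary db1223 would be expected to report db1157 as its master. What db1157 itself reports depends on the wider replication topology and is not assumed here.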

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1157 heartbeat -e "delete from heartbeat where file like 'db1223%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1157
events_coredb_slave.sql on the new slave db1223
sudo dbctl instance db1223 set-candidate-master --section s3 true
sudo dbctl instance db1157 set-candidate-master --section s3 false
(dborch1001): sudo orchestrator-client -c untag -i db1157 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1223 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's3';"
  • (If needed): Depool db1223 for maintenance.
sudo dbctl instance db1223 depool
sudo dbctl config commit -m "Depool db1223 T367140"
  • Change db1223 weight to mimic the previous weight of db1157:
sudo dbctl instance db1223 edit
  • Update/resolve this ticket.

Event Timeline

Change #1041363 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1157 to s3 master

https://gerrit.wikimedia.org/r/1041363

Change #1041364 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/1041364

Mentioned in SAL (#wikimedia-operations) [2024-06-11T05:03:45Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T367140

Mentioned in SAL (#wikimedia-operations) [2024-06-11T05:03:52Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1157 with weight 0 T367140', diff saved to https://phabricator.wikimedia.org/P64575 and previous config saved to /var/cache/conftool/dbconfig/20240611-050351-root.json

Mentioned in SAL (#wikimedia-operations) [2024-06-11T05:04:05Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T367140

Change #1041363 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1157 to s3 master

https://gerrit.wikimedia.org/r/1041363

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-06-11T05:19:20Z] <marostegui> Starting s3 eqiad failover from db1223 to db1157 - T367140

Mentioned in SAL (#wikimedia-operations) [2024-06-11T05:19:41Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T367140', diff saved to https://phabricator.wikimedia.org/P64577 and previous config saved to /var/cache/conftool/dbconfig/20240611-051941-root.json

Mentioned in SAL (#wikimedia-operations) [2024-06-11T05:20:01Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1157 to s3 primary and set section read-write T367140', diff saved to https://phabricator.wikimedia.org/P64578 and previous config saved to /var/cache/conftool/dbconfig/20240611-052000-root.json

Change #1041364 merged by Marostegui:

[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/1041364

Mentioned in SAL (#wikimedia-operations) [2024-06-11T05:21:02Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1223 T367140', diff saved to https://phabricator.wikimedia.org/P64579 and previous config saved to /var/cache/conftool/dbconfig/20240611-052101-root.json

Marostegui updated the task description.

This is done