Switchover s2 master (db1162 -> db1222)
Closed, Resolved | Public

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist

Checklist:

NEW primary: db1222
OLD primary: db1162

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1162.eqiad.wmnet h=db1222.eqiad.wmnet
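
One way to act on this check, as a rough sketch: per the Percona Toolkit documentation, pt-config-diff exits with status 0 when the two configurations match and 1 when they differ. The helper name report_config_diff below is hypothetical and not part of the checklist; it only maps an exit status to a message.

```shell
# Hypothetical helper (not part of the checklist): map a pt-config-diff
# exit status to a go/no-go message. Assumes 0 = no differences and
# 1 = differences found; any other status is treated as a tool failure.
report_config_diff() {
    case "$1" in
        0) echo "OK: no configuration differences" ;;
        1) echo "STOP: configuration differences found, review before continuing" ;;
        *) echo "ERROR: pt-config-diff failed with status $1" ;;
    esac
}
```

It would be called right after the pt-config-diff command above, for example as report_config_diff $?.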

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s2 T362036" 'A:db-section-s2'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1222 set-weight 0
sudo dbctl config commit -m "Set db1222 with weight 0 T362036"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1162 db1222
  • Disable puppet on both nodes
sudo cumin 'db1162* or db1222*' 'disable-puppet "primary switchover T362036"'
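
The prep steps above differ between switchovers only in the host names, section and task ID, so they can be templated. The sketch below is a dry run: it prints the commands and executes nothing. The function name prep_commands and its argument order are assumptions made for illustration; the commands themselves are copied from the steps above.

```shell
# Hypothetical dry-run helper: print the failover prep commands for a
# given OLD primary, NEW primary, section and task ID. Nothing is executed.
prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

Running prep_commands db1162 db1222 s2 T362036 prints the five commands listed in this section, in order, so they can be reviewed before being pasted one at a time.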

Failover:

  • Log the failover:
!log Starting s2 eqiad failover from db1162 to db1222 - T362036
  • Set section read-only:
sudo dbctl --scope eqiad section s2 ro "Maintenance until 06:15 UTC - T362036"
sudo dbctl --scope codfw section s2 ro "Maintenance until 06:15 UTC - T362036"
sudo dbctl config commit -m "Set s2 eqiad as read-only for maintenance - T362036"
  • Check s2 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1162 db1222
echo "===== db1162 (OLD)"; sudo db-mysql db1162 -e 'show slave status\G'
echo "===== db1222 (NEW)"; sudo db-mysql db1222 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s2 set-master db1222
sudo dbctl --scope eqiad section s2 rw
sudo dbctl --scope codfw section s2 rw
sudo dbctl config commit -m "Promote db1222 to s2 primary and set section read-write T362036"
  • Restart puppet on both hosts:
sudo cumin 'db1162* or db1222*' 'run-puppet-agent -e "primary switchover T362036"'
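
The two show slave status commands above print raw output without saying what to look for. A minimal sketch of a check, assuming the usual outcome of a switchover (the OLD primary db1162 now replicates from db1222, while db1222 reports no replica status at all): the helper below reads show slave status\G output on standard input and succeeds only if both replication threads report Yes. The function name is_replicating is hypothetical; Slave_IO_Running and Slave_SQL_Running are the standard MariaDB field names.

```shell
# Hypothetical helper: read `show slave status\G` output on stdin and
# return 0 only if both the IO and SQL replication threads are running.
# Empty input (a host that is not a replica) returns 1.
is_replicating() {
    awk -F': ' '
        $1 ~ /Slave_IO_Running$/  { io = $2 }
        $1 ~ /Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

For example, sudo db-mysql db1162 -e 'show slave status\G' | is_replicating should succeed after the switch, and the same pipeline against db1222 should fail if the assumption above holds.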

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1222 heartbeat -e "delete from heartbeat where file like 'db1162%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1222
events_coredb_slave.sql on the new replica db1162
  • Update the candidate primary in dbctl and orchestrator:
sudo dbctl instance db1162 set-candidate-master --section s2 true
sudo dbctl instance db1222 set-candidate-master --section s2 false
(dborch1001): sudo orchestrator-client -c untag -i db1222 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1162 --tag name=candidate
  • Check that zarcillo lists the new primary:
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's2';"
  • (If needed): Depool db1162 for maintenance.
sudo dbctl instance db1162 depool
sudo dbctl config commit -m "Depool db1162 T362036"
  • Change db1162 weight to mimic the previous weight of db1222:
sudo dbctl instance db1162 edit
  • Apply outstanding schema changes to db1162 (if any)
  • Update/resolve this ticket.
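
The heartbeat cleanup above deletes rows by host name prefix, so it can help to count the matching rows before deleting them. The sketch below only prints SQL; it does not connect to any database. The function name heartbeat_cleanup_sql and the host name validation rule (db followed by digits) are assumptions made for illustration; the heartbeat table and its file column are taken from the command above.

```shell
# Hypothetical helper: print a preview SELECT and the DELETE statement for
# the OLD primary's heartbeat rows. Only names such as db1162 are accepted,
# so nothing unexpected ends up inside the SQL string.
heartbeat_cleanup_sql() {
    host="$1"
    suffix="${host#db}"
    case "$suffix" in
        "$host" | "" | *[!0-9]*)
            echo "invalid host name: $host" >&2
            return 1 ;;
    esac
    echo "select count(*) from heartbeat where file like '${host}%';"
    echo "delete from heartbeat where file like '${host}%';"
}
```

The first printed statement can be run with sudo db-mysql db1222 heartbeat -e "..." to see how many rows would go, and the second is the delete from the checklist.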

Event Timeline

Change #1017451 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1222 to s2 master

https://gerrit.wikimedia.org/r/1017451

Change #1017452 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s2-master alias

https://gerrit.wikimedia.org/r/1017452

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-04-09T05:09:55Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s2 T362036

Mentioned in SAL (#wikimedia-operations) [2024-04-09T05:10:19Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s2 T362036

Mentioned in SAL (#wikimedia-operations) [2024-04-09T05:10:28Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1222 with weight 0 T362036', diff saved to https://phabricator.wikimedia.org/P59983 and previous config saved to /var/cache/conftool/dbconfig/20240409-051027-marostegui.json

Change #1017451 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1222 to s2 master

https://gerrit.wikimedia.org/r/1017451

Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-04-09T05:28:14Z] <marostegui> Starting s2 eqiad failover from db1162 to db1222 - T362036

Mentioned in SAL (#wikimedia-operations) [2024-04-09T05:28:28Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T362036', diff saved to https://phabricator.wikimedia.org/P59985 and previous config saved to /var/cache/conftool/dbconfig/20240409-052827-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2024-04-09T05:28:55Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1222 to s2 primary and set section read-write T362036', diff saved to https://phabricator.wikimedia.org/P59986 and previous config saved to /var/cache/conftool/dbconfig/20240409-052855-marostegui.json

Change #1017452 merged by Marostegui:

[operations/dns@master] wmnet: Update s2-master alias

https://gerrit.wikimedia.org/r/1017452

Mentioned in SAL (#wikimedia-operations) [2024-04-09T05:30:12Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1162 T362036', diff saved to https://phabricator.wikimedia.org/P59988 and previous config saved to /var/cache/conftool/dbconfig/20240409-053005-root.json

Marostegui updated the task description.
Marostegui updated the task description.

This is done

Mentioned in SAL (#wikimedia-operations) [2024-04-10T10:17:46Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1175 T362036', diff saved to https://phabricator.wikimedia.org/P60210 and previous config saved to /var/cache/conftool/dbconfig/20240410-101746-root.json
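
As a rough cross-check of the timeline above, the read-only window can be estimated from the two dbctl commit timestamps in the SAL entries: the read-only commit at 2024-04-09T05:28:28Z and the read-write commit at 2024-04-09T05:28:55Z. The helper name seconds_between is hypothetical, and the sketch assumes GNU date (for the -d option). The commit times only approximate when the section actually became read-only and writable again.

```shell
# Hypothetical helper: seconds between two ISO 8601 UTC timestamps.
# Assumes GNU date, which accepts -d with a "YYYY-MM-DDTHH:MM:SSZ" string.
seconds_between() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $((end - start))
}
```

Calling seconds_between 2024-04-09T05:28:28Z 2024-04-09T05:28:55Z gives 27, so the section was read-only for roughly half a minute.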