
Switchover s2 master (db2207 -> db2204)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist

Checklist:

NEW primary: db2204
OLD primary: db2207

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2207.codfw.wmnet h=db2204.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s2 T366038" 'A:db-section-s2'
  • Set NEW primary with weight 0
sudo dbctl instance db2204 set-weight 0
sudo dbctl config commit -m "Set db2204 with weight 0 T366038"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2204 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2204 from API/vslow/dump T366038"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2207 db2204
  • Disable puppet on both nodes
sudo cumin 'db2207* or db2204*' 'disable-puppet "primary switchover T366038"'

Failover:

  • Log the failover:
!log Starting s2 codfw failover from db2207 to db2204 - T366038
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2207 db2204
echo "===== db2207 (OLD)"; sudo db-mysql db2207 -e 'show slave status\G'
echo "===== db2204 (NEW)"; sudo db-mysql db2204 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s2 set-master db2204
sudo dbctl config commit -m "Promote db2204 to s2 primary T366038"
  • Restart puppet on both hosts:
sudo cumin 'db2207* or db2204*' 'run-puppet-agent -e "primary switchover T366038"'
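The two `show slave status\G` checks in the failover step produce long output that is easy to misread under time pressure. A minimal sketch of a helper that condenses that output into one line is shown here. The function name `replication_summary` is hypothetical and not part of any existing tooling; the field names (`Master_Host`, `Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`) are the standard MariaDB ones. It only reads text on stdin and changes nothing on the hosts.

```shell
# Hypothetical helper (not part of the existing tooling): summarise the
# replication state found in `show slave status\G` text read from stdin.
# Exits 0 only if both the IO and SQL threads report "Yes".
replication_summary() {
    awk -F': ' '
        /^ *Master_Host:/           { host = $2 }
        /^ *Slave_IO_Running:/      { io = $2 }
        /^ *Slave_SQL_Running:/     { sql = $2 }
        /^ *Seconds_Behind_Master:/ { lag = $2 }
        END {
            printf "master=%s io=%s sql=%s lag=%s\n", host, io, sql, lag
            exit !(io == "Yes" && sql == "Yes")
        }'
}
```

Assuming the function has been defined in the current shell, it could be used as `sudo db-mysql db2207 -e 'show slave status\G' | replication_summary`. Which master each host is expected to report after the switch still has to be judged by the operator.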

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2204 heartbeat -e "delete from heartbeat where file like 'db2207%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2204
events_coredb_slave.sql on the new slave db2207
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2207 set-candidate-master --section s2 true
sudo dbctl instance db2204 set-candidate-master --section s2 false
(dborch1001): sudo orchestrator-client -c untag -i db2204 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2207 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's2';"
  • (If needed): Depool db2207 for maintenance.
sudo dbctl instance db2207 depool
sudo dbctl config commit -m "Depool db2207 T366038"
  • Change db2207 weight to mimic the previous weight of db2204:
sudo dbctl instance db2207 edit
  • Apply outstanding schema changes to db2207 (if any)
  • Update/resolve this ticket.
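The heartbeat clean-up step above runs a `delete` with a `like` pattern on the new primary, so it can be worth counting the matching rows first. A minimal sketch of that idea is shown here. The function `heartbeat_cleanup_sql` is hypothetical; it only prints SQL text and touches no database. The table and column names are the ones used in the checklist command, and the host-name check is an assumption about what a valid host name looks like.

```shell
# Hypothetical helper: print a read-only preview query followed by the
# matching delete for heartbeat rows written by the old primary.
# It only prints SQL; nothing is executed.
heartbeat_cleanup_sql() {
    old_primary="$1"
    # Assumption: host names contain only lowercase letters, digits and "-".
    case "$old_primary" in
        "" | *[!a-z0-9-]*)
            echo "unexpected host name: $old_primary" >&2
            return 1
            ;;
    esac
    printf "select count(*) from heartbeat where file like '%s%%';\n" "$old_primary"
    printf "delete from heartbeat where file like '%s%%';\n" "$old_primary"
}
```

With the function defined, the preview could be run as `sudo db-mysql db2204 heartbeat -e "$(heartbeat_cleanup_sql db2207 | head -n 1)"` before issuing the delete from the checklist.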

Event Timeline

Change #1035872 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2204 to s2 master

https://gerrit.wikimedia.org/r/1035872

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui subscribed.

Waiting for T364985 to be completed

It'd be great if you could wait for T352010: Gradually drop old pagelinks columns too. I'm about to start the schema change on the PK in s2.


Sure, can you prioritize codfw first then?

Mentioned in SAL (#wikimedia-operations) [2024-06-05T07:07:51Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T366038

Mentioned in SAL (#wikimedia-operations) [2024-06-05T07:07:59Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2204 with weight 0 T366038', diff saved to https://phabricator.wikimedia.org/P64057 and previous config saved to /var/cache/conftool/dbconfig/20240605-070758-root.json

Mentioned in SAL (#wikimedia-operations) [2024-06-05T07:08:15Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T366038

Change #1039067 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2204 to s2 master

https://gerrit.wikimedia.org/r/1039067

Change #1035872 abandoned by Marostegui:

[operations/puppet@production] mariadb: Promote db2204 to s2 master

Reason:

https://gerrit.wikimedia.org/r/1035872

Change #1039067 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2204 to s2 master

https://gerrit.wikimedia.org/r/1039067

Mentioned in SAL (#wikimedia-operations) [2024-06-05T07:24:09Z] <marostegui> Starting s2 codfw failover from db2207 to db2204 - T366038

Mentioned in SAL (#wikimedia-operations) [2024-06-05T07:24:28Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2204 to s2 primary T366038', diff saved to https://phabricator.wikimedia.org/P64058 and previous config saved to /var/cache/conftool/dbconfig/20240605-072427-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2024-06-05T07:25:10Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2207 T366038', diff saved to https://phabricator.wikimedia.org/P64059 and previous config saved to /var/cache/conftool/dbconfig/20240605-072509-root.json