Switchover s6 master (db2129 -> db2214)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist

Checklist:

NEW primary: db2214
OLD primary: db2129

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2129.codfw.wmnet h=db2214.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s6 T365783" 'A:db-section-s6'
  • Set NEW primary with weight 0
sudo dbctl instance db2214 set-weight 0
sudo dbctl config commit -m "Set db2214 with weight 0 T365783"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2214 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2214 from API/vslow/dump T365783"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2129 db2214
  • Disable puppet on both nodes
sudo cumin 'db2129* or db2214*' 'disable-puppet "primary switchover T365783"'

Failover:

  • Log the failover:
!log Starting s6 codfw failover from db2129 to db2214 - T365783
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2129 db2214
echo "===== db2129 (OLD)"; sudo db-mysql db2129 -e 'show slave status\G'
echo "===== db2214 (NEW)"; sudo db-mysql db2214 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s6 set-master db2214
sudo dbctl config commit -m "Promote db2214 to s6 primary T365783"
  • Restart puppet on both hosts:
sudo cumin 'db2129* or db2214*' 'run-puppet-agent -e "primary switchover T365783"'
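
The two `show slave status\G` calls in the switch step print many fields, of which only a few matter for a quick sanity check. The following is a minimal sketch, not part of the checklist: `replication_summary` is a hypothetical helper that reads that output on stdin and prints the replication source and the state of both replication threads. It assumes the standard MariaDB field names `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running`; the sample input is made up for illustration and is not output captured from these hosts.

```shell
# Hypothetical helper: summarise 'show slave status\G' output read from stdin.
# Exits 0 only when both replication threads report "Yes".
replication_summary() {
    awk -F': ' '
        {
            # Field names are right-aligned in \G output; strip the padding.
            key = $1
            gsub(/^[ \t]+/, "", key)
        }
        key == "Master_Host"       { host = $2 }
        key == "Slave_IO_Running"  { io = $2 }
        key == "Slave_SQL_Running" { sql = $2 }
        END {
            printf "master=%s io=%s sql=%s\n", host, io, sql
            ok = (io == "Yes" && sql == "Yes")
            exit (ok ? 0 : 1)
        }
    '
}

# Illustrative sample input (not real output from db2129):
replication_summary <<'EOF'
*************************** 1. row ***************************
              Master_Host: db2214.codfw.wmnet
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
EOF
# prints: master=db2214.codfw.wmnet io=Yes sql=Yes
```

In practice the helper would be fed from the checklist command, for example `sudo db-mysql db2129 -e 'show slave status\G' | replication_summary`. A host that is not replicating at all returns no rows, so the summary prints empty values and the helper exits non-zero.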

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2214 heartbeat -e "delete from heartbeat where file like 'db2129%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2214
events_coredb_slave.sql on the new slave db2129
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2129 set-candidate-master --section s6 true
sudo dbctl instance db2214 set-candidate-master --section s6 false
(dborch1001): sudo orchestrator-client -c untag -i db2214 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2129 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's6';"
  • (If needed): Depool db2129 for maintenance.
sudo dbctl instance db2129 depool
sudo dbctl config commit -m "Depool db2129 T365783"
  • Change db2129 weight to mimic the previous weight of db2214:
sudo dbctl instance db2129 edit
  • Update/resolve this ticket.
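
For the heartbeat cleanup step above, a cautious variant is to count the matching rows before deleting them. The sketch below only generates the two statements; `heartbeat_cleanup_sql` is a hypothetical helper, not an existing tool. It reuses the `heartbeat` table and `file` column exactly as they appear in the checklist and assumes nothing else about the schema.

```shell
# Hypothetical helper: print a preview query and the matching delete statement
# for the heartbeat rows written by the old primary.
heartbeat_cleanup_sql() {
    # Accept only simple host names so the value is safe to embed in SQL.
    case "$1" in
        ''|*[!a-z0-9-]*) echo "invalid host: $1" >&2; return 1 ;;
    esac
    # %% in the printf format emits a literal % for the SQL LIKE pattern.
    printf "select count(*) from heartbeat where file like '%s%%';\n" "$1"
    printf "delete from heartbeat where file like '%s%%';\n" "$1"
}

heartbeat_cleanup_sql db2129
```

Each printed statement can then be run on the new primary the same way the checklist does, by passing it to `sudo db-mysql db2214 heartbeat -e "..."`, running the count first and the delete only once the count looks plausible.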

Event Timeline

Change #1034939 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2214 to s6 master

https://gerrit.wikimedia.org/r/1034939

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-05-27T07:35:30Z] <root@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T365783

Mentioned in SAL (#wikimedia-operations) [2024-05-27T07:35:46Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2214 with weight 0 T365783', diff saved to https://phabricator.wikimedia.org/P63262 and previous config saved to /var/cache/conftool/dbconfig/20240527-073545-root.json

Mentioned in SAL (#wikimedia-operations) [2024-05-27T07:35:53Z] <root@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T365783

Change #1034939 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2214 to s6 master

https://gerrit.wikimedia.org/r/1034939

Mentioned in SAL (#wikimedia-operations) [2024-05-27T07:54:43Z] <marostegui> Starting s6 codfw failover from db2129 to db2214 - T365783

Mentioned in SAL (#wikimedia-operations) [2024-05-27T07:55:12Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2214 to s6 primary T365783', diff saved to https://phabricator.wikimedia.org/P63266 and previous config saved to /var/cache/conftool/dbconfig/20240527-075512-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2024-05-27T07:56:02Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2129 T365783', diff saved to https://phabricator.wikimedia.org/P63268 and previous config saved to /var/cache/conftool/dbconfig/20240527-075602-root.json

Marostegui updated the task description.

Switchover done