Switchover s6 master (db2114 -> db2129)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist

Checklist:

NEW primary: db2129
OLD primary: db2114

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2114.codfw.wmnet h=db2129.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s6 T355739" 'A:db-section-s6'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2129 set-weight 0
sudo dbctl config commit -m "Set db2129 with weight 0 T355739"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --only-slave-move db2114 db2129
  • Disable puppet on both nodes
sudo cumin 'db2114* or db2129*' 'disable-puppet "primary switchover T355739"'

Failover:

  • Log the failover:
!log Starting s6 codfw failover from db2114 to db2129 - T355739
  • Set section read-only:
sudo dbctl --scope eqiad section s6 ro "Maintenance until 06:15 UTC - T355739"
sudo dbctl --scope codfw section s6 ro "Maintenance until 06:15 UTC - T355739"
sudo dbctl config commit -m "Set s6 codfw as read-only for maintenance - T355739"
  • Check s6 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db2114 db2129
echo "===== db2114 (OLD)"; sudo db-mysql db2114 -e 'show slave status\G'
echo "===== db2129 (NEW)"; sudo db-mysql db2129 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section s6 set-master db2129
sudo dbctl --scope eqiad section s6 rw
sudo dbctl --scope codfw section s6 rw
sudo dbctl config commit -m "Promote db2129 to s6 primary and set section read-write T355739"
  • Restart puppet on both hosts:
sudo cumin 'db2114* or db2129*' 'run-puppet-agent -e "primary switchover T355739"'

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2129 heartbeat -e "delete from heartbeat where file like 'db2114%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2129
events_coredb_slave.sql on the new slave db2114
sudo dbctl instance db2114 set-candidate-master --section s6 true
sudo dbctl instance db2129 set-candidate-master --section s6 false
(dborch1001): sudo orchestrator-client -c untag -i db2129 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2114 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's6';"
  • (If needed): Depool db2114 for maintenance.
sudo dbctl instance db2114 depool
sudo dbctl config commit -m "Depool db2114 T355739"
  • Change db2114 weight to mimic the previous weight of db2129:
sudo dbctl instance db2114 edit
  • Apply outstanding schema changes to db2114 (if any)
  • Update/resolve this ticket.
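
The failover-prep commands in the checklist differ between switchovers only in the host names, the section and the task ID. As a minimal sketch (not part of the actual tooling), the helper below prints those commands from four arguments without running anything. The function name `render_prep` and its argument order are assumptions made for illustration; the command lines themselves are copied from the checklist above.

```shell
# Hypothetical helper: print (do not run) the failover-prep commands
# from the checklist for a given OLD primary, NEW primary, section and task.
render_prep() {
  old=$1 new=$2 section=$3 task=$4
  # Unquoted heredoc: variables expand, quotes are kept literally.
  cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover $section $task" 'A:db-section-$section'
sudo dbctl instance $new set-weight 0
sudo dbctl config commit -m "Set $new with weight 0 $task"
sudo db-switchover --timeout=25 --only-slave-move $old $new
sudo cumin '$old* or $new*' 'disable-puppet "primary switchover $task"'
EOF
}

render_prep db2114 db2129 s6 T355739
```

Printing rather than executing keeps the sketch safe to run anywhere and lets the output be reviewed against the checklist before anything is pasted into a terminal.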

Event Timeline

Change 992442 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2129 to s6 master

https://gerrit.wikimedia.org/r/992442

Change 992443 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/992443

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

Will be done on Tuesday next week

Mentioned in SAL (#wikimedia-operations) [2024-01-30T05:19:20Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355739

Mentioned in SAL (#wikimedia-operations) [2024-01-30T05:19:45Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355739

Mentioned in SAL (#wikimedia-operations) [2024-01-30T05:19:53Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2129 with weight 0 T355739', diff saved to https://phabricator.wikimedia.org/P55844 and previous config saved to /var/cache/conftool/dbconfig/20240130-051952-marostegui.json

"db2117": 200,                                                     "db2117": 200,
"db2124": 400,                                                     "db2124": 400,
"db2129": 400,                                                     "db2129": 0,
"db2151": 300,                                                     "db2151": 300,
"db2158": 300,                                                     "db2158": 300,
"db2171:3316": 150,                                                "db2171:3316": 150,
"db2180": 100,                                                     "db2180": 100,
"db2193": 100                                                      "db2193": 100
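
In the side-by-side diff above, the only weight that changes is db2129 (400 to 0). A small sketch of how such a two-column listing could be checked mechanically is shown here; the function name `changed_weights` is hypothetical, and it assumes each input line has exactly the `"host": old, "host": new` shape seen above.

```shell
# Hypothetical helper: read side-by-side weight lines on stdin and
# print only the hosts whose weight differs between the two columns.
changed_weights() {
  awk '{
    gsub(/[",]/, "")   # drop quotes and commas; fields become: host: old host: new
    if ($2 != $4) {
      sub(/:$/, "", $1)  # remove the trailing colon from the host name
      print $1 ": " $2 " -> " $4
    }
  }'
}

changed_weights <<'EOF'
"db2117": 200,                                                     "db2117": 200,
"db2124": 400,                                                     "db2124": 400,
"db2129": 400,                                                     "db2129": 0,
"db2151": 300,                                                     "db2151": 300,
"db2158": 300,                                                     "db2158": 300,
"db2171:3316": 150,                                                "db2171:3316": 150,
"db2180": 100,                                                     "db2180": 100,
"db2193": 100                                                      "db2193": 100
EOF
# prints: db2129: 400 -> 0
```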

Change 992442 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2129 to s6 master

https://gerrit.wikimedia.org/r/992442

Mentioned in SAL (#wikimedia-operations) [2024-01-30T05:40:05Z] <marostegui> Starting s6 codfw failover from db2114 to db2129 - T355739

Mentioned in SAL (#wikimedia-operations) [2024-01-30T05:40:26Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s6 codfw as read-only for maintenance - T355739', diff saved to https://phabricator.wikimedia.org/P55845 and previous config saved to /var/cache/conftool/dbconfig/20240130-054025-root.json

Mentioned in SAL (#wikimedia-operations) [2024-01-30T05:40:54Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2129 to s6 primary and set section read-write T355739', diff saved to https://phabricator.wikimedia.org/P55846 and previous config saved to /var/cache/conftool/dbconfig/20240130-054053-root.json

Mentioned in SAL (#wikimedia-operations) [2024-01-30T05:41:55Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2114 T355739', diff saved to https://phabricator.wikimedia.org/P55847 and previous config saved to /var/cache/conftool/dbconfig/20240130-054154-root.json

Change 992443 merged by Marostegui:

[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/992443

Marostegui updated the task description.

Done
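
The SAL entries above give a rough measure of how long s6 was read-only: the read-only commit was logged at 05:40:26Z and the read-write commit at 05:40:54Z. The sketch below computes the gap between those two log timestamps. It assumes both fall on the same day and that the SAL timestamps approximate when each commit took effect; `to_seconds` is a helper name introduced here for illustration.

```shell
# Hypothetical helper: convert HH:MM:SS to seconds since midnight.
to_seconds() {
  old_ifs=$IFS
  IFS=:
  set -- $1          # split on ":" into $1=HH $2=MM $3=SS
  IFS=$old_ifs
  # Strip one leading zero so values like 08 are not parsed as octal.
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

ro_commit=05:40:26   # "Set s6 codfw as read-only for maintenance" commit
rw_commit=05:40:54   # "Promote db2129 to s6 primary and set section read-write" commit

echo "$(( $(to_seconds "$rw_commit") - $(to_seconds "$ro_commit") )) seconds"
# prints: 28 seconds
```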