Switchover s2 master (db1122 -> db1162)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window: 26th April at 06:00 AM UTC

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist

Checklist:

NEW primary: db1162
OLD primary: db1122

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1122.eqiad.wmnet h=db1162.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s2 T306417" 'A:db-section-s2'
  • Set NEW primary with weight 0
sudo dbctl instance db1162 set-weight 0
sudo dbctl config commit -m "Set db1162 with weight 0 T306417"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1122 db1162
  • Disable puppet on both nodes
sudo cumin 'db1122* or db1162*' 'disable-puppet "primary switchover T306417"'

Failover:

  • Log the failover:
!log Starting s2 eqiad failover from db1122 to db1162 - T306417
  • Set section read-only:
sudo dbctl --scope eqiad section s2 ro "Maintenance until 06:15 UTC - T306417"
sudo dbctl config commit -m "Set s2 eqiad as read-only for maintenance - T306417"
  • Check s2 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1122 db1162
echo "===== db1122 (OLD)"; sudo db-mysql db1122 -e 'show slave status\G'
echo "===== db1162 (NEW)"; sudo db-mysql db1162 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s2 set-master db1162
sudo dbctl --scope eqiad section s2 rw
sudo dbctl config commit -m "Promote db1162 to s2 primary and set section read-write T306417"
  • Restart puppet on both hosts:
sudo cumin 'db1122* or db1162*' 'run-puppet-agent -e "primary switchover T306417"'

Clean up tasks:

  • Clean up heartbeat table(s): delete from heartbeat.heartbeat where server_id=171978786;
  • Change events for query killer:
events_coredb_master.sql on the new primary db1162
events_coredb_slave.sql on the new slave db1122
sudo dbctl instance db1122 set-candidate-master --section s2 true
sudo dbctl instance db1162 set-candidate-master --section s2 false
(dborch1001): sudo orchestrator-client -c untag -i db1162 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1122 --tag name=candidate
sudo dbctl instance db1122 depool
sudo dbctl config commit -m "Depool db1122 T306417"
  • Apply outstanding schema changes to db1122 (if any). T306417#7879562
  • Update/resolve this ticket.

Event Timeline

Marostegui renamed this task from "Switchover s2 master (db1122 -> db1162" to "Switchover s2 master (db1122 -> db1162)". (Apr 19 2022, 7:59 AM)
Marostegui triaged this task as Medium priority.
Marostegui updated Other Assignee, added: Ladsgroup.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

db1162 needs to be rebooted to pick the right kernel for T303174

Done

While doing T298565 I made a mess of the user table in s2 (user_email_token_expires field) because of https://gerrit.wikimedia.org/r/c/operations/software/schema-changes/+/774772/2/2022/fix_user_varbinaries_T298565.py#13

I'll make sure to get it fixed before the switchover.

It can be postponed if needed. Don't worry about it at all.

Nah, I should have fixed it a long time ago.

Started the schema change, it'll be done by tomorrow.

Change 785602 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1162 to s2 master

https://gerrit.wikimedia.org/r/785602

Change 785603 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s2-master alias

https://gerrit.wikimedia.org/r/785603

Mentioned in SAL (#wikimedia-operations) [2022-04-26T04:53:36Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s2 T306417

Mentioned in SAL (#wikimedia-operations) [2022-04-26T04:53:51Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s2 T306417

Mentioned in SAL (#wikimedia-operations) [2022-04-26T04:54:07Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1162 with weight 0 T306417', diff saved to https://phabricator.wikimedia.org/P26498 and previous config saved to /var/cache/conftool/dbconfig/20220426-045406-root.json

Change 785602 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1162 to s2 master

https://gerrit.wikimedia.org/r/785602

Change 786162 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1122: Disable notifications

https://gerrit.wikimedia.org/r/786162

Change 786162 merged by Marostegui:

[operations/puppet@production] db1122: Disable notifications

https://gerrit.wikimedia.org/r/786162

Mentioned in SAL (#wikimedia-operations) [2022-04-26T06:00:17Z] <marostegui> Starting s2 eqiad failover from db1122 to db1162 - T306417

Mentioned in SAL (#wikimedia-operations) [2022-04-26T06:00:34Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T306417', diff saved to https://phabricator.wikimedia.org/P26500 and previous config saved to /var/cache/conftool/dbconfig/20220426-060033-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-04-26T06:00:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1162 to s2 primary and set section read-write T306417', diff saved to https://phabricator.wikimedia.org/P26501 and previous config saved to /var/cache/conftool/dbconfig/20220426-060058-marostegui.json

Change 785603 merged by Marostegui:

[operations/dns@master] wmnet: Update s2-master alias

https://gerrit.wikimedia.org/r/785603

Mentioned in SAL (#wikimedia-operations) [2022-04-26T06:03:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1122 T306417', diff saved to https://phabricator.wikimedia.org/P26502 and previous config saved to /var/cache/conftool/dbconfig/20220426-060344-root.json

Mentioned in SAL (#wikimedia-operations) [2022-04-26T06:06:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db1162 is current s2 master, should not be in API T306417', diff saved to https://phabricator.wikimedia.org/P26503 and previous config saved to /var/cache/conftool/dbconfig/20220426-060602-marostegui.json

Marostegui updated the task description.

Switchover was done. RO times:

  • Start: 06:00:34
  • Stop: 06:00:59

Read only time: 25 seconds
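
As a sanity check on the figure above, the read-only window can be recomputed from the two dbctl commit timestamps. This is only a minimal POSIX shell sketch: the helper names to_seconds and ro_duration are hypothetical (not part of dbctl or any other tooling mentioned here), and it assumes both timestamps fall on the same UTC day.

```shell
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS timestamp to seconds since midnight.
# The body runs in a subshell, so the IFS change and variables do not leak.
to_seconds() (
  IFS=:
  # Split the argument on ":" into the positional parameters $1 $2 $3.
  set -- $1
  # Strip one leading zero so values such as "08" are not read as octal.
  h=${1#0}; m=${2#0}; s=${3#0}
  echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
)

# Hypothetical helper: seconds elapsed between two same-day timestamps.
ro_duration() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Start and stop times taken from the read-only and read-write commits above.
ro_duration 06:00:34 06:00:59
```

With the start and stop times listed above, this prints 25, matching the reported read-only time.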

The old master schema changes will be tracked on their own task.