Switchover s1 master (db1118 -> db1163)
Closed, Resolved, Public

Description

When: Thursday, May 19, 2022, 06:00 UTC

Checklist:

  • Create a task to communicate the chosen date and send an announcement to the community: FIXME
  • Create a calendar entry for the maintenance, invite sre-data-persistence@
  • Add to deployments calendar. E.g.:
{{Deployment calendar event card
    |when=2021-08-24 23:00 SF
    |length=0.5
    |window=Database primary switchover for s7
    |who={{ircnick|kormat|Stevie Beth Mhaol}}, {{ircnick|marostegui|Manuel 'Early Bird' Arostegui}}, {{ircnick|Amir1|Amir}}
    |what=https://phabricator.wikimedia.org/T301312
}}

NEW primary: db1163
OLD primary: db1118

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1118.eqiad.wmnet h=db1163.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s1 T301312" 'A:db-section-s1'
  • Set NEW primary with weight 0
sudo dbctl instance db1163 set-weight 0
sudo dbctl config commit -m "Set db1163 with weight 0 T301312"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1118 db1163
  • Disable puppet on both nodes
sudo cumin 'db1118* or db1163*' 'disable-puppet "primary switchover T301312"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
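
The prep steps above can be sketched as a dry run that only prints the commands, so they can be reviewed before anyone pastes them. This is a minimal sketch: `print_prep_commands` is a hypothetical helper, not part of dbctl or db-switchover, and it only fills the host names, section and task ID into the commands already listed in this checklist.

```shell
#!/bin/bash
# Dry run only: prints the failover-prep commands, never executes them.
# print_prep_commands is a hypothetical helper (an assumption, not an existing tool).
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quotes are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Values for this task.
print_prep_commands db1118 db1163 s1 T301312
```

The gerrit puppet change is not included, since it has to be merged by hand.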

Failover:

  • Log the failover:
!log Starting s1 eqiad failover from db1118 to db1163 - T301312
  • Set section read-only:
sudo dbctl --scope eqiad section s1 ro "Maintenance until 06:15 UTC - T301312"
sudo dbctl config commit -m "Set s1 eqiad as read-only for maintenance - T301312"
  • Check s1 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1118 db1163
echo "===== db1118 (OLD)"; sudo db-mysql db1118 -e 'show slave status\G'
echo "===== db1163 (NEW)"; sudo db-mysql db1163 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s1 set-master db1163
sudo dbctl --scope eqiad section s1 rw
sudo dbctl config commit -m "Promote db1163 to s1 primary and set section read-write T301312"
  • Restart puppet on both hosts:
sudo cumin 'db1118* or db1163*' 'run-puppet-agent -e "primary switchover T301312"'
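
The two `show slave status\G` checks above are read by eye. A minimal sketch of an automated check follows, assuming the usual `Slave_IO_Running` and `Slave_SQL_Running` fields in that output. `replication_ok` is a hypothetical helper, not an existing tool.

```shell
#!/bin/bash
# replication_ok (hypothetical helper): reads `show slave status\G` output on
# stdin and succeeds only if both replication threads report "Yes".
# Empty input (no replication configured) counts as a failure.
replication_ok() {
    awk -F': *' '
        $1 ~ /^ *Slave_IO_Running$/  { io  = $2 }
        $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Possible usage on the OLD primary, which should now replicate from the NEW one:
#   sudo db-mysql db1118 -e 'show slave status\G' | replication_ok && echo "db1118 is replicating"
```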

Clean up tasks:

  • Clean up heartbeat table(s):
delete from heartbeat.heartbeat where server_id=171970572
  • Change events for query killer:
events_coredb_master.sql on the new primary db1163
events_coredb_slave.sql on the new slave db1118
  • Update DNS: FIXME
  • Update candidate primary dbctl notes
sudo dbctl instance db1118 set-candidate-master --section s1 true
sudo dbctl instance db1163 set-candidate-master --section s1 false
sudo dbctl instance db1118 depool
sudo dbctl config commit -m "Depool db1118 T301312"
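
For the heartbeat cleanup above, it may help to look at the rows before deleting them. The sketch below only prints the statements. `heartbeat_cleanup_sql` is a hypothetical helper, and the `SELECT` is an added precaution that is not part of the original checklist.

```shell
#!/bin/bash
# heartbeat_cleanup_sql (hypothetical helper): prints a SELECT and the DELETE
# for one server_id, and refuses anything that is not a plain integer.
heartbeat_cleanup_sql() {
    local server_id="$1"
    case "$server_id" in
        ''|*[!0-9]*) echo "invalid server_id: $server_id" >&2; return 1 ;;
    esac
    # Inspect first, then delete.
    printf 'SELECT * FROM heartbeat.heartbeat WHERE server_id=%s;\n' "$server_id"
    printf 'DELETE FROM heartbeat.heartbeat WHERE server_id=%s;\n' "$server_id"
}

# server_id taken from this task's checklist.
heartbeat_cleanup_sql 171970572
```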

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui updated Other Assignee, added: Marostegui.
Marostegui added a project: DBA.

@Kormat let's make sure db1163 is running Bullseye before promoting it to master

Just to add to the pile of wishes: can we make sure T300775 is applied on db1163 as well?

Marostegui updated the task description.
Marostegui updated the task description.
Ladsgroup moved this task from Ready to In progress on the DBA board.
Ladsgroup added a project: User-notice.
Ladsgroup added a subscriber: Kormat.

Stevie Beth is out sick. So I'm taking over. Will do it this Thursday.

Change 792707 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/792707

Change 792708 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/792708

Change 792709 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/792709

Mentioned in SAL (#wikimedia-operations) [2022-05-19T05:24:21Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s1 T301312

Mentioned in SAL (#wikimedia-operations) [2022-05-19T05:24:42Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s1 T301312

Mentioned in SAL (#wikimedia-operations) [2022-05-19T05:25:18Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1163 with weight 0 T301312', diff saved to https://phabricator.wikimedia.org/P28059 and previous config saved to /var/cache/conftool/dbconfig/20220519-052517-ladsgroup.json

Change 792707 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/792707

This task is quite old, so it follows the old template. Make sure to also run this once the switchover is done:

(dborch1001): sudo orchestrator-client -c untag -i db1163 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1118 --tag name=candidate

delete from heartbeat.heartbeat where server_id=171970572

Mentioned in SAL (#wikimedia-operations) [2022-05-19T06:00:08Z] <Amir1> Starting s1 eqiad failover from db1118 to db1163 - T301312

Mentioned in SAL (#wikimedia-operations) [2022-05-19T06:00:24Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T301312', diff saved to https://phabricator.wikimedia.org/P28063 and previous config saved to /var/cache/conftool/dbconfig/20220519-060023-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-19T06:01:20Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T301312', diff saved to https://phabricator.wikimedia.org/P28064 and previous config saved to /var/cache/conftool/dbconfig/20220519-060119-ladsgroup.json

Change 792709 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/792709

Mentioned in SAL (#wikimedia-operations) [2022-05-19T06:05:42Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1118 T301312', diff saved to https://phabricator.wikimedia.org/P28066 and previous config saved to /var/cache/conftool/dbconfig/20220519-060542-ladsgroup.json

Change 792708 merged by Ladsgroup:

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/792708

Quiddity subscribed.

(Was too late for last week's Tech News. Untagging.)

@Ladsgroup some of the steps were not marked as done. Not sure if they were forgotten or were checked but not marked. I have double-checked them.