Switchover s8 master (db1104 -> db1109)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s8.dblist

Checklist:

NEW primary: db1109
OLD primary: db1104

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1104.eqiad.wmnet h=db1109.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s8 T314369" 'A:db-section-s8'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1109 set-weight 0
sudo dbctl config commit -m "Set db1109 with weight 0 T314369"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1104 db1109
  • Disable puppet on both nodes
sudo cumin 'db1104* or db1109*' 'disable-puppet "primary switchover T314369"'
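
The prep commands above differ between switchovers only in the host names, the section, and the task ID. A minimal dry-run sketch of that templating follows; the variable names and the `print_prep_commands` function are hypothetical, and the script only prints the commands listed in this checklist without running them.

```shell
#!/bin/sh
# Dry-run sketch: prints the failover-prep commands from this checklist
# without executing anything. Variable and function names are hypothetical.

OLD=db1104
NEW=db1109
SECTION=s8
TASK=T314369

print_prep_commands() {
    # Unquoted heredoc: variables expand, quote characters are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${SECTION} ${TASK}" 'A:db-section-${SECTION}'
sudo dbctl instance ${NEW} set-weight 0
sudo dbctl config commit -m "Set ${NEW} with weight 0 ${TASK}"
sudo db-switchover --timeout=25 --only-slave-move ${OLD} ${NEW}
sudo cumin '${OLD}* or ${NEW}*' 'disable-puppet "primary switchover ${TASK}"'
EOF
}

print_prep_commands
```

Printing rather than executing keeps the operator in control of each step, which matches how the checklist is meant to be worked through by hand.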

Failover:

  • Log the failover:
!log Starting s8 eqiad failover from db1104 to db1109 - T314369
  • Set section read-only:
sudo dbctl --scope eqiad section s8 ro "Maintenance until 06:15 UTC - T314369"
sudo dbctl config commit -m "Set s8 eqiad as read-only for maintenance - T314369"
  • Check s8 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1104 db1109
echo "===== db1104 (OLD)"; sudo db-mysql db1104 -e 'show slave status\G'
echo "===== db1109 (NEW)"; sudo db-mysql db1109 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s8 set-master db1109
sudo dbctl --scope eqiad section s8 rw
sudo dbctl config commit -m "Promote db1109 to s8 primary and set section read-write T314369"
  • Restart puppet on both hosts:
sudo cumin 'db1104* or db1109*' 'run-puppet-agent -e "primary switchover T314369"'
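
The two `show slave status\G` calls in the "Switch primaries" step produce long output that is easy to misread under time pressure. The sketch below condenses it to the two replication thread fields. It assumes that, after the switch, the OLD primary replicates from the NEW one and the NEW primary reports no replication at all; the `replication_state` function and the sample text are illustrative and not part of the documented procedure.

```shell
#!/bin/sh
# Sketch: summarise the replication threads from `show slave status\G` output.
# The function name and the sample text are illustrative assumptions.

replication_state() {
    # Reads \G-formatted output on stdin; prints "io=<v> sql=<v>",
    # or "no-replication" when neither thread field is present.
    awk -F': ' '
        /^[ \t]*Slave_IO_Running:/  { io = $2 }
        /^[ \t]*Slave_SQL_Running:/ { sql = $2 }
        END {
            if (io == "" && sql == "") print "no-replication"
            else print "io=" io " sql=" sql
        }'
}

# Abridged, made-up sample of what the OLD primary might report.
sample='*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db1109.eqiad.wmnet
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'

printf '%s\n' "$sample" | replication_state   # a healthy replica
printf '' | replication_state                 # empty output, as expected on a primary
```

In real use the function would be fed from the commands in the checklist, for example `sudo db-mysql db1104 -e 'show slave status\G' | replication_state`.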

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1109 heartbeat -e "delete from heartbeat where file like 'db1104%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1109
events_coredb_slave.sql on the new slave db1104
sudo dbctl instance db1104 set-candidate-master --section s8 true
sudo dbctl instance db1109 set-candidate-master --section s8 false
(dborch1001): sudo orchestrator-client -c untag -i db1109 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1104 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's8';"
  • (If needed): Depool db1104 for maintenance.
sudo dbctl instance db1104 depool
sudo dbctl config commit -m "Depool db1104 T314369"
  • Change db1104's weight to match db1109's previous weight:
sudo dbctl instance db1104 edit
  • Apply outstanding schema changes to db1104 (if any)
  • Update/resolve this ticket.

Event Timeline

Change 819548 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1109 to s8 master

https://gerrit.wikimedia.org/r/819548

Change 819549 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s8-master alias

https://gerrit.wikimedia.org/r/819549

Ladsgroup subscribed.

Scheduled for next Thursday (18th August)

Ladsgroup moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-08-18T04:50:49Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 31 hosts with reason: Primary switchover s8 T314369

Mentioned in SAL (#wikimedia-operations) [2022-08-18T04:51:21Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 31 hosts with reason: Primary switchover s8 T314369

Mentioned in SAL (#wikimedia-operations) [2022-08-18T04:52:19Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1109 with weight 0 T314369', diff saved to https://phabricator.wikimedia.org/P32471 and previous config saved to /var/cache/conftool/dbconfig/20220818-045218-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-18T04:54:22Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 31 hosts with reason: Primary switchover s8 T314369

Mentioned in SAL (#wikimedia-operations) [2022-08-18T04:54:43Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Primary switchover s8 T314369

Change 819548 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1109 to s8 master

https://gerrit.wikimedia.org/r/819548

Mentioned in SAL (#wikimedia-operations) [2022-08-18T06:01:23Z] <Amir1> Starting s8 eqiad failover from db1104 to db1109 - T314369

Mentioned in SAL (#wikimedia-operations) [2022-08-18T06:01:37Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T314369', diff saved to https://phabricator.wikimedia.org/P32474 and previous config saved to /var/cache/conftool/dbconfig/20220818-060137-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-18T06:02:13Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1109 to s8 primary and set section read-write T314369', diff saved to https://phabricator.wikimedia.org/P32475 and previous config saved to /var/cache/conftool/dbconfig/20220818-060213-ladsgroup.json

Change 819549 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s8-master alias

https://gerrit.wikimedia.org/r/819549

Mentioned in SAL (#wikimedia-operations) [2022-08-18T06:07:07Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1104 T314369', diff saved to https://phabricator.wikimedia.org/P32476 and previous config saved to /var/cache/conftool/dbconfig/20220818-060707-ladsgroup.json

Ladsgroup updated the task description.

Read-only time: 29 seconds
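
As a rough cross-check, the gap between the two SAL entries for the read-only and read-write dbctl commits (06:01:37 and 06:02:13) can be computed as sketched below. Those timestamps mark when each commit was logged, so the result need not equal the reported read-only time. The `seconds_of_day` helper is hypothetical.

```shell
#!/bin/sh
# Sketch: seconds between two same-day HH:MM:SS timestamps taken from the SAL.
# The helper name is hypothetical.

seconds_of_day() {
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    # Strip one leading zero so that "08" and "09" are not read as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

ro_commit=06:01:37   # SAL: 'Set s8 eqiad as read-only for maintenance'
rw_commit=06:02:13   # SAL: 'Promote db1109 to s8 primary and set section read-write'

gap=$(( $(seconds_of_day "$rw_commit") - $(seconds_of_day "$ro_commit") ))
echo "$gap seconds between SAL entries"   # prints "36 seconds between SAL entries"
```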

@Ladsgroup when you've got time, please link the Gerrit patches in this task description (for the items marked FIXME); it is useful when someone wants to review them in the future.

Okay, done, but Phabricator now shows connected tickets directly below the task description. That wasn't the case one or two years ago.