
Switchover s8 from db1104 to db1109
Closed, Resolved · Public

Description

When: 11 November 2021 at 06:00 UTC

Checklist:

  • Create a task to communicate the chosen date and send an announcement to the community: T294322
  • Create a calendar entry for the maintenance, invite sre-data-persistence@
  • Add to deployments calendar. E.g.:
{{Deployment calendar event card
    |when=2021-11-10 22:00 SF
    |length=0.5
    |window=Database primary switchover for s8
    |who={{ircnick|kormat|Stevie Beth Mhaol}}, {{ircnick|marostegui|Manuel 'Early Bird' Arostegui}}
    |what=https://phabricator.wikimedia.org/T294321
}}

NEW primary: db1109
OLD primary: db1104

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1104.eqiad.wmnet h=db1109.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s8 T294321" 'A:db-section-s8'
  • Set NEW primary with weight 0
sudo dbctl instance db1109 set-weight 0
sudo dbctl config commit -m "Set db1109 with weight 0 T294321"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move db1104 db1109
  • Disable puppet on both nodes
sudo cumin 'db1104* or db1109*' 'disable-puppet "primary switchover T294321"'
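The prep commands above differ between switchovers only in the host names, section and task number, so they can be parameterized. A minimal sketch, assuming a dry-run wrapper is wanted: the `run` helper, the `prep` function and the `DRY_RUN` variable are hypothetical names introduced for illustration, and only the commands themselves come from this checklist.

```shell
set -eu

# Values taken from this task; change them for another switchover.
OLD=db1104
NEW=db1109
SECTION=s8
TASK=T294321
DRY_RUN=${DRY_RUN:-1}   # hypothetical switch: 1 only prints, 0 also executes

# Hypothetical helper: print the command, and run it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# The four prep steps from the checklist, in the same order.
prep() {
    run sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover $SECTION $TASK" "A:db-section-$SECTION"
    run sudo dbctl instance "$NEW" set-weight 0
    run sudo dbctl config commit -m "Set $NEW with weight 0 $TASK"
    run sudo db-switchover --timeout=15 --only-slave-move "$OLD" "$NEW"
    run sudo cumin "$OLD* or $NEW*" "disable-puppet \"primary switchover $TASK\""
}

prep
```

With the default `DRY_RUN=1` nothing is executed; the printed lines can be compared against the checklist before running for real. The printed form drops the original quoting, so it is meant for review, not for copy-pasting.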

Failover:

  • Log the failover:
!log Starting s8 eqiad failover from db1104 to db1109 - T294321
  • Set section read-only:
sudo dbctl --scope eqiad section s8 ro "Maintenance until 06:15 UTC - T294321"
sudo dbctl config commit -m "Set s8 eqiad as read-only for maintenance - T294321"
  • Check s8 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1104 db1109
echo "===== db1104 (OLD)"; sudo mysql.py -h db1104 -e 'show slave status\G'
echo "===== db1109 (NEW)"; sudo mysql.py -h db1109 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s8 set-master db1109
sudo dbctl --scope eqiad section s8 rw
sudo dbctl config commit -m "Promote db1109 to s8 primary and set section read-write T294321"
  • Restart puppet on both hosts:
sudo cumin 'db1104* or db1109*' 'run-puppet-agent -e "primary switchover T294321"'
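After the switch, the `show slave status\G` output for the OLD primary is expected to show it replicating from the NEW one. A minimal sketch of checking that automatically, assuming the standard MariaDB field names `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running`; the `replicates_from` function is a hypothetical helper, not part of `db-switchover` or `mysql.py`.

```shell
# Reads `show slave status\G` output on stdin. Succeeds only if both
# replication threads report "Yes" and Master_Host starts with the
# expected host name given as the first argument.
replicates_from() {
    awk -v want="$1" '
        $1 == "Master_Host:"       { host = $2 }
        $1 == "Slave_IO_Running:"  { io = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END { exit !(index(host, want) == 1 && io == "Yes" && sql == "Yes") }
    '
}

# Intended use (not executed here):
#   sudo mysql.py -h db1104 -e 'show slave status\G' | replicates_from db1109
```

Empty input, which is what a host with no replication configured returns, makes the check fail. That is the expected result when it is pointed at the NEW primary.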

Clean up tasks:

  • Clean up heartbeat table(s).

  • Change events for query killer:

events_coredb_master.sql on the new primary db1109
events_coredb_slave.sql on the new slave db1104
  • Update candidate master in dbctl:
sudo dbctl instance db1104 set-candidate-master --section s8 true
sudo dbctl instance db1109 set-candidate-master --section s8 false
  • Check tendril was updated
  • Check zarcillo was updated
  • Depool OLD primary, as it's running 10.1, replicating from a 10.4 primary
sudo dbctl instance db1104 depool
sudo dbctl config commit -m "Depool db1104 until it's reimaged to buster T294321"
  • Update/resolve this ticket.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

Change 737831 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1109 as s8 master

https://gerrit.wikimedia.org/r/737831

Change 737832 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s8-master

https://gerrit.wikimedia.org/r/737832

Mentioned in SAL (#wikimedia-operations) [2021-11-10T06:41:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1109 with weight 0 T294321', diff saved to https://phabricator.wikimedia.org/P17715 and previous config saved to /var/cache/conftool/dbconfig/20211110-064120-root.json

Mentioned in SAL (#wikimedia-operations) [2021-11-11T05:13:51Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 31 hosts with reason: Primary switchover s8 T294321

Mentioned in SAL (#wikimedia-operations) [2021-11-11T05:14:14Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 31 hosts with reason: Primary switchover s8 T294321

Change 737831 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1109 as s8 master

https://gerrit.wikimedia.org/r/737831

Mentioned in SAL (#wikimedia-operations) [2021-11-11T06:00:22Z] <marostegui> Starting s8 eqiad failover from db1104 to db1109 - T294321

Mentioned in SAL (#wikimedia-operations) [2021-11-11T06:00:31Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T294321', diff saved to https://phabricator.wikimedia.org/P17721 and previous config saved to /var/cache/conftool/dbconfig/20211111-060031-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-11-11T06:01:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1109 to s8 primary and set section read-write T294321', diff saved to https://phabricator.wikimedia.org/P17722 and previous config saved to /var/cache/conftool/dbconfig/20211111-060102-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-11-11T06:02:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depooling db1104 (old master) T294321', diff saved to https://phabricator.wikimedia.org/P17723 and previous config saved to /var/cache/conftool/dbconfig/20211111-060242-marostegui.json

Change 737832 merged by Marostegui:

[operations/dns@master] wmnet: Update s8-master

https://gerrit.wikimedia.org/r/737832

The switchover is done.
RO started: 06:00:31
RO stopped: 06:01:02

Total read only time: 31 seconds
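The read-only window can be recomputed from the two timestamps above. A minimal sketch, assuming both times are zero-padded `HH:MM:SS` values on the same UTC day; `to_seconds` and `ro_duration` are hypothetical helper names.

```shell
# Convert a zero-padded HH:MM:SS string to seconds since midnight.
# Leading zeros are stripped so that values like "08" are not read as octal.
to_seconds() {
    ts_h=${1%%:*}
    ts_rest=${1#*:}
    ts_m=${ts_rest%%:*}
    ts_s=${ts_rest#*:}
    echo $(( ${ts_h#0} * 3600 + ${ts_m#0} * 60 + ${ts_s#0} ))
}

# Print the number of seconds between a start and a stop time.
ro_duration() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

ro_duration 06:00:31 06:01:02   # prints 31
```

The two arguments are the commit times of the read-only and read-write dbctl changes recorded in the SAL entries.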

Mentioned in SAL (#wikimedia-operations) [2021-11-11T06:06:32Z] <marostegui> Stop replication on db1104 (old master) T294321