Switchover s5 master db1130 -> db1100
Closed, Resolved, Public

Description

When: Thursday 31st - 06:00 AM UTC
Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist

NEW primary: db1100
OLD primary: db1130

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1130.eqiad.wmnet h=db1100.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T303798" 'A:db-section-s5'
  • Set NEW primary with weight 0
sudo dbctl instance db1100 set-weight 0
sudo dbctl config commit -m "Set db1100 with weight 0 T303798"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move db1130 db1100
  • Disable puppet on both nodes
sudo cumin 'db1130* or db1100*' 'disable-puppet "primary switchover T303798"'
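The prep commands above hard-code the hosts, the section and the task ID. A minimal sketch of the same sequence with those values as parameters, printing the commands for review rather than running them. The function name prep_commands and its argument order are hypothetical, and the 'A:db-section-<section>' alias is assumed to follow the same pattern for other sections:

```shell
# Hypothetical helper: print the failover prep commands for review.
# Arguments: OLD primary, NEW primary, section, task ID.
# Nothing is executed; the output is meant to be read and then pasted.
prep_commands() {
  old="$1"; new="$2"; section="$3"; task="$4"
  cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=15 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# With the values from this task, the output matches the checklist above.
prep_commands db1130 db1100 s5 T303798
```

Printing instead of executing keeps the operator in the loop at each step, which seems closer to how this checklist is actually used.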

Failover:

  • Log the failover:
!log Starting s5 eqiad failover from db1130 to db1100 - T303798
  • Set section read-only:
sudo dbctl --scope eqiad section s5 ro "Maintenance until 05:15 UTC - T303798"
sudo dbctl config commit -m "Set s5 eqiad as read-only for maintenance - T303798"
  • Check s5 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1130 db1100
echo "===== db1130 (OLD)"; sudo db-mysql db1130 -e 'show slave status\G'
echo "===== db1100 (NEW)"; sudo db-mysql db1100 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s5 set-master db1100
sudo dbctl --scope eqiad section s5 rw
sudo dbctl config commit -m "Promote db1100 to s5 primary and set section read-write T303798"
  • Restart puppet on both hosts:
sudo cumin 'db1130* or db1100*' 'run-puppet-agent -e "primary switchover T303798"'
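The two "show slave status\G" calls above are where the new topology can be confirmed after the switch. A small sketch for pulling out just the Master_Host field from that output. The helper name master_host is hypothetical, and the expectation that db1130 then reports db1100 while db1100 reports nothing is an assumption based on db1130 becoming the new slave:

```shell
# Hypothetical helper: read "show slave status\G" output on stdin and
# print the value of the Master_Host field, or nothing if the host is
# not replicating (empty slave status).
master_host() {
  awk -F': ' '/^[[:space:]]*Master_Host:/ { print $2 }'
}
```

Usage, on a host where db-mysql is available: sudo db-mysql db1130 -e 'show slave status\G' | master_host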

Clean up tasks:

  • Clean up heartbeat table(s):
delete from heartbeat.heartbeat where server_id=171970593
  • Change events for query killer:
events_coredb_master.sql on the new primary db1100
events_coredb_slave.sql on the new slave db1130
  • Update candidate master flags in dbctl and depool the old primary:
sudo dbctl instance db1130 set-candidate-master --section s5 true
sudo dbctl instance db1100 set-candidate-master --section s5 false
sudo dbctl instance db1130 depool
sudo dbctl config commit -m "Depool db1130 T303798"
  • Update the candidate tag in orchestrator:
(dborch1001): sudo orchestrator-client -c untag -i db1100 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1130 --tag name=candidate
  • Apply outstanding schema changes to db1130 (if any)
  • Update/resolve this ticket.
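The heartbeat clean-up in the list above is a hand-typed DELETE keyed on a numeric server_id, so a typo there could remove the wrong rows. A minimal sketch of a guard that only emits the statement when the ID is purely numeric. The helper name heartbeat_delete_sql is hypothetical and not part of any existing tooling:

```shell
# Hypothetical helper: print the heartbeat clean-up statement for a
# given server_id, refusing anything that is not a plain integer.
heartbeat_delete_sql() {
  case "$1" in
    ''|*[!0-9]*)
      echo "server_id must be a non-negative integer, got: '$1'" >&2
      return 1
      ;;
  esac
  printf 'delete from heartbeat.heartbeat where server_id=%s\n' "$1"
}

# Prints the statement used in this task.
heartbeat_delete_sql 171970593
```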

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to Ready on the DBA board.

Future master rebooted with the new kernel.

Change 775195 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1100 to s5 master

https://gerrit.wikimedia.org/r/775195

Change 775196 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s5-master CNAME

https://gerrit.wikimedia.org/r/775196

Mentioned in SAL (#wikimedia-operations) [2022-03-31T04:38:37Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 22 hosts with reason: Primary switchover s5 T303798

Mentioned in SAL (#wikimedia-operations) [2022-03-31T04:38:52Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 22 hosts with reason: Primary switchover s5 T303798

Mentioned in SAL (#wikimedia-operations) [2022-03-31T04:39:06Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1100 with weight 0 T303798', diff saved to https://phabricator.wikimedia.org/P23978 and previous config saved to /var/cache/conftool/dbconfig/20220331-043906-marostegui.json

Change 775195 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1100 to s5 master

https://gerrit.wikimedia.org/r/775195

Mentioned in SAL (#wikimedia-operations) [2022-03-31T06:00:19Z] <marostegui> Starting s5 eqiad failover from db1130 to db1100 - T303798

Mentioned in SAL (#wikimedia-operations) [2022-03-31T06:00:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T303798', diff saved to https://phabricator.wikimedia.org/P23986 and previous config saved to /var/cache/conftool/dbconfig/20220331-060042-root.json

Mentioned in SAL (#wikimedia-operations) [2022-03-31T06:01:22Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T303798', diff saved to https://phabricator.wikimedia.org/P23987 and previous config saved to /var/cache/conftool/dbconfig/20220331-060122-root.json

Change 775196 merged by Marostegui:

[operations/dns@master] wmnet: Update s5-master CNAME

https://gerrit.wikimedia.org/r/775196

Mentioned in SAL (#wikimedia-operations) [2022-03-31T06:08:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1130 T303798', diff saved to https://phabricator.wikimedia.org/P23991 and previous config saved to /var/cache/conftool/dbconfig/20220331-060820-root.json

Marostegui updated the task description.

This was all done. Read-only time was from 06:00:42 to 06:01:22, so 40 seconds.
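The read-only window can be recomputed from the two timestamps rather than worked out by hand. A minimal sketch, assuming both timestamps fall on the same day and are given as HH:MM:SS; the function names are hypothetical:

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
# awk treats fields such as "08" as decimal, so no octal surprises.
hms_to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed seconds between two same-day timestamps: start, then end.
elapsed_seconds() {
  echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}

# Read-only start and end as logged for this switchover.
elapsed_seconds "06:00:42" "06:01:22"
```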