Switchover s4 from db2090 to db2110
Closed, Resolved (Public)

Description

When: 6th September 2021 at 05:00 UTC

Checklist:

  • Create a task to communicate the chosen date and send an announcement to the community: T289660
  • Create a calendar entry for the maintenance, invite sre-data-persistence@
  • Add to deployments calendar. E.g.:
{{Deployment calendar event card
    |when=2021-09-05 22:00 SF
    |length=0.5
    |window=Database primary switchover for s4
    |who={{ircnick|kormat|Stevie Beth Mhaol}}, {{ircnick|marostegui|Manuel 'Early Bird' Arostegui}}
    |what=https://phabricator.wikimedia.org/T289650
}}

NEW primary: db2110
OLD primary: db2090

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2090.codfw.wmnet h=db2110.codfw.wmnet
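
The calendar card above gives the window in San Francisco time (`2021-09-05 22:00 SF`), while the task itself is scheduled for 05:00 UTC. A minimal sketch for cross-checking the two, assuming GNU `date`; `utc_to_sf` is a hypothetical helper name, and the POSIX TZ rule is an assumption that encodes US Pacific time with the DST rules in force since 2007:

```shell
#!/bin/sh
# Convert a UTC timestamp to San Francisco wall-clock time.
# Assumes GNU date. The POSIX TZ string below spells out the Pacific
# DST rules directly, so it does not depend on tzdata being installed.
utc_to_sf() {
    TZ='PST8PDT,M3.2.0,M11.1.0' date -d "$1 UTC" '+%Y-%m-%d %H:%M'
}

# The maintenance window from this task.
utc_to_sf '2021-09-06 05:00'
```

For this task's date the result should match the `when=` field of the card.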

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s4 T289650" 'A:db-section-s4'
  • Set NEW primary with weight 0
sudo dbctl instance db2110 set-weight 0
sudo dbctl config commit -m "Set db2110 with weight 0 T289650"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move db2090 db2110
  • Disable puppet on both nodes
sudo cumin 'db2090* or db2110*' 'disable-puppet "primary switchover T289650"'
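
The prep commands above differ between switchovers only in the host names, the section, and the task ID. As a rough sketch, they could be templated and printed for review before being run by hand; `print_prep_commands` is a hypothetical helper, not part of any existing tooling, and it only echoes the commands listed in this checklist:

```shell
#!/bin/sh
# Print (do NOT run) the failover-prep commands for a switchover.
# Arguments: OLD primary, NEW primary, section, task ID.
# The command lines are copied from the checklist above; only the
# variable parts are substituted.
print_prep_commands() {
    old=$1 new=$2 section=$3 task=$4
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=15 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

print_prep_commands db2090 db2110 s4 T289650
```

Printing rather than executing keeps each step a deliberate manual action, which matches how the checklist is meant to be followed.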

Failover:

  • Log the failover:
!log Starting s4 codfw failover from db2090 to db2110 - T289650
  • Set section read-only:
sudo dbctl --scope codfw section s4 ro "Maintenance until 05:15 UTC - T289650"
sudo dbctl config commit -m "Set s4 codfw as read-only for maintenance - T289650"
  • Check s4 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db2090 db2110
echo "===== db2090 (OLD)"; sudo mysql.py -h db2090 -e 'show slave status\G'
echo "===== db2110 (NEW)"; sudo mysql.py -h db2110 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section s4 set-master db2110
sudo dbctl --scope codfw section s4 rw
sudo dbctl config commit -m "Promote db2110 to s4 primary and set section read-write T289650"
  • Restart puppet on both hosts:
sudo cumin 'db2090* or db2110*' 'run-puppet-agent -e "primary switchover T289650"'
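
The two `show slave status\G` calls in the failover steps produce long output, of which only a few fields matter for confirming the new topology. A minimal sketch of a filter that condenses it, assuming the standard `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields; `replication_summary` is a hypothetical helper, and the sample input below is made up for illustration, not captured from these hosts:

```shell
#!/bin/sh
# Summarize `show slave status\G` output read from stdin.
# Prints one line with the replication source and both thread states.
# If the host has no replication configured, the fields come out empty.
replication_summary() {
    awk -F': ' '
        { sub(/^[ \t]+/, "", $1) }                # strip the \G left padding
        $1 == "Master_Host"       { host = $2 }
        $1 == "Slave_IO_Running"  { io = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END { printf "master=%s io=%s sql=%s\n", host, io, sql }
    '
}

# Illustrative input only (hypothetical values in the \G layout).
replication_summary <<'EOF'
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db2110.codfw.wmnet
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
EOF
```

In practice the filter would sit at the end of a pipe, for example `sudo mysql.py -h db2090 -e 'show slave status\G' | replication_summary`.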

Clean up tasks:

  • Clean up heartbeat table(s).
  • Change events for query killer:
events_coredb_master.sql on the new primary db2110
events_coredb_slave.sql on the new slave db2090
  • Update candidate-master flags in dbctl:
sudo dbctl instance db2090 set-candidate-master --section s4 true
sudo dbctl instance db2110 set-candidate-master --section s4 false
  • Check tendril was updated
  • Check zarcillo was updated
  • Depool OLD primary, as it's running 10.1, replicating from a 10.4 primary
sudo dbctl instance db2090 depool
sudo dbctl config commit -m "Depool db2090 until it's reimaged to buster T289650"
  • Apply outstanding schema changes to db2090 (if any): Nothing to apply
  • Update/resolve this ticket.

Event Timeline

Marostegui updated the task description.
Marostegui added a project: DBA.
Marostegui moved this task from Triage to Blocked on the DBA board.

Change 716217 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2110 to s4 master

https://gerrit.wikimedia.org/r/716217

Change 716218 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Switchover db2090 with db2110

https://gerrit.wikimedia.org/r/716218

Mentioned in SAL (#wikimedia-operations) [2021-09-06T04:06:41Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 33 hosts with reason: Primary switchover s4 T289650

Mentioned in SAL (#wikimedia-operations) [2021-09-06T04:07:06Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 33 hosts with reason: Primary switchover s4 T289650

Mentioned in SAL (#wikimedia-operations) [2021-09-06T04:07:41Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2110 with weight 0 T289650', diff saved to https://phabricator.wikimedia.org/P17219 and previous config saved to /var/cache/conftool/dbconfig/20210906-040740-root.json

Change 716217 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2110 to s4 master

https://gerrit.wikimedia.org/r/716217

Mentioned in SAL (#wikimedia-operations) [2021-09-06T05:00:32Z] <marostegui> Starting s4 codfw failover from db2090 to db2110 - T289650

Mentioned in SAL (#wikimedia-operations) [2021-09-06T05:00:49Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T289650', diff saved to https://phabricator.wikimedia.org/P17220 and previous config saved to /var/cache/conftool/dbconfig/20210906-050048-root.json

Mentioned in SAL (#wikimedia-operations) [2021-09-06T05:01:41Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2110 to s4 primary and set section read-write T289650', diff saved to https://phabricator.wikimedia.org/P17221 and previous config saved to /var/cache/conftool/dbconfig/20210906-050140-root.json

Mentioned in SAL (#wikimedia-operations) [2021-09-06T05:04:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2090 T289650', diff saved to https://phabricator.wikimedia.org/P17222 and previous config saved to /var/cache/conftool/dbconfig/20210906-050419-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-09-06T05:05:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2110 (current master) from API T289650', diff saved to https://phabricator.wikimedia.org/P17223 and previous config saved to /var/cache/conftool/dbconfig/20210906-050502-marostegui.json

Change 716218 merged by Marostegui:

[operations/dns@master] wmnet: Switchover db2090 with db2110

https://gerrit.wikimedia.org/r/716218

Mentioned in SAL (#wikimedia-operations) [2021-09-06T05:07:47Z] <marostegui> Stop replication on db2090 (old s4 master) T289650 T288803

Switchover was done:

RO: 05:00:49
RW: 05:01:41
Total: 52 seconds
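
The read-only duration is the difference between the two dbctl commit timestamps recorded in SAL. A small sketch for reproducing that arithmetic, assuming GNU `date`; `elapsed_seconds` is a hypothetical helper name:

```shell
#!/bin/sh
# Seconds elapsed between two UTC timestamps (GNU date assumed).
# Arguments: start timestamp, end timestamp.
elapsed_seconds() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( end - start ))
}

# RO and RW times taken from the summary above.
elapsed_seconds '2021-09-06 05:00:49' '2021-09-06 05:01:41'
```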

Marostegui updated the task description.