
Switchover s2 from db2107 to db2104
Closed, Resolved, Public

Description

As part of upgrading s2 to Debian Buster/MariaDB 10.4, we need to switch the master to db2104.

When: Tue 10th Aug at 05:00 AM UTC.

Checklist:

  • Create a task to communicate the chosen date and send an announcement to the community: T287449

NEW master: db2104
OLD master: db2107

  • Check configuration differences between new and old master:
sudo pt-config-diff h=db2104.codfw.wmnet,F=/root/.my.cnf h=db2107.codfw.wmnet,F=/root/.my.cnf

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Master switchover s2 T287454" 'A:db-section-s2'
  • Set NEW master with weight 0
sudo dbctl instance db2104 set-weight 0
sudo dbctl config commit -m "Set db2104 with weight 0 T287454"
  • Topology changes, move all replicas under NEW master
sudo db-switchover --timeout=15 --only-slave-move db2107.codfw.wmnet db2104.codfw.wmnet
  • Disable puppet on both nodes
sudo cumin 'db2104* or db2107*' 'disable-puppet "master switchover T287454"'

Failover:

  • Log the failover:
!log Starting s2 codfw failover from db2107 to db2104 - T287454
  • Set section read-only:
sudo dbctl --scope codfw section s2 ro "Maintenance until 05:15 UTC - T287454"
sudo dbctl config commit -m "Set s2 codfw as read-only for maintenance - T287454"
  • Check s2 is indeed read-only
  • Switch masters:
sudo DEBUG=1 db-switchover --skip-slave-move db2107 db2104
echo "===== db2107 (OLD)"; sudo mysql.py -h db2107 -e 'show slave status\G'
echo "===== db2104 (NEW)"; sudo mysql.py -h db2104 -e 'show slave status\G'
  • Promote NEW master in dbctl, and remove read-only
sudo dbctl --scope codfw section s2 set-master db2104
sudo dbctl --scope codfw section s2 rw
sudo dbctl config commit -m "Promote db2104 to s2 master and set section read-write T287454"
  • Restart puppet on both hosts (for heartbeat):
sudo cumin 'db2104* or db2107*' 'run-puppet-agent -e "master switchover T287454"'
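
The two `show slave status\G` calls in the "Switch masters" step have to be read by eye. A minimal sketch of a helper that does the same check mechanically is below. `replication_ok` is a hypothetical name, not part of db-switchover or mysql.py, and it only assumes the standard `Slave_IO_Running` and `Slave_SQL_Running` fields of the `\G` output. Which hosts are expected to be replicating after the switch depends on the topology, which this task does not spell out.

```shell
# Hypothetical helper (not part of the WMF tooling): reads the output of
# `show slave status\G` on stdin and succeeds only if both replication
# threads report "Yes". Empty input (no replication configured) fails.
replication_ok() {
    awk '
        # Trim leading and trailing whitespace so the field names line up.
        { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }
        /^Slave_IO_Running: /  { io  = $2 }
        /^Slave_SQL_Running: / { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

It could be used as `sudo mysql.py -h db2107 -e 'show slave status\G' | replication_ok && echo "db2107 replicating"`, reusing the mysql.py invocation from the checklist.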

Clean up tasks:

  • Change events for query killer:
events_coredb_master.sql on the new master db2104
events_coredb_slave.sql on the new slave db2107
sudo dbctl instance db2107 set-candidate-master --section s2 true
sudo dbctl instance db2104 set-candidate-master --section s2 false
  • Check tendril was updated
  • Check zarcillo was updated
  • Depool OLD master, as it's running 10.1, replicating from a 10.4 master
sudo dbctl instance db2107 depool
sudo dbctl config commit -m "Depool db2107 until it's reimaged to buster T287454"
  • Update/resolve this ticket.

Event Timeline

Marostegui moved this task from Triage to Blocked on the DBA board.
Marostegui updated the task description.

Change 710516 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2104 to s2 master

https://gerrit.wikimedia.org/r/710516

Change 710517 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s2-master CNAME

https://gerrit.wikimedia.org/r/710517

Mentioned in SAL (#wikimedia-operations) [2021-08-10T04:16:28Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2104 with weight 0 T287454', diff saved to https://phabricator.wikimedia.org/P16981 and previous config saved to /var/cache/conftool/dbconfig/20210810-041627-root.json

Mentioned in SAL (#wikimedia-operations) [2021-08-10T04:23:37Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454

Mentioned in SAL (#wikimedia-operations) [2021-08-10T04:23:56Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454

Change 711029 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2104 to s2 master

https://gerrit.wikimedia.org/r/711029

Change 710516 abandoned by Marostegui:

[operations/puppet@production] mariadb: Promote db2104 to s2 master

Reason:

Couldn't rebase this here, so pushing a new change instead: https://gerrit.wikimedia.org/r/711029

https://gerrit.wikimedia.org/r/710516

Change 711029 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2104 to s2 master

https://gerrit.wikimedia.org/r/711029

Mentioned in SAL (#wikimedia-operations) [2021-08-10T05:00:34Z] <marostegui> Starting s2 codfw failover from db2107 to db2104 - T287454

Mentioned in SAL (#wikimedia-operations) [2021-08-10T05:00:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T287454', diff saved to https://phabricator.wikimedia.org/P16982 and previous config saved to /var/cache/conftool/dbconfig/20210810-050051-root.json

Mentioned in SAL (#wikimedia-operations) [2021-08-10T05:06:04Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s2 as read-write again - master has not been swapped T287454', diff saved to https://phabricator.wikimedia.org/P16983 and previous config saved to /var/cache/conftool/dbconfig/20210810-050604-root.json

The maintenance wasn't completed: the db-switchover script never finished. This is the first time we have used it from codfw, so there might be things to look into. It didn't work from cumin in codfw either (using neither hostnames nor FQDNs).

Topology rolled back and GTID enabled. Puppet re-enabled on the master and the candidate master.
Going to do a data check.

Change 711114 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2104 to s2 master

https://gerrit.wikimedia.org/r/711114

Reserved another window in the Deployment calendar for 11th August at 05:00 AM UTC.

Mentioned in SAL (#wikimedia-operations) [2021-08-11T04:15:10Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454

Mentioned in SAL (#wikimedia-operations) [2021-08-11T04:15:27Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454

Mentioned in SAL (#wikimedia-operations) [2021-08-11T04:16:26Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2104 with weight 0 T287454', diff saved to https://phabricator.wikimedia.org/P16996 and previous config saved to /var/cache/conftool/dbconfig/20210811-041625-root.json

Change 711114 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2104 to s2 master

https://gerrit.wikimedia.org/r/711114

Mentioned in SAL (#wikimedia-operations) [2021-08-11T05:00:26Z] <marostegui> Starting s2 codfw failover from db2107 to db2104 - T287454

Mentioned in SAL (#wikimedia-operations) [2021-08-11T05:00:41Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T287454', diff saved to https://phabricator.wikimedia.org/P16997 and previous config saved to /var/cache/conftool/dbconfig/20210811-050040-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-08-11T05:10:41Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2104 to s2 master and set section read-write T287454', diff saved to https://phabricator.wikimedia.org/P16998 and previous config saved to /var/cache/conftool/dbconfig/20210811-051041-root.json

Change 710517 merged by Marostegui:

[operations/dns@master] wmnet: Update s2-master CNAME

https://gerrit.wikimedia.org/r/710517

Mentioned in SAL (#wikimedia-operations) [2021-08-11T05:18:56Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2107 T287454', diff saved to https://phabricator.wikimedia.org/P16999 and previous config saved to /var/cache/conftool/dbconfig/20210811-051856-marostegui.json

This was done.
RO start: 05:00
RO stop: 05:10

RO time: 10 minutes
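
As a cross-check, the read-only window can be computed from the SAL timestamps of the two dbctl commits on 11 August (read-only at 05:00:41Z, read-write at 05:10:41Z). This is a minimal sketch that assumes GNU date; `ro_window_seconds` is a hypothetical helper name.

```shell
# Sketch: length of a read-only window from two SAL timestamps.
# Assumes GNU date (for -d); ro_window_seconds is a hypothetical helper.
ro_window_seconds() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( end - start ))
}

# Timestamps of the read-only and read-write dbctl commits on 11 August.
ro_seconds=$(ro_window_seconds "2021-08-11T05:00:41Z" "2021-08-11T05:10:41Z")
echo "RO time: $(( ro_seconds / 60 )) min $(( ro_seconds % 60 )) s"
```

The two timestamps are 600 seconds apart, which matches the 10 minutes reported above.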

Mentioned in SAL (#wikimedia-operations) [2021-08-11T05:22:13Z] <marostegui> Stop replication on db2107 T287454

Change 711257 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2107: Disable notifications

https://gerrit.wikimedia.org/r/711257

Change 711257 merged by Marostegui:

[operations/puppet@production] db2107: Disable notifications

https://gerrit.wikimedia.org/r/711257

Marostegui updated the task description.

Closing this as it worked - further investigations will be tracked at T288500

Mentioned in SAL (#wikimedia-operations) [2021-09-16T15:04:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech on read-only for maintenance T287454', diff saved to https://phabricator.wikimedia.org/P17283 and previous config saved to /var/cache/conftool/dbconfig/20210916-150444-marostegui.json