Switchover s8 master (db1109 -> db1104)
Closed, ResolvedPublic

Description

When: During the pre-defined window on Thursday, 21 April 2022

Checklist:

NEW primary: db1104
OLD primary: db1109

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1109.eqiad.wmnet h=db1104.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s8 T303927" 'A:db-section-s8'
  • Set NEW primary with weight 0
sudo dbctl instance db1104 set-weight 0
sudo dbctl config commit -m "Set db1104 with weight 0 T303927"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1109 db1104
  • Disable puppet on both nodes
sudo cumin 'db1109* or db1104*' 'disable-puppet "primary switchover T303927"'
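
The prep steps above vary between switchovers only in the two host names, the section and the task ID. As a rough sketch of that parametrisation, the function below prints (and does not run) the same commands from four arguments. The function name `print_prep_commands` and its argument order are hypothetical; the commands themselves are copied from the checklist.

```shell
# Print (do not execute) the failover prep commands for one switchover.
# Hypothetical helper: the name and argument order are not from the task.
# Usage: print_prep_commands OLD NEW SECTION TASK
print_prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Unquoted heredoc: variables expand, quote characters stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

Calling `print_prep_commands db1109 db1104 s8 T303927` reproduces the five command lines listed above, which can then be reviewed and pasted one at a time.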

Failover:

  • Log the failover:
!log Starting s8 eqiad failover from db1109 to db1104 - T303927
  • Set section read-only:
sudo dbctl --scope eqiad section s8 ro "Maintenance until 06:15 UTC - T303927"
sudo dbctl config commit -m "Set s8 eqiad as read-only for maintenance - T303927"
  • Check s8 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1109 db1104
echo "===== db1109 (OLD)"; sudo db-mysql db1109 -e 'show slave status\G'
echo "===== db1104 (NEW)"; sudo db-mysql db1104 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s8 set-master db1104
sudo dbctl --scope eqiad section s8 rw
sudo dbctl config commit -m "Promote db1104 to s8 primary and set section read-write T303927"
  • Restart puppet on both hosts:
sudo cumin 'db1109* or db1104*' 'run-puppet-agent -e "primary switchover T303927"'
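
The two `show slave status\G` calls in the switch step are read by eye. A minimal sketch of checking them mechanically is shown below, assuming the usual `Field: value` layout of `\G` output and the standard `Slave_IO_Running`, `Slave_SQL_Running` and `Master_Host` fields. The helper names are hypothetical and are not part of `db-switchover` or `db-mysql`.

```shell
# Read `show slave status\G` output on stdin.
# Succeeds (exit 0) only if both replication threads report "Yes".
# Hypothetical helper, not part of the WMF tooling.
replication_running() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }            # strip the \G left padding
        $1 == "Slave_IO_Running"  { io  = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Print the host the instance is replicating from (empty if none is listed).
replication_master() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }
        $1 == "Master_Host" { print $2 }
    '
}
```

For example, `sudo db-mysql db1109 -e 'show slave status\G' | replication_master` would be expected to print the new primary once the old one has been moved under it.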

Clean up tasks:

  • Clean up heartbeat table(s): delete from heartbeat.heartbeat where server_id=171978924
  • Change events for query killer:
events_coredb_master.sql on the new primary db1104
events_coredb_slave.sql on the new slave db1109
sudo dbctl instance db1109 set-candidate-master --section s8 true
sudo dbctl instance db1104 set-candidate-master --section s8 false
(dborch1001): sudo orchestrator-client -c untag -i db1104 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1109 --tag name=candidate
sudo dbctl instance db1109 depool
sudo dbctl config commit -m "Depool db1109 T303927"
  • Apply outstanding schema changes to db1109 (if any) T303927#7870289
  • Update/resolve this ticket.
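
On the heartbeat clean-up: read as a 32-bit IPv4 integer, the `server_id` 171978924 decodes to 10.64.48.172. The task does not say how the ID is assigned, so treat the IP-derived convention as an assumption and verify the value on the host (for example with `select @@server_id`) before deleting anything. The conversion can be sketched as below; both helper names are hypothetical.

```shell
# Convert a dotted-quad IPv4 address to its 32-bit integer form.
# Assumption (not stated in the task): server_id follows this convention.
ipv4_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Inverse: convert a 32-bit integer back to dotted-quad form.
int_to_ipv4() {
    n="$1"
    echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}
```

Running `int_to_ipv4 171978924` gives the address to compare against the old primary before running the `delete`.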

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui added a project: DBA.
Marostegui updated the task description.
Marostegui added a subscriber: Ladsgroup.

@Ladsgroup we can do this one together.

Marostegui moved this task from Triage to Ready on the DBA board.

> @Ladsgroup we can do this one together.

😭😭😭😭😭😭😭😭😭

One notable difference in the config is this:

innodb_adaptive_hash_i... ON                        OFF
innodb_checksum_algorithm crc32                     full_crc32

Is that fine @Marostegui ?

Change 784678 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/dns@master] wmnet: Update s8-master CNAME

https://gerrit.wikimedia.org/r/784678

ladsgroup@db1104:~$ sudo uname -v
#1 SMP Debian 5.10.106-1 (2022-03-17)

It's already rebooted.

Change 784681 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Promote db1104 to s8 master

https://gerrit.wikimedia.org/r/784681

> One notable difference in the config is this:
>
> innodb_adaptive_hash_i... ON                        OFF
> innodb_checksum_algorithm crc32                     full_crc32
>
> Is that fine @Marostegui ?

Yes, that's the new one

@Ladsgroup Given that Friday is a holiday, I would not reimage db1109 right after the switch, but rather do it on Monday, in case it doesn't boot up or there are issues during the installation. It wouldn't be nice to spend the weekend without the candidate. We can run the pending schema changes in the meantime.

Mentioned in SAL (#wikimedia-operations) [2022-04-21T05:01:08Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 31 hosts with reason: Primary switchover s8 T303927

Mentioned in SAL (#wikimedia-operations) [2022-04-21T05:01:29Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Primary switchover s8 T303927

Mentioned in SAL (#wikimedia-operations) [2022-04-21T05:01:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1104 with weight 0 T303927', diff saved to https://phabricator.wikimedia.org/P25872 and previous config saved to /var/cache/conftool/dbconfig/20220421-050154-ladsgroup.json

Change 784681 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1104 to s8 master

https://gerrit.wikimedia.org/r/784681

Mentioned in SAL (#wikimedia-operations) [2022-04-21T06:00:14Z] <Amir1> Starting s8 eqiad failover from db1109 to db1104 - T303927

Mentioned in SAL (#wikimedia-operations) [2022-04-21T06:00:23Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T303927', diff saved to https://phabricator.wikimedia.org/P25883 and previous config saved to /var/cache/conftool/dbconfig/20220421-060023-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-04-21T06:01:07Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1104 to s8 primary and set section read-write T303927', diff saved to https://phabricator.wikimedia.org/P25884 and previous config saved to /var/cache/conftool/dbconfig/20220421-060106-ladsgroup.json

Change 784678 merged by Marostegui:

[operations/dns@master] wmnet: Update s8-master CNAME

https://gerrit.wikimedia.org/r/784678

Mentioned in SAL (#wikimedia-operations) [2022-04-21T06:05:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1109 T303927', diff saved to https://phabricator.wikimedia.org/P25885 and previous config saved to /var/cache/conftool/dbconfig/20220421-060512-root.json

Change 785071 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1109: Disable notifications

https://gerrit.wikimedia.org/r/785071

Change 785071 merged by Marostegui:

[operations/puppet@production] db1109: Disable notifications

https://gerrit.wikimedia.org/r/785071

The schema changes will be applied and tracked in their own tasks.

RO 06:00:24 - 06:01:07 UTC (43 seconds of RO)
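
The 43-second figure is simply the difference between the two timestamps. A small sketch of that arithmetic follows, assuming both times fall on the same UTC day and use two-digit `HH:MM:SS` fields. The function names are hypothetical.

```shell
# Convert a two-digit-field HH:MM:SS timestamp to seconds since midnight.
hms_to_seconds() {
    old_ifs=$IFS; IFS=:
    set -- $1            # split into hours, minutes, seconds
    IFS=$old_ifs
    # Strip one leading zero so "08" and "09" are not parsed as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Length of a read-only window in seconds (same-day timestamps only).
# Usage: ro_duration START END
ro_duration() {
    echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}

ro_duration 06:00:24 06:01:07   # prints 43
```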