Switchover s4 master (db1138 -> db1160)
Closed, ResolvedPublic

Description

When: During a pre-defined DBA maintenance window, Tuesday 12th April at 06:00 AM UTC

Affected wikis: commonswiki, testcommonswiki

Checklist:

NEW primary: db1160
OLD primary: db1138

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1138.eqiad.wmnet h=db1160.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s4 T304933" 'A:db-section-s4'
  • Set NEW primary with weight 0
sudo dbctl instance db1160 set-weight 0
sudo dbctl config commit -m "Set db1160 with weight 0 T304933"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move db1138 db1160
  • Disable puppet on both nodes
sudo cumin 'db1138* or db1160*' 'disable-puppet "primary switchover T304933"'
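
The prep steps above depend only on the old primary, the new primary, the section and the task ID. A minimal sketch, assuming a hypothetical helper named print_prep_commands (not part of any existing tooling), that prints the commands from this checklist for review without running them:

```shell
# Hypothetical helper: prints (does not execute) the failover prep
# commands from the checklist above, filled in for the given hosts.
# Usage: print_prep_commands OLD NEW SECTION TASK
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quote characters stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=15 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

print_prep_commands db1138 db1160 s4 T304933
```

Printing rather than executing keeps the operator in control of each step, which matches how the checklist is meant to be followed.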

Failover:

  • Log the failover:
!log Starting s4 eqiad failover from db1138 to db1160 - T304933
  • Set section read-only:
sudo dbctl --scope eqiad section s4 ro "Maintenance until 06:15 UTC - T304933"
sudo dbctl config commit -m "Set s4 eqiad as read-only for maintenance - T304933"
  • Check s4 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1138 db1160
echo "===== db1138 (OLD)"; sudo db-mysql db1138 -e 'show slave status\G'
echo "===== db1160 (NEW)"; sudo db-mysql db1160 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s4 set-master db1160
sudo dbctl --scope eqiad section s4 rw
sudo dbctl config commit -m "Promote db1160 to s4 primary and set section read-write T304933"
  • Restart puppet on both hosts:
sudo cumin 'db1138* or db1160*' 'run-puppet-agent -e "primary switchover T304933"'
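
The two `show slave status\G` calls above are normally read by eye. A small sketch, assuming the standard MariaDB `Master_Host` field in that output and a hypothetical helper named replication_source, that extracts which host a server replicates from. After a successful switch, db1138 (OLD) should report db1160 as its source, and db1160 (NEW) would normally report none:

```shell
# Hypothetical helper: reads `show slave status\G` output on stdin and
# prints the value of the Master_Host field (prints nothing if absent).
replication_source() {
    awk -F': ' '$1 ~ /^[ \t]*Master_Host$/ { print $2 }'
}

# Example usage (commands taken from the checklist above):
#   sudo db-mysql db1138 -e 'show slave status\G' | replication_source
#   sudo db-mysql db1160 -e 'show slave status\G' | replication_source
```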

Clean up tasks:

  • Clean up heartbeat table(s): delete from heartbeat.heartbeat where server_id=171978876
  • Change events for query killer:
events_coredb_master.sql on the new primary db1160
events_coredb_slave.sql on the new slave db1138
sudo dbctl instance db1138 set-candidate-master --section s4 true
sudo dbctl instance db1160 set-candidate-master --section s4 false
(dborch1001): sudo orchestrator-client -c untag -i db1160 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1138 --tag name=candidate
sudo dbctl instance db1138 depool
sudo dbctl config commit -m "Depool db1138 T304933"
  • Apply outstanding schema changes to db1138 (if any) T304933#7847187
  • Update/resolve this ticket.

Event Timeline

Marostegui updated the task description.
Marostegui added projects: DBA, User-notice.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Let's reboot db1160 before the switch to complete T303174

Change 775193 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1160: Disable notifications

https://gerrit.wikimedia.org/r/775193

Change 775193 merged by Marostegui:

[operations/puppet@production] db1160: Disable notifications

https://gerrit.wikimedia.org/r/775193

Let's reboot db1160 before the switch to complete T303174

done

I will do this Thursday 7th instead.

This will be done on Tuesday 12th

Change 778688 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1160 to s4 master

https://gerrit.wikimedia.org/r/778688

Change 778689 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s4 CNAME

https://gerrit.wikimedia.org/r/778689

LSobanski added a subscriber: LSobanski.

Mentioned in SAL (#wikimedia-operations) [2022-04-12T04:49:54Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 31 hosts with reason: Primary switchover s4 T304933

Mentioned in SAL (#wikimedia-operations) [2022-04-12T04:50:14Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Primary switchover s4 T304933

Mentioned in SAL (#wikimedia-operations) [2022-04-12T04:50:23Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1160 with weight 0 T304933', diff saved to https://phabricator.wikimedia.org/P24482 and previous config saved to /var/cache/conftool/dbconfig/20220412-045023-root.json

Change 778688 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1160 to s4 master

https://gerrit.wikimedia.org/r/778688

Mentioned in SAL (#wikimedia-operations) [2022-04-12T06:00:47Z] <marostegui> Starting s4 eqiad failover from db1138 to db1160 - T304933

Mentioned in SAL (#wikimedia-operations) [2022-04-12T06:00:57Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T304933', diff saved to https://phabricator.wikimedia.org/P24485 and previous config saved to /var/cache/conftool/dbconfig/20220412-060057-root.json

Mentioned in SAL (#wikimedia-operations) [2022-04-12T06:01:25Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T304933', diff saved to https://phabricator.wikimedia.org/P24486 and previous config saved to /var/cache/conftool/dbconfig/20220412-060125-root.json

Change 778689 merged by Marostegui:

[operations/dns@master] wmnet: Update s4 CNAME

https://gerrit.wikimedia.org/r/778689

Mentioned in SAL (#wikimedia-operations) [2022-04-12T06:06:29Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1138 T304933', diff saved to https://phabricator.wikimedia.org/P24487 and previous config saved to /var/cache/conftool/dbconfig/20220412-060628-root.json

Marostegui updated the task description.

This was done; read-only time was 28 seconds.

06:00:57
06:01:25
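
The 28-second figure is the gap between the two dbctl commits logged in the timeline (read-only at 06:00:57, read-write at 06:01:25). As a worked check, a sketch with a hypothetical helper named to_seconds, assuming both times fall on the same day:

```shell
# Hypothetical helper: converts HH:MM:SS to seconds since midnight.
# awk treats the fields as decimal numbers, so leading zeros are safe.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

ro_start=$(to_seconds 06:00:57)   # section set read-only
ro_end=$(to_seconds 06:01:25)     # section set read-write again
echo "read-only for $(( ro_end - ro_start )) seconds"
# prints: read-only for 28 seconds
```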

Pending schema changes will be tracked on the individual schema change tasks.