Switchover db1115 -> db1215
Closed, Resolved (Public)

Description

db1115 needs to be decommissioned.

Affected wikis: None
Affected services: Orchestrator, Prometheus, zarcillo

Checklist:

NEW primary: db1215
OLD primary: db1115

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1115.eqiad.wmnet h=db1215.eqiad.wmnet
  • Reboot db1215 so it picks up the latest kernel

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover db_inventory T335014" 'A:db-inventory'
  • Topology changes: move all replicas under the NEW primary:
sudo db-switchover --timeout=25 --only-slave-move db1115 db1215
  • Disable puppet on both nodes:
sudo cumin 'db1115* or db1215*' 'disable-puppet "primary switchover T335014"'

Failover:

  • Log the failover:
!log Starting db-inventory eqiad failover from db1115 to db1215 - T335014
  • Switch primaries:
db-mysql db1115 -e "set global read_only=1;"
db1115: systemctl stop pt-heartbeat-wikimedia.service
db-mysql db1215 -e "show master status\G" # grab the position
db-mysql db1215 -e "stop slave; reset slave all;"
db-mysql db1115 -e "change master to master_host='db1215.eqiad.wmnet', master_port=3306, master_ssl=1, master_log_file='XX', master_log_pos=XX, master_user='repl', master_password='XX'; start slave;"
db-mysql db1115 -e "show slave status\G"
db-mysql db1215 -e "set global read_only=0;"
db-mysql db1115 -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;"
  • Restart puppet on both hosts:
sudo cumin 'db1115* or db1215*' 'run-puppet-agent -e "primary switchover T335014"'
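
The "grab the position" step above is manual, and a copy-paste slip there is easy to make. The following is a minimal sketch, not part of the actual procedure, of how the file and position could be pulled out of the `show master status\G` output and substituted into the `change master` statement. The function names are hypothetical. It assumes db-mysql prints the usual vertical `Field: value` layout, and it leaves the password as the `XX` placeholder used in the checklist.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; none of these exist in the WMF tooling.

# Read `SHOW MASTER STATUS\G` text on stdin and print "<file> <position>".
# Fails (exit 1) if either field is missing.
parse_master_status() {
    awk -F': *' '
        /^ *File:/     { file = $2 }
        /^ *Position:/ { pos = $2 }
        END {
            if (file == "" || pos == "") exit 1
            print file, pos
        }'
}

# Print the CHANGE MASTER statement from the checklist with the
# binlog coordinates filled in. The password stays a placeholder.
build_change_master() {
    local file="$1" pos="$2"
    printf "change master to master_host='db1215.eqiad.wmnet', master_port=3306, master_ssl=1, master_log_file='%s', master_log_pos=%s, master_user='repl', master_password='XX'; start slave;\n" "$file" "$pos"
}

# Read `SHOW SLAVE STATUS\G` text on stdin and succeed only if both
# replication threads report "Yes".
replica_threads_running() {
    awk -F': *' '
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }'
}

# Example wiring (assumed, not run here):
#   read -r file pos < <(db-mysql db1215 -e "show master status\G" | parse_master_status)
#   build_change_master "$file" "$pos"
#   db-mysql db1115 -e "show slave status\G" | replica_threads_running
```

The helpers only read text and print a statement. Nothing is executed against a host, so the operator can still review the generated statement before running it.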

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1215 heartbeat -e "delete from heartbeat where file like 'db1115%';"
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'tendril';"
  • Update all the services at: T334455
  • Update/resolve this ticket.

Event Timeline

Marostegui created this task.
Marostegui moved this task from Triage to In progress on the DBA board.

This needs changing as db-switchover won't work. I will re-write this with the manual steps required.

I need to reboot db1215 before the switch so that it picks up the latest kernel before becoming a master.

Change 917303 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1215: Enable notifications

https://gerrit.wikimedia.org/r/917303

Change 917303 merged by Marostegui:

[operations/puppet@production] db1215: Enable notifications

https://gerrit.wikimedia.org/r/917303

Change 917323 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1215 to zarcillo master

https://gerrit.wikimedia.org/r/917323

Mentioned in SAL (#wikimedia-operations) [2023-05-09T05:24:19Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Primary switchover db_inventory T335014

Mentioned in SAL (#wikimedia-operations) [2023-05-09T05:24:34Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2185.codfw.wmnet,db[1115,1215].eqiad.wmnet with reason: Primary switchover db_inventory T335014

Change 917323 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1215 to zarcillo master

https://gerrit.wikimedia.org/r/917323

Mentioned in SAL (#wikimedia-operations) [2023-05-09T05:28:41Z] <marostegui> Starting db-inventory eqiad failover from db1115 to db1215 - T335014

Marostegui updated the task description.