
Switchover m3 master db1159 -> db1119
Closed, Resolved · Public

Description

Databases on m3: phabricator
When: TBD
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1159

NEW MASTER: db1119

  • Check configuration differences between new and old master

$ pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf
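If pt-config-diff reports differences, a manual spot check of individual variables on both hosts can help confirm what actually differs (illustrative sketch; credentials/defaults-file handling is omitted and would follow local conventions):

$ mysql -h db1159.eqiad.wmnet -e "SELECT @@hostname, @@read_only, @@binlog_format, @@gtid_domain_id"
$ mysql -h db1119.eqiad.wmnet -e "SELECT @@hostname, @@read_only, @@binlog_format, @@gtid_domain_id"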

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --hours 1 -r "m3 master switchover T352149" 'A:db-section-m3'
  • Topology changes: move everything under db1119

db-switchover --timeout=15 --only-slave-move db1159.eqiad.wmnet db1119.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
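To double-check the topology move, a remaining replica can be inspected for its new replication source (sketch only; db1217 is used as an illustrative m3 replica and credential handling is assumed):

$ mysql -h db1217.eqiad.wmnet -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Host|Seconds_Behind_Master'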

  • Start the failover: !log Failover m3 from db1159 to db1119 - T352149
  • Set phabricator in RO:
    ssh phab1004
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # perform the database switchover (next step)
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
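    # Optional sanity check (a sketch; bin/config get prints the stored value):
    sudo /srv/phab/phabricator/bin/config get cluster.read-only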
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1159 db1119
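A quick read-only check on both hosts can confirm the roles flipped as expected (sketch; credential handling assumed; the expectation is read_only=0 on db1119 and read_only=1 on db1159):

root@cumin1001:~# mysql -h db1119.eqiad.wmnet -e "SELECT @@hostname, @@read_only"
root@cumin1001:~# mysql -h db1159.eqiad.wmnet -e "SELECT @@hostname, @@read_only"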

  • Reload haproxies
dbproxy1026:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
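# Optional (sketch, assuming the haproxy backend server names contain the hostname):
# grep the stat output for db1119 to confirm the proxy now points at the new master
dbproxy1026:   echo "show stat" | socat /run/haproxy/haproxy.sock stdio | grep db1119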
  • Kill connections on the old master (db1159)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
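After pt-kill, the remaining sessions on the old master can be reviewed to confirm only replication and monitoring threads are left (sketch; socket path taken from the pt-kill command above):

root@db1159:~# mysql -S /run/mysqld/mysqld.sock -e "SHOW PROCESSLIST"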

  • Restart puppet on the old and new masters (db1119 and db1159) for heartbeat: sudo cumin 'db1119* or db1159*' 'run-puppet-agent -e "primary switchover T352149"'
  • Check services affected: phabricator
  • Clean up orchestrator heartbeat to remove the old master's entry, otherwise Orchestrator will show lag: delete from heartbeat where server_id=171966512;
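A minimal sketch of running that cleanup, assuming the standard heartbeat schema (database and table both named heartbeat) and local credential handling on the new master:

root@db1119:~# mysql -S /run/mysqld/mysqld.sock heartbeat -e "DELETE FROM heartbeat WHERE server_id=171966512"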

Event Timeline

Marostegui triaged this task as Medium priority. Nov 28 2023, 9:20 AM
Marostegui created this task.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a project: Phabricator.

Just FYI, I will be putting phabricator in RO for a few seconds during this week (early in a European morning) to switch over the master.

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:41:23Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:41:43Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Change 978721 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1119 to m3 master

https://gerrit.wikimedia.org/r/978721

Change 978721 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1119 to m3 master

https://gerrit.wikimedia.org/r/978721

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:47:24Z] <marostegui> Failover m3 from db1159 to db1119 - T352149

This was done. The RO time was just a few seconds.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1159.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1159.eqiad.wmnet with OS bookworm completed:

  • db1159 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311300608_marostegui_2070516_db1159.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-12-01T05:31:59Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Mentioned in SAL (#wikimedia-operations) [2023-12-01T05:32:17Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149