
Switchover m3 master db1159 -> db1119
Closed, Resolved · Public

Description

Databases on m3: phabricator
When: TBD
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1159

NEW MASTER: db1119

  • Check configuration differences between new and old master

$ pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf
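pt-config-diff prints any settings that differ between the two servers, so the check can be gated in a script; a minimal sketch, assuming the tool's documented convention of a zero exit status when the configs match:

$ pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf || echo "differences found - review before proceeding"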

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --hours 1 -r "m3 master switchover T352149" 'A:db-section-m3'
  • Topology changes: move everything under db1119

db-switchover --timeout=15 --only-slave-move db1159.eqiad.wmnet db1119.eqiad.wmnet
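Once the replicas have been moved, it is worth confirming they now hang off db1119; a minimal check, assuming direct MySQL access from the cumin host and that report_host is set on the replicas so they appear in the output:

$ mysql -h db1119.eqiad.wmnet -e "SHOW SLAVE HOSTS;"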

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
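After the Puppet run regenerates the proxy config, a quick grep confirms the new master is referenced (assuming the backend is listed by hostname):

$ grep db1119 /etc/haproxy/conf.d/db-master.cfg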

  • Start the failover: !log Failover m3 from db1159 to db1119 - T352149
  • Set phabricator to read-only:
ssh phab1004
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # perform the DB switchover (next step) before re-enabling writes
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
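The flag can be spot-checked before re-enabling writes; a minimal check, assuming Phabricator's standard config CLI on phab1004:

    sudo /srv/phab/phabricator/bin/config get cluster.read-only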
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1159 db1119
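A quick sanity check after the switchover is that the read_only flag flipped on both hosts (the old master should now be read-only, the new one writable); a minimal sketch, assuming direct MySQL access from the cumin host:

$ for h in db1159 db1119; do mysql -h $h.eqiad.wmnet -e "SELECT @@hostname, @@global.read_only;"; done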

  • Reload haproxies
dbproxy1026:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
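The piped "show stat" output is CSV; to spot-check that the backend is UP and pointing at the new master, the interesting columns can be pulled out (a minimal sketch, assuming the standard haproxy stats CSV layout where fields 1, 2 and 18 are the proxy name, server name and status):

echo "show stat" | socat /run/haproxy/haproxy.sock stdio | cut -d, -f1,2,18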
  • Kill connections on the old master (db1159)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
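pt-kill with --victims all --match-all terminates every client connection on the local socket, leaving the old master quiesced; afterwards a processlist check should show only replication and system threads (a minimal check, assuming socket access on db1159):

$ mysql -S /run/mysqld/mysqld.sock -e "SHOW PROCESSLIST;"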

  • Run Puppet on the old and new masters (db1159 and db1119) to update pt-heartbeat: sudo cumin 'db1119* or db1159*' 'run-puppet-agent -e "primary switchover T352149"'
  • Check services affected: phabricator
  • Clean the Orchestrator heartbeat table to remove the old master's entry, otherwise Orchestrator will show lag: delete from heartbeat where server_id=171966512;
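The server_id can be confirmed against the old master before deleting its row; a minimal sketch, assuming pt-heartbeat's default heartbeat.heartbeat table and that the delete runs on the new master so it replicates to the rest of the section:

$ mysql -h db1159.eqiad.wmnet -e "SELECT @@server_id;"
$ mysql -h db1119.eqiad.wmnet heartbeat -e "DELETE FROM heartbeat WHERE server_id=171966512;"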

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a project: Phabricator.

Just FYI, I will be putting phabricator in RO for a few seconds this week (early in a European morning) to switch over the master.

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:41:23Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:41:43Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Change 978721 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1119 to m3 master

https://gerrit.wikimedia.org/r/978721

Change 978721 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1119 to m3 master

https://gerrit.wikimedia.org/r/978721

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:47:24Z] <marostegui> Failover m3 from db1159 to db1119 - T352149

This was done. The RO time was just a few seconds.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1159.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1159.eqiad.wmnet with OS bookworm completed:

  • db1159 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311300608_marostegui_2070516_db1159.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-12-01T05:31:59Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Mentioned in SAL (#wikimedia-operations) [2023-12-01T05:32:17Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149