
Failover m1 primary db from db1128 to db1164
Closed, Resolved, Public

Description

Databases on m1:

+--------------------+
| Database           |
+--------------------+
| bacula9            |
| cas                |
| cas_staging        |
| dbbackups          |
| etherpadlite       |
| heartbeat          |
| information_schema |
| librenms           |
| mysql              |
| percona            |
| performance_schema |
| pki                |
| racktables         |
| rddmarc            |
| rt                 |
| sys                |
+--------------------+
16 rows in set (0.002 sec)

When: Soon
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1128

NEW MASTER: db1164

  • Check configuration differences between new and old master

$ pt-config-diff h=db1128.eqiad.wmnet,F=/root/.my.cnf h=db1164.eqiad.wmnet,F=/root/.my.cnf

db-switchover --timeout=15 --only-slave-move db1128.eqiad.wmnet db1164.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m1 from db1128 to db1164 - T309296
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --read-only-master --skip-slave-move db1128 db1164

  • Reload haproxies (dbproxy1012 is the active one)
dbproxy1012:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1128)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1128 and db1164

puppet agent --enable && puppet agent -tv

  • Check affected services
  • Clean orchestrator heartbeat to remove the old master's entry, otherwise Orchestrator will show lag: delete from heartbeat where server_id = 171966562
  • If everything looks good afterwards, move backups to use the new host: merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/799894
  • Close this ticket and create a ticket to move db1128 to s1 T309303
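
For the haproxy reload step above, the raw "show stat" output is a wide CSV that is hard to scan by eye. A minimal sketch of a filter that prints only the proxy, server and status of each row. The function name haproxy_status is made up for illustration and is not part of any existing tooling. It assumes the standard "show stat" CSV layout, with a "# pxname,svname,..." header line.

```shell
#!/bin/sh
# Hypothetical helper: read HAProxy "show stat" CSV on stdin and print
# "<proxy>/<server> <status>" for every data row.
# Columns are looked up by name from the "# pxname,svname,..." header,
# so the filter does not rely on a fixed field position.
haproxy_status() {
    awk -F, '
        /^# / {
            # Header row: strip the leading "# " and remember each column index.
            sub(/^# /, "", $1)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        NF > 1 && ("status" in col) {
            print $(col["pxname"]) "/" $(col["svname"]) " " $(col["status"])
        }
    '
}
```

On a proxy it could be used as: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status. After the reload, the new master (db1164) should be the server reported as UP.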

Event Timeline

Marostegui triaged this task as High priority.
Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-05-26T11:13:12Z] <marostegui> Failover m1 from db1128 to db1164 - T309296

Change 799930 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Update db1128 and db1164 status

https://gerrit.wikimedia.org/r/799930

Change 799930 merged by Marostegui:

[operations/puppet@production] mariadb: Update db1128 and db1164 status

https://gerrit.wikimedia.org/r/799930

Services look fine so far. Jaime is going to run a couple of backup tests to confirm everything is fine on that side too.

Bacula is fine. Waiting to confirm that everything else is OK before deploying the dbbackups patch.

jcrespo updated the task description.

Patch deployed, backup started, task created: T309303