
Switchover m1 master (db1164 -> db1119)
Closed, ResolvedPublic

Description

db1119 needs to be reimaged

When: Tuesday 14th at 08:00 AM UTC
Impact: read-only for a few seconds for the services below:

Services running on m1:

  • bacula
  • cas (and cas staging)
  • backups
  • etherpad
  • librenms
  • pki
  • rt

Switchover steps:

OLD MASTER: db1164

NEW MASTER: db1119

  • Check configuration differences between new and old master:
pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf
  • Enable notifications on db1119
  • Silence alerts on all hosts
  • Topology changes: move everything under db1119

db-switchover --timeout=1 --only-slave-move db1164.eqiad.wmnet db1119.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
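Before starting the failover, the cat above should show db1119 as the backend in the rendered proxy config. A minimal sketch of such a check against a sample config fragment (the fragment below is illustrative, not the real db-master.cfg):

```shell
# Illustrative fragment of the rendered proxy config; on the proxies this
# would come from /etc/haproxy/conf.d/db-master.cfg after the puppet run.
cfg='listen mariadb
    server db1119.eqiad.wmnet db1119.eqiad.wmnet:3306 check'

# Fail loudly if the intended new master is not the configured backend.
if printf '%s\n' "$cfg" | grep -q 'db1119'; then
  echo "master OK"
else
  echo "master MISSING"
fi
```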

  • Start the failover

!log Failover m1 from db1164 to db1119 - T350022

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1164 db1119
  • Reload haproxies
dbproxy1022:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1024:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
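The `show stat` output above is CSV, so one quick way to confirm which backend haproxy considers UP is to filter it. A sketch on a trimmed, illustrative sample (real output from the socket has many more columns, so the column index would need adjusting):

```shell
# Illustrative, trimmed "show stat" CSV; the real socket output carries
# dozens of columns with the status field further to the right.
stat_output='# pxname,svname,status
mariadb,db1119,UP
mariadb,db1164,MAINT'

# Print the server names whose status column reads UP.
printf '%s\n' "$stat_output" | awk -F, '!/^#/ && $3 == "UP" {print $2}'
```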
  • kill connections on the old master (db1164)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
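After pt-kill, a useful sanity check is that no client threads remain on the old master; only system threads should survive. A sketch of counting them from tab-separated SHOW PROCESSLIST output (the sample rows below are illustrative, not taken from db1164):

```shell
# Illustrative tab-separated "SHOW PROCESSLIST" output; in production this
# would come from the old master over the local socket.
processlist=$(printf 'Id\tUser\tHost\tdb\tCommand\tTime\tState\tInfo\n1\tsystem user\t\tNULL\tDaemon\t0\tInnoDB purge\tNULL\n42\tlibrenms\t10.64.0.5\tlibrenms\tSleep\t10\t\tNULL')

# Count client threads: skip the header row and MariaDB system threads.
printf '%s\n' "$processlist" | awk -F'\t' 'NR > 1 && $2 != "system user" {n++} END {print n+0}'
```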

  • Restart puppet on old and new masters (for heartbeat): db1119 and db1164

sudo cumin 'db1164* or db1119*' 'run-puppet-agent -e "primary switchover T350022"'

  • Check services affected (librenms, racktables, etherpad...)
  • Clean up the orchestrator heartbeat table to remove the old master's entry: sudo db-mysql db1119 heartbeat -e "delete from heartbeat where file like 'db1164%';"
  • Merge backup ticket: https://gerrit.wikimedia.org/r/c/operations/puppet/+/969753
  • Update/resolve phabricator ticket about failover
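The heartbeat cleanup in the steps above is a single DELETE; pairing it with a verification SELECT makes the step auditable. A hedged sketch that just renders both statements with the old master parameterised (OLD_MASTER is an illustrative helper variable, not part of the tooling):

```shell
# OLD_MASTER parameterises the statements from the cleanup step above.
OLD_MASTER=db1164

# Render the cleanup DELETE and a follow-up COUNT(*) to confirm zero rows remain.
printf "DELETE FROM heartbeat WHERE file LIKE '%s%%';\n" "$OLD_MASTER"
printf "SELECT COUNT(*) FROM heartbeat WHERE file LIKE '%s%%';\n" "$OLD_MASTER"
```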

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui updated the task description. (Show Details)
Marostegui moved this task from Triage to Ready on the DBA board.

@jcrespo what would be the least disruptive day to do this in terms of backups/bacula?

Tomorrow (or, if you need more time, one week later) would be a good day: backups will have run the night before (they usually finish at 5:20 UTC), and you are free to do any maintenance there.

If you verify that backups worked well (to make sure nothing is missing after the upgrade), you should be OK to go.

For bacula it will be a bit more complicated: during the first week there is usually a heavier load. So, if it can be done tomorrow, maybe wait until 11:00 UTC (or do it a week later). I can prepare the patches by then.

Change 969753 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover master from db1164 to db1119

https://gerrit.wikimedia.org/r/969753

Next week is probably better. I want to give enough time to the future master + backup source to make sure they are stable.

@jcrespo I am going to perform this on Tuesday 14th at 08:00 AM UTC (let me know if you'd prefer some other time)

Marostegui updated the task description. (Show Details)

Active proxy is dbproxy1022

Tomorrow I will add db1119 as standby host until Tuesday, so we can make sure haproxy sees it correctly.

Change 972921 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy102[2,4]: Promote db1119 to standby

https://gerrit.wikimedia.org/r/972921

@ABran-WMF https://gerrit.wikimedia.org/r/c/operations/puppet/+/972921 I would appreciate it if you could double-check the hostname/IP to make sure I didn't make a mistake.

Change 972921 merged by Marostegui:

[operations/puppet@production] dbproxy102[2,4]: Promote db1119 to standby

https://gerrit.wikimedia.org/r/972921

Merged, everything looks good so far. I will revert tomorrow to leave it as it would normally be for the weekend.

Change 973187 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1119: Enable notifications

https://gerrit.wikimedia.org/r/973187

Change 973187 merged by Marostegui:

[operations/puppet@production] db1119: Enable notifications

https://gerrit.wikimedia.org/r/973187

Change 973351 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1119 to m1 master

https://gerrit.wikimedia.org/r/973351

Change 973351 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1119 to m1 master

https://gerrit.wikimedia.org/r/973351

Mentioned in SAL (#wikimedia-operations) [2023-11-14T07:39:34Z] <jynus> stop bacula dir (and puppet) at backup1001 T350022

Mentioned in SAL (#wikimedia-operations) [2023-11-14T08:04:45Z] <marostegui> Failover m1 from db1164 to db1119 - T350022

Change 969753 merged by Marostegui:

[operations/puppet@production] dbbackups: Switchover master from db1164 to db1119

https://gerrit.wikimedia.org/r/969753

Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)

This was done; there were only a few seconds of read-only time.

For the record, etherpad was fine: no restart was required.
However, this was needed on pki1001:

root@pki1001:~# systemctl restart cfssl-ocsprefresh-debmonitor.service