Switchover m2 master (db1183 -> db1159)
Closed, Resolved (Public)

Description

db1183 needs to be reimaged to Bullseye.
Let's promote db1159 to master.

When: Thursday 3rd Feb at 9:00AM UTC
Impact: Read only for a few seconds on the services below:

Services running on m2:

  • otrs
  • debmonitor
  • xhgui
  • recommendationapi
  • iegreview
  • scholarships
  • sockpuppet
  • mwaddlink

Switchover steps:

OLD MASTER: db1183

NEW MASTER: db1159

  • Check configuration differences between new and old master

pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1183.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1159

db-switchover --timeout=1 --only-slave-move db1183.eqiad.wmnet db1159.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m2 from db1183 to db1159 - T300329

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1183 db1159

  • Reload haproxies

dbproxy1013:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1015:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1183)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1183 and db1159

puppet agent --enable && run-puppet-agent

  • Check services affected (otrs, debmonitor etc)
  • Clean up the orchestrator heartbeat to remove the old master's entry:
    • server_id=171970778
  • Create floating ticket for db1183 to be moved to m3: T300835
  • Update/resolve phabricator ticket about failover
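
For anyone reusing this checklist, the commands above can be collected into a dry-run helper. This is only a sketch: the function name print_plan and the DOMAIN and PROXIES variables are made up for illustration. It prints the commands in checklist order and executes nothing. Steps without an explicit command or target host in the checklist (silencing alerts, the puppet runs, the heartbeat cleanup) are deliberately left out.

```shell
#!/bin/bash
# Dry-run sketch: print the switchover commands from the checklist in order.
# Nothing is executed here; the operator still runs each step by hand.
set -euo pipefail

DOMAIN="eqiad.wmnet"                 # assumption: both masters are in eqiad
PROXIES=(dbproxy1013 dbproxy1015)    # proxies listed in the checklist

# print_plan OLD_MASTER NEW_MASTER (hypothetical helper, not part of wmfmariadbpy)
print_plan() {
    local old="$1" new="$2" proxy

    # 1. Compare configuration of new and old master
    printf 'pt-config-diff h=%s.%s,F=/root/.my.cnf h=%s.%s,F=/root/.my.cnf\n' \
        "$new" "$DOMAIN" "$old" "$DOMAIN"

    # 2. Topology change: move replicas under the new master
    printf 'db-switchover --timeout=1 --only-slave-move %s.%s %s.%s\n' \
        "$old" "$DOMAIN" "$new" "$DOMAIN"

    # 3. The failover itself
    printf 'db-switchover --skip-slave-move %s %s\n' "$old" "$new"

    # 4. Reload each proxy and show its stats
    for proxy in "${PROXIES[@]}"; do
        printf '%s: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio\n' "$proxy"
    done

    # 5. Kill leftover connections on the old master
    printf '%s: pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock\n' "$old"
}

print_plan db1183 db1159
```

Calling print_plan with a different pair of hosts prints the same sequence for another switchover, provided the proxy list and domain are adjusted to match.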

Event Timeline

Marostegui triaged this task as Medium priority. Jan 28 2022, 6:49 AM
Marostegui updated the task description.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui added subscribers: dpifke, hnowlan, kostajh and 3 others.

@bd808 @Krinkle @MoritzMuehlenhoff @kostajh @hnowlan @dpifke I would like to fail over the m2 master on Thursday 3rd Feb at 9:00AM UTC. Writes will be blocked for a few seconds (reads will not be affected).
This will affect the following services:

  • otrs
  • debmonitor
  • xhgui
  • recommendationapi
  • iegreview
  • scholarships
  • sockpuppet
  • mwaddlink

If no one says otherwise in the next few days, I will get this scheduled for that date. Thanks!

Sounds good to me (for mwaddlink).

Nothing to do for debmonitor, it will reconnect automatically to the new host, so anytime is good.

No concerns as regards sockpuppet, it is not currently receiving writes afaik.

Adding @Arnoldokoth as well for vrts (formerly named otrs). The software will automatically reconnect to the new host, but best to keep an eye out.

As far as recommendationapi and mwaddlink go: for the former we know it will not have issues; for the latter, I have no recollection of having tested this. While I don't expect surprises, it'll be good to keep an eye out for any. I am adding @kostajh and @Tgr. I'll be around too.

Finally, I think iegreview is being undeployed. I see @bd808 is already subscribed, adding @Dzahn as well.

No objections from xhgui owners, the service is low traffic and should automatically reconnect.

  • iegreview

PHP app without persistent connections, so it should recover fine. It doesn't look like there are any active grant rounds being managed there right now, so probably nobody will even notice.

  • scholarships

T243037: Shutdown scholarships.wikimedia.org and archive project

Thanks everyone! I will get this scheduled for Thursday 3rd Feb at 9:00AM UTC

Change 759222 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1159 to m2 master

https://gerrit.wikimedia.org/r/759222

Mentioned in SAL (#wikimedia-operations) [2022-02-03T07:23:30Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2078,2133].codfw.wmnet,db[1117,1159,1183].eqiad.wmnet with reason: Switchover m2 T300329

Mentioned in SAL (#wikimedia-operations) [2022-02-03T07:23:34Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2078,2133].codfw.wmnet,db[1117,1159,1183].eqiad.wmnet with reason: Switchover m2 T300329

Change 759222 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1159 to m2 master

https://gerrit.wikimedia.org/r/759222

All pre-failover steps are done

Mentioned in SAL (#wikimedia-operations) [2022-02-03T09:00:02Z] <marostegui> Failover m2 from db1183 to db1159 - T300329

Marostegui updated the task description.
Marostegui updated the task description.
Marostegui updated the task description.
Marostegui added a subscriber: jcrespo.

All done, thanks a lot @akosiaris @jcrespo for the support!