
Failover m2 master db1065 to db1132
Closed, ResolvedPublic

Description

db1065 is currently the m2 master.
It is very old, out of warranty, and needs to be decommissioned (T217396: Decommission db1061-db1073); it also has 3 disks in predictive failure state.
I would like to fail it over to db1132 so we can finally get rid of it.

These are the databases on this host:

root@cumin1001:~# mysql.py -hdb1065 -e "show databases"
+--------------------+
| Database           |
+--------------------+
| debmonitor         |
| heartbeat          |
| iegreview          |
| information_schema |
| mysql              |
| otrs               |
| performance_schema |
| recommendationapi  |
| reviewdb           |
| scholarships       |
+--------------------+

The active ones appear to be:

debmonitor
otrs
recommendationapi

The other databases appear to be unused (there have been no writes for more than a year).
The failover should be easy, as it is just a matter of reloading the HAProxies so they point to the new master.
The impact is that the master will be in read-only mode for around 1 minute. Reads should remain unaffected.
Any objections to doing this failover on 9th July at around 06:00 UTC?
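For reference, a rough sketch of the switchover sequence, assuming the usual read_only flip plus a proxy reload (the 'A:m2-proxies' cumin alias below is a placeholder, not the real one):

# Stop writes on the old master (read-only window starts here)
mysql.py -hdb1065 -e "SET GLOBAL read_only = ON"
# Once db1132 has caught up on replication, allow writes on it
mysql.py -hdb1132 -e "SET GLOBAL read_only = OFF"
# Reload the proxies so m2-master points at db1132 (alias is a placeholder)
sudo cumin 'A:m2-proxies' 'systemctl reload haproxy'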

Operations after the switch T226952#5295258:

For debmonitor: it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect, given that the old master will still work, just in RO. It might need a:

sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'
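
To confirm whether that restart is actually needed after the switch, one option is to check for application connections still attached to the old master (the 'debmonitor' account name below is an assumption):

# Look for debmonitor threads still connected to db1065 (user name assumed)
mysql.py -hdb1065 -e "SELECT id, user, host, time FROM information_schema.processlist WHERE user = 'debmonitor'"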

Event Timeline

Marostegui triaged this task as Medium priority.Jul 1 2019, 6:16 AM
Marostegui moved this task from Triage to Pending comment on the DBA board.

debmonitor readonly time is not an issue, the debmonitor clients will simply retry the next time.

Because of the TTL mention, are you planning a failover of proxy at the same time?


Good point, I wasn't - I just didn't have enough coffee at the time. I will amend. Thanks for reviewing.

I am actually proposing to maybe do it, but it needs more work.

Let's leave it aside for now :-)

For debmonitor: it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect, given that the old master will still work, just in RO. It might need a:

sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'

just after the switch.
Alternatively, if you plan to kill all existing connections to the old master, that would already do the trick, because debmonitor will automatically reconnect and the proxy will send it to the new one.
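
A minimal sketch of that alternative, assuming the remaining application threads are killed directly on db1065 (the user filter is an assumption and would need adjusting to the real account names):

# Generate KILL statements for the remaining application threads on db1065
# (user filter is an assumption; review the output before running it)
mysql.py -hdb1065 -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE user NOT IN ('system user', 'event_scheduler', 'root')"
# Paste the generated KILL statements back into a mysql.py -hdb1065 session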

The few failures that will happen during the RO time and the restart will not block any operation on the clients. If the interrupted operation was the update of upgradable packages, which is triggered by the Puppet cron, it will fix itself at the next Puppet run. Any other operation (a package installed/removed/upgraded on a client at that time) will be caught up by the daily debmonitor crontab, which is there exactly for those cases.
So the TL;DR is that any discrepancy will be resolved within 24h (and if we really need it sooner we can force a run of the daily crontab via cumin, as sketched below).
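
A possible way to force that run on all hosts (the client entry point name is an assumption, not verified):

# Trigger an immediate debmonitor report from every client (command name assumed)
sudo cumin 'A:debmonitor' 'debmonitor-client'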

Change 519975 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1132 to m2 master

https://gerrit.wikimedia.org/r/519975

For debmonitor: it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect, given that the old master will still work, just in RO. It might need a:

sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'

just after the switch.

I can take care of that.

The etherpad with the procedure is ready for review.
The patch is also ready for review: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519975/

Note: db2044 needs upgrading

This was done

Mentioned in SAL (#wikimedia-operations) [2019-07-08T14:44:58Z] <marostegui> Restart MySQL on db1132 to enable performance_schema - T226952

Mentioned in SAL (#wikimedia-operations) [2019-07-09T05:13:17Z] <marostegui> Rebooting pc2010 for a second time as per papaul's suggestion T226952


Ignore this, it was for a different task

$ ./replication_tree.py db1065
db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991
+ db1117:3322, version: 10.1.39, up: 32d, RO: ON, binlog: MIXED, lag: 0, processes: 15, latency: 0.0423
+ db1132, version: 10.1.39, up: 14h, RO: ON, binlog: MIXED, lag: 0, processes: 16, latency: 0.0416
+ db2044, version: 10.1.39, up: 4d, RO: ON, binlog: MIXED, lag: 0, processes: None, latency: 0.0046
  + db2078:3322, version: 10.1.39, up: 47d, RO: ON, binlog: MIXED, lag: 0, processes: 14, latency: 0.0056

Change 519975 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1132 to m2 master

https://gerrit.wikimedia.org/r/519975

Mentioned in SAL (#wikimedia-operations) [2019-07-09T06:00:22Z] <marostegui> Failover m2 from db1065 to db1132 - T226952

This was done successfully.

Read only start: 06:00:31 UTC 2019
Read only stop (and proxies reloaded): 06:00:40 UTC 2019

Total read only time: 9 seconds.

Marostegui claimed this task.