
Failover m2 master db1065 to db1132
Closed, ResolvedPublic

Description

db1065 is currently the m2 master.
It is very old, out of warranty, and needs to be decommissioned (T217396: Decommission db1061-db1073); it also has 3 disks in predictive failure state.
I would like to fail it over to db1132 so we can finally get rid of it.

These are the databases on this host:

root@cumin1001:~# mysql.py -hdb1065 -e "show databases"
+--------------------+
| Database           |
+--------------------+
| debmonitor         |
| heartbeat          |
| iegreview          |
| information_schema |
| mysql              |
| otrs               |
| performance_schema |
| recommendationapi  |
| reviewdb           |
| scholarships       |
+--------------------+

The active ones appear to be:

debmonitor
otrs
recommendationapi

The other databases appear to be unused (there have been no writes for more than a year).
The failover should be easy, as it is just a matter of reloading the HAProxies so they point to the new master.
The impact is that the master will be in read-only mode for around 1 minute. Reads should remain unaffected.
Any objections to doing this failover on 9th July at around 06:00 UTC?
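For reference, a rough sketch of the switchover sequence, assuming the usual read_only flip plus a proxy reload (the 'A:m2-proxies' cumin alias below is a placeholder, not the real one):

# Stop writes on the old master (read-only window starts here)
mysql.py -hdb1065 -e "SET GLOBAL read_only = ON"
# Once db1132 has caught up on replication, allow writes on it
mysql.py -hdb1132 -e "SET GLOBAL read_only = OFF"
# Reload the proxies so m2-master points at db1132 (alias is a placeholder)
sudo cumin 'A:m2-proxies' 'systemctl reload haproxy'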

Operations after the switch T226952#5295258:

For debmonitor: it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect, given that the old master will still work, just in RO. It might need a:

sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'
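
To confirm whether that restart is actually needed after the switch, one option is to check for application connections still attached to the old master (the 'debmonitor' account name below is an assumption):

# Look for debmonitor threads still connected to db1065 (user name assumed)
mysql.py -hdb1065 -e "SELECT id, user, host, time FROM information_schema.processlist WHERE user = 'debmonitor'"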

Event Timeline

Marostegui triaged this task as Medium priority.Jul 1 2019, 6:16 AM
Marostegui moved this task from Triage to Pending comment on the DBA board.

debmonitor readonly time is not an issue, the debmonitor clients will simply retry the next time.

Because of the TTL mention, are you planning a failover of proxy at the same time?


Good point, I wasn't - I just didn't have enough coffee at the time. I will amend. Thanks for reviewing.

I am actually proposing to maybe do it, but it needs more work.

Let's leave it aside for now :-)

For debmonitor: it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect, given that the old master will still work, just in RO. It might need a:

sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'

just after the switch.
Alternatively, if you plan to kill all existing connections to the old master, that would already do the trick, because debmonitor will automatically reconnect and the proxy will send it to the new one.
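
A minimal sketch of that alternative, assuming the remaining application threads are killed directly on db1065 (the user filter is an assumption and would need adjusting to the real account names):

# Generate KILL statements for the remaining application threads on db1065
# (user filter is an assumption; review the output before running it)
mysql.py -hdb1065 -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE user NOT IN ('system user', 'event_scheduler', 'root')"
# Paste the generated KILL statements back into a mysql.py -hdb1065 session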

The few failures that will happen during the RO time and the restart will not block any operation on the clients. If the interrupted operation was the update of upgradable packages, which is triggered by the Puppet cron, it will fix itself at the next Puppet run. Any other operation (a package installed/removed/upgraded on a client at that time) will be caught up by the daily debmonitor crontab, which is there exactly for those cases.
So the TL;DR is that any discrepancy will be resolved within 24h (and if we really need it sooner we can force a run of the daily crontab via cumin, as sketched below).
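
A possible way to force that run on all hosts (the client entry point name is an assumption, not verified):

# Trigger an immediate debmonitor report from every client (command name assumed)
sudo cumin 'A:debmonitor' 'debmonitor-client'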

Change 519975 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1132 to m2 master

https://gerrit.wikimedia.org/r/519975

For debmonitor: it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect, given that the old master will still work, just in RO. It might need a:

sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'

just after the switch.

I can take care of that.

The etherpad with the procedure is ready for review.
The patch is also ready for review: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519975/

Note: db2044 needs upgrading

This was done

Mentioned in SAL (#wikimedia-operations) [2019-07-08T14:44:58Z] <marostegui> Restart MySQL on db1132 to enable performance_schema - T226952

Mentioned in SAL (#wikimedia-operations) [2019-07-09T05:13:17Z] <marostegui> Rebooting pc2010 for a second time as per papaul's suggestion T226952


Ignore this, it was for a different task

$ ./replication_tree.py db1065
db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991
+ db1117:3322, version: 10.1.39, up: 32d, RO: ON, binlog: MIXED, lag: 0, processes: 15, latency: 0.0423
+ db1132, version: 10.1.39, up: 14h, RO: ON, binlog: MIXED, lag: 0, processes: 16, latency: 0.0416
+ db2044, version: 10.1.39, up: 4d, RO: ON, binlog: MIXED, lag: 0, processes: None, latency: 0.0046
  + db2078:3322, version: 10.1.39, up: 47d, RO: ON, binlog: MIXED, lag: 0, processes: 14, latency: 0.0056

Change 519975 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1132 to m2 master

https://gerrit.wikimedia.org/r/519975

Mentioned in SAL (#wikimedia-operations) [2019-07-09T06:00:22Z] <marostegui> Failover m2 from db1065 to db1132 - T226952

This was done successfully.

Read only start: 06:00:31 UTC 2019
Read only stop (and proxies reloaded): 06:00:40 UTC 2019

Total read only time: 9 seconds.

Marostegui claimed this task.