
Improve lag detection mechanism's reliability and agility
Closed, ResolvedPublic

Description

The way pt-heartbeat works now imposes some limitations, because it runs detached from the MySQL server.
If it is also going to be used by MW as a lag detection mechanism, and to simplify the switchover process, we need to investigate the best solution to adopt here.

Minimal goals:

  • resilient, needs to always be running, and alarm if not
  • easy to switch datacenter (or always active/active)
  • easy to start/stop. Right now it relies on puppet (slow) and a manual kill/start. Moreover, the start doesn't work well with salt, possibly because of pt-heartbeat's daemonization procedure.
  • not error prone. Right now it needs to be started in each shard with the shard name, which is error prone when done manually

Event Timeline

Volans created this task. Apr 21 2016, 8:47 PM

Maybe merge this into T111266? That task covers the software (MediaWiki) side of things, though, while this one covers the infrastructure side.

jcrespo triaged this task as High priority. Apr 22 2016, 1:30 PM
jcrespo moved this task from Triage to Backlog on the DBA board.
jcrespo renamed this task from Improve lag detection mechanism to Improve lag detection mechanism's reliability and agility. May 10 2016, 3:43 PM
jcrespo added a comment. Edited May 18 2016, 4:36 PM

"resilient, needs to always be running, and alarm if not"

There is a pt-heartbeat instance running on 2 separate hosts in 2 datacenters. In the unlikely case that both fail at the same time (and are not restarted by puppet, which runs every 30 minutes), we will get an alarm from replication lag (plus users will notice immediately when they hit read-only mode).

If an operational error causes it (e.g. a bad schema change or permissions), things will fail anyway.
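As a rough illustration of why two independent writers give this resilience, here is a minimal sketch of a lag calculation that takes the newest heartbeat from any writer. The row layout (`shard`, `datacenter`, `ts` keys) and the function name are assumptions made for this example, not the actual table schema or the check that is deployed.

```python
from datetime import datetime, timezone


def replication_lag_seconds(rows, shard, now=None):
    """Return the lag in seconds for `shard`, or None if no heartbeat exists.

    `rows` is an iterable of dicts with "shard", "datacenter" and "ts" keys
    (hypothetical names for this sketch). "ts" is a timezone-aware datetime
    written by a heartbeat process.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamps = [row["ts"] for row in rows if row["shard"] == shard]
    if not stamps:
        # No writer has ever produced a row: the caller should alarm.
        return None
    # Use the newest heartbeat, so one stopped writer does not inflate the lag.
    newest = max(stamps)
    return max(0.0, (now - newest).total_seconds())
```

With this shape, the reported lag only grows if every writer for the shard stops (or replication really is behind), which is the case that triggers the alarm.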

"easy to switch datacenter (or always active/active)"

Done

"easy to start/stop"

This technically has not been done, but it is no longer needed because it can be started by puppet asynchronously at any time, assuming we do not change both servers at the same time. In fact, we should not need to touch pt-heartbeat again unless we do a master failover.

"not error prone. Right now it needs to be started in each shard with the shard name, which is error prone when done manually"

Again, puppet will handle that. Things like changing the master can be done through orchestration with conftool, but I think that is out of the scope of this ticket.
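To show the kind of mistake that automation removes, here is a minimal sketch of generating the start command from configuration instead of typing it by hand. `KNOWN_SHARDS`, the `heartbeat` database and table names, and the `--shard` and `--datacenter` options are assumptions for this example (the last two are not stock pt-heartbeat flags); the puppet manifests remain the real source of truth.

```python
KNOWN_SHARDS = {"s1", "s2", "s3"}  # hypothetical; would come from configuration


def heartbeat_command(shard, datacenter):
    """Build the argument list for a pt-heartbeat writer on one shard."""
    if shard not in KNOWN_SHARDS:
        # Reject typos early instead of writing rows under a wrong shard name.
        raise ValueError(f"unknown shard: {shard!r}")
    return [
        "pt-heartbeat",
        "--update",       # write heartbeat rows rather than monitor them
        "--daemonize",
        "--database", "heartbeat",   # assumed database name
        "--table", "heartbeat",      # assumed table name
        "--interval", "1",
        "--utc",
        # Assumed site-specific options, not stock pt-heartbeat flags:
        "--shard", shard,
        "--datacenter", datacenter,
    ]
```

Returning a list rather than a shell string means the result can be passed directly to `subprocess.run` without quoting issues.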

jcrespo closed this task as Resolved. May 18 2016, 4:53 PM
jcrespo claimed this task.