The way pt-heartbeat currently works creates some limitations, given that it is detached from the MySQL server.
If it is also going to be used by MW as a mechanism to detect lag, and to simplify the switchover process, we need to investigate the best solution to adopt here.
Minimal goals:
- resilient: it needs to always be running, and alarm if it is not
- easy to switch datacenter (or always active/active)
- easy to start/stop: right now it relies on puppet (slow) and manual kill/start. Moreover, the start doesn't work well with salt, possibly related to the pt-heartbeat daemonization procedure.
- not error prone: right now on each shard it needs to be started with the shard name, which is error prone when done manually
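
To make the "alarm if not running" goal concrete, here is a rough sketch of a staleness check. It assumes the monitoring side can already read the most recent heartbeat timestamp for a shard; the function name `heartbeat_status` and the 5 second threshold are hypothetical and not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_status(last_ts, now, max_age=timedelta(seconds=5)):
    """Return 'OK' or 'CRITICAL' for a heartbeat timestamp.

    last_ts: datetime of the most recent heartbeat row, or None if no
             row was found (assumption: the caller fetches it from the
             heartbeat table).
    now:     current time, passed in so the check is easy to test.
    max_age: hypothetical threshold after which we alarm.
    """
    if last_ts is None:
        # No heartbeat row at all: treat as "not running".
        return "CRITICAL"
    age = now - last_ts
    return "CRITICAL" if age > max_age else "OK"


if __name__ == "__main__":
    now = datetime(2016, 1, 1, 12, 0, 10, tzinfo=timezone.utc)
    fresh = now - timedelta(seconds=1)
    stale = now - timedelta(seconds=60)
    print(heartbeat_status(fresh, now))
    print(heartbeat_status(stale, now))
    print(heartbeat_status(None, now))
```

A check like this only detects that the heartbeat has stopped being written; it does not by itself tell a stopped pt-heartbeat apart from a lagging replica, so it would need to run against the master as well.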