Improve replication lag detection for multi-dc environment
Closed, Resolved, Public

Description

During the last switchover, the kill/start of pt-heartbeat on the masters of the active datacenter was not that smooth.
In particular, starting pt-heartbeat from salt didn't actually work as expected. It needs some investigation to understand whether and how the way pt-heartbeat daemonizes itself conflicts with the way salt executes commands on hosts, and what the possible solutions are.

As a more general topic, we need to find a reliable way to have a lag detection mechanism that works well with our multi-dc, multi-shard environment, given that MW will probably start using it, see T111266.

One possible solution could be to run pt-heartbeat on both masters (eqiad and codfw) all the time, with the limitation that pt-heartbeat is an external process and as such could die, be killed, etc.

Event Timeline

jcrespo claimed this task.
jcrespo moved this task from Backlog to Done on the DBA board.

On T133337 I have set up pt-heartbeat to run on each datacenter's masters, resolving the HA issue and greatly simplifying the master and datacenter failover process. If pt-heartbeat dies on a host, the heartbeat from the remote datacenter will be used automatically.

However, this change has some consequences for the measurement: it no longer reflects the lag between the master and its slaves; it reflects the lag between the local datacenter master and its slaves. Both measurements are the "same" (of course, with an error margin) until lag starts to appear between datacenters, or the connection between them fails. In that case, a remote datacenter will no longer be set in "read only mode" and will think it is up to date.
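
To make the difference between the two measurements concrete, here is a minimal sketch in Python. It is not the code deployed in T133337. It assumes a replica holds one heartbeat timestamp per master (pt-heartbeat keeps one row per `server_id`), and that "used automatically" means taking the freshest row when no specific master is requested. That fallback rule, the function name `heartbeat_lag`, the server ids and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical server ids, for illustration only.
EQIAD_MASTER = 1  # master of the active datacenter
CODFW_MASTER = 2  # master of the remote (local to this replica) datacenter


def heartbeat_lag(rows, now, master_id=None):
    """Return the lag in seconds derived from heartbeat timestamps.

    rows: dict mapping a master's server_id to the last heartbeat
          timestamp (datetime) that has reached this replica.
    master_id: measure against this master only; if None, use the
          freshest heartbeat available (assumed fallback behaviour).
    Returns None when no usable heartbeat exists.
    """
    if master_id is not None:
        ts = rows.get(master_id)
    else:
        ts = max(rows.values(), default=None)
    if ts is None:
        return None
    # Clamp small negative values caused by clock skew.
    return max((now - ts).total_seconds(), 0.0)


# Illustrative state on a codfw replica while the cross-datacenter
# link is lagging: the eqiad heartbeat is 60 s old, the codfw one 1 s.
now = datetime(2016, 5, 1, 12, 0, 0)
rows_on_codfw_replica = {
    EQIAD_MASTER: now - timedelta(seconds=60),
    CODFW_MASTER: now - timedelta(seconds=1),
}

lag_vs_local_master = heartbeat_lag(rows_on_codfw_replica, now, CODFW_MASTER)
lag_vs_active_master = heartbeat_lag(rows_on_codfw_replica, now, EQIAD_MASTER)
lag_freshest = heartbeat_lag(rows_on_codfw_replica, now)

print(lag_vs_local_master, lag_vs_active_master, lag_freshest)
```

In this sketch the lag against the local master stays small while the lag against the active datacenter's master grows, which is the situation described above where a remote datacenter believes it is up to date. Under the assumed fallback rule, the same freshest-row choice is also what keeps the measurement working if one pt-heartbeat process dies.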