Base replication lag detection on heartbeat
Closed, Resolved, Public

Event Timeline

Change 642379 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

You can use either the replication alerting code or the MediaWiki lag detection as a basis for this; both work without any cleanup required (although I'm not opposed to cleanup). I just suggest not reinventing the wheel with a third method :-).

@jcrespo: Orchestrator only supports a single query for all instances. This means we can't supply per-DC/-section/-etc parameters. It also means we can't rely on the section's primary master to be in the $mw_primary DC, as that wasn't true for the misc sections while mw was running in codfw a couple of months ago. So, in short, there's no way to do this using an existing wheel.

:'-(

Then we will indeed need a third method, but I think a slight modification of the query in the Perl script may work, ignoring those parameters. Let me know if I can help.

Some hosts may not work at all: multi-source hosts like labs/clouddb may not be able to use heartbeat, as they need different rows for different replication channels.

For the rest, something like the following may be imperfect but could work:

SELECT greatest(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000) AS lag FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;
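
To make the arithmetic of that query concrete, here is a minimal Python sketch of the same calculation: take the newest heartbeat timestamp, measure its age in microseconds, subtract the 500000 microsecond allowance, and clamp at zero. The function name and the sample timestamps are hypothetical and used only for illustration; this models the query and is not the code that was deployed.

```python
from datetime import datetime, timedelta

GRACE_US = 500_000  # the 500000 microsecond allowance subtracted in the query


def lag_microseconds(heartbeat_ts, now):
    """Mirror GREATEST(0, TIMESTAMPDIFF(MICROSECOND, ts, now) - 500000)
    applied to the newest row (ORDER BY ts DESC LIMIT 1)."""
    newest = max(heartbeat_ts)
    diff_us = (now - newest) // timedelta(microseconds=1)
    return max(0, diff_us - GRACE_US)


now = datetime(2020, 11, 20, 15, 0, 0)
# The second row is a stale leftover; the newest-row rule ignores it.
rows = [now - timedelta(seconds=3), now - timedelta(days=30)]
print(lag_microseconds(rows, now))  # 2500000
print(lag_microseconds([now - timedelta(milliseconds=200)], now))  # 0
```

Note that the result is in microseconds, so a heartbeat younger than half a second reports zero lag.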

Considerations:

  • A non-primary master instance may be running heartbeat (e.g. all parsercache nodes, section masters in backup DC)
    • Ignoring any heartbeat entry generated by the local instance causes all primary masters to show an arbitrary amount of lag, as they will only evaluate stale entries in the heartbeat table.
  • Solution must allow circular replication, like we do leading up to DC switchovers
  • Solution must not assume primary master is in the MW primary DC.
    • This would have broken for the misc sections when MW was in CODFW recently.
  • Ideally an instance would display lag relative to its DC-local master.
    • Otherwise, if the master in the secondary DC is lagging, _all_ instances in that DC will show lag, which just adds a lot of noise for no information gain.
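
The first consideration can be illustrated with a small sketch. It assumes heartbeat rows are keyed by the server_id of the instance that wrote them, and that a lag check drops rows written by the local instance before picking the newest one. The server IDs, the function name, and the 30-day-old leftover row are all hypothetical.

```python
from datetime import datetime, timedelta


def lag_ignoring_local(rows, local_server_id, now):
    """Newest-row lag after dropping rows written by the local instance."""
    remote = [ts for sid, ts in rows.items() if sid != local_server_id]
    if not remote:
        return None  # nothing left to measure against
    return min(now - ts for ts in remote)


now = datetime(2020, 11, 20, 15, 0, 0)

# Heartbeat table as seen on a primary master (hypothetical server_id 1001):
# its own row is fresh, the other row is left over from an old master.
on_primary = {
    1001: now - timedelta(milliseconds=100),
    999: now - timedelta(days=30),
}
print(lag_ignoring_local(on_primary, 1001, now).days)  # 30
```

Here the primary master is healthy, yet it reports 30 days of lag because only the stale row survives the filter.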

If there are no obsolete entries in the heartbeat table, then we can simply do MAX(NOW()-ts) (roughly).

Ways to achieve this:

  1. Clean up the heartbeat table so that it only contains entries that are supposed to be there (T268336: Cleanup heartbeat.heartbeat on all production instances)
    • With this, we can use the oldest entry in the heartbeat table to measure the current lag, because all entries are relevant.
    • This will work for both the current situation, circular replication, and active-active.
  2. Only run pt-heartbeat on primary masters
    • With this, we can use the newest entry in the heartbeat table to measure the current lag, because none of the other entries matter.
    • This will work for the current situation, but will not work for circular replication/active-active, as we're back to having more than one entry that matters.
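
As a rough illustration of the difference between the two options, the sketch below computes lag from the oldest row (option 1, roughly MAX(NOW()-ts)) and from the newest row (option 2) over the same table. The server IDs, timestamps, and function names are hypothetical; this is a model of the reasoning above, not orchestrator's actual query.

```python
from datetime import datetime, timedelta


def lag_oldest_entry(rows, now):
    """Option 1: lag from the oldest row, i.e. roughly MAX(NOW() - ts)."""
    return max(now - ts for ts in rows.values())


def lag_newest_entry(rows, now):
    """Option 2: lag from the newest row, i.e. roughly MIN(NOW() - ts)."""
    return min(now - ts for ts in rows.values())


now = datetime(2020, 11, 20, 15, 0, 0)

# Circular replication: both masters write heartbeats, but the channel
# from the hypothetical server 2001 is 10 seconds behind on this replica.
circular = {
    1001: now - timedelta(seconds=1),
    2001: now - timedelta(seconds=10),
}
print(lag_oldest_entry(circular, now).total_seconds())  # 10.0
print(lag_newest_entry(circular, now).total_seconds())  # 1.0

# The same table with a leftover row from a decommissioned master.
uncleaned = dict(circular)
uncleaned[999] = now - timedelta(days=30)
print(lag_oldest_entry(uncleaned, now).days)  # 30
```

With two relevant rows, the newest-row rule hides the 10 second delay, while the oldest-row rule reports it but only gives a sensible answer once obsolete rows have been removed.
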
herron triaged this task as Medium priority. Nov 20 2020, 3:08 PM

Change 642379 merged by Kormat:
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

Kormat claimed this task.

The orchestrator config change has been deployed, and the heartbeat tables for pc{1,2,3} have been cleaned up. Other sections will need similar cleanups before orchestrator can manage them properly.

We can probably test it by stopping replication on the intermediate master and seeing how orchestrator reports it (although we'd need to move pc2010 back under pc2007, as it was moved up to be a sibling for 10.4.17 testing).

I've tested it in pontoon: stopping heartbeat on the master causes immediate lag to show up for the entire tree.

(and re-starting heartbeat makes the lag disappear ~instantly)