https://github.com/openark/orchestrator/blob/master/docs/configuration-discovery-classifying.md#replication-lag allows us to use a custom query to detect lag.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
orchestrator: Use heartbeat table to detect lag. | operations/puppet | production | +1 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Kormat | T268316 Base replication lag detection on heartbeat
Resolved | | Marostegui | T268336 Cleanup heartbeat.heartbeat on all production instances
Resolved | | Marostegui | T273593 Clean up heartbeat table on clouddb hosts
Resolved | | Marostegui | T281826 Cleanup heartbeat.heartbeat on s2
Resolved | | Marostegui | T281827 Cleanup heartbeat.heartbeat on s3
Resolved | | Marostegui | T281828 Cleanup heartbeat.heartbeat on s5
Resolved | | Marostegui | T281829 Cleanup heartbeat.heartbeat on s6
Resolved | | Marostegui | T281830 Cleanup heartbeat.heartbeat on s8
Event Timeline
Change 642379 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.
Trying to do this in a reasonable fashion doesn't seem possible without T268336: Cleanup heartbeat.heartbeat on all production instances being done first.
You can check both the replication alerting code and the MediaWiki lag detection as bases for this; both work without any cleanup required (although I'm not opposed to it). I just suggest not reinventing the wheel with a third method :-).
@jcrespo: Orchestrator only supports a single query for all instances. This means we can't supply per-DC, per-section, etc. parameters. It also means we can't rely on the section's primary master being in the $mw_primary DC, as that wasn't true for the misc sections while MediaWiki was running in codfw a couple of months ago. So, in short, there's no way to do this using an existing wheel.
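For context, orchestrator's custom lag query is supplied via a single, global `ReplicationLagQuery` key in its JSON config (per the configuration doc linked in the description), which is why per-section parameters aren't possible. A sketch of the relevant fragment; the query value here is only an illustration, not the deployed one:

```json
{
  "ReplicationLagQuery": "SELECT GREATEST(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000) / 1000000 FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1"
}
```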
> Orchestrator only supports a single query for all instances. This means we can't supply per-DC, per-section, etc. parameters.

:'-(
Then indeed we will need a third method, but I think a slight modification of the query in the Perl script may work, ignoring those parameters. Let me know if I can help.
Some hosts may not work at all: multi-source hosts like labs/clouddb may not be able to use heartbeat, as they need different rows for different replication channels.
For the rest, something like the following may be imperfect but could work:
```sql
SELECT GREATEST(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000) AS lag
FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;
```
Considerations:
- A non-primary master instance may be running heartbeat (e.g. all parsercache nodes, section masters in backup DC)
- Ignoring any heartbeat entry generated by the local instance causes all primary masters to show an arbitrary amount of lag, as they will only evaluate stale entries in the heartbeat table.
- The solution must allow circular replication, as we run it leading up to DC switchovers.
- The solution must not assume the primary master is in the MW primary DC.
  - This would have broken for the misc sections when MediaWiki was in codfw recently.
- Ideally, an instance would display lag relative to its DC-local master.
  - Otherwise, if the master in the secondary DC is lagging, _all_ instances in that DC will show lag, which just adds a lot of noise for no information gain.
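To illustrate the arithmetic in the query above (a hypothetical Python sketch, not part of the deployment): the 500000-microsecond subtraction gives the heartbeat writer a grace period, and `GREATEST(0, …)` clamps the result so small clock/write delays don't show up as negative or spurious lag.

```python
from datetime import datetime, timedelta, timezone

def lag_seconds(heartbeat_ts: datetime, now: datetime,
                allowance_us: int = 500_000) -> float:
    """Mirror GREATEST(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - allowance)."""
    diff_us = (now - heartbeat_ts) / timedelta(microseconds=1)
    return max(0, diff_us - allowance_us) / 1e6

now = datetime(2020, 12, 1, 12, 0, 0, tzinfo=timezone.utc)
# Heartbeat written 0.3s ago: within the 0.5s allowance, so no lag reported.
print(lag_seconds(now - timedelta(milliseconds=300), now))   # 0.0
# Heartbeat written 2.5s ago: reported lag is 2.0s after the allowance.
print(lag_seconds(now - timedelta(milliseconds=2500), now))  # 2.0
```

Note that this only works if the row being read was written by a relevant master, which is exactly the problem the considerations above describe.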
If there are no obsolete entries in the heartbeat table, then we can simply do MAX(NOW()-ts) (roughly).
Ways to achieve this:
- Clean up the heartbeat table so that it only contains entries that are supposed to be there (T268336: Cleanup heartbeat.heartbeat on all production instances)
  - With this, we can use the oldest entry in the heartbeat table to measure the current lag, because all entries are relevant.
  - This will work for the current situation, circular replication, and active-active.
- Only run pt-heartbeat on primary masters
  - With this, we can use the newest entry in the heartbeat table to measure the current lag, because none of the other entries matter.
  - This will work for the current situation, but not for circular replication/active-active, as we're back to having more than one entry that matters.
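A toy illustration of the two strategies (hypothetical Python, modelling heartbeat rows as `(writer, age-of-ts-in-seconds)` pairs): with a clean table every row is legitimate, so the oldest row bounds the lag; with stale rows present, only the newest row is meaningful, and a leftover stale row would poison the oldest-row strategy.

```python
# Toy model of heartbeat.heartbeat: each row is (writer, seconds since its ts).

def lag_from_oldest(rows):
    # Strategy 1 (after T268336 cleanup): all rows are relevant, so the
    # oldest row's age is the lag behind the slowest relevant writer.
    return max(age for _, age in rows)

def lag_from_newest(rows):
    # Strategy 2 (pt-heartbeat only on primary masters): only the freshest
    # row matters; any stale leftovers are ignored.
    return min(age for _, age in rows)

clean_table = [("s1-primary", 1.2)]  # after cleanup: only relevant rows remain
print(lag_from_oldest(clean_table))  # 1.2

stale_table = [("s1-primary", 1.2), ("old-master", 86400.0)]  # stale leftover row
print(lag_from_newest(stale_table))  # 1.2 -- stale row ignored
print(lag_from_oldest(stale_table))  # 86400.0 -- stale row poisons the result
```

With circular replication or active-active, more than one row is legitimately fresh, so strategy 2's "take the newest" stops being correct, as noted above.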
Change 642379 merged by Kormat:
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.
The orchestrator config change has been deployed, and the heartbeat tables for pc{1,2,3} have been cleaned up. Other sections will need similar cleanups before orchestrator can manage them properly.
We could probably test it by stopping replication on the intermediate master and seeing how orchestrator detects it (although we'd need to move pc2010 back under pc2007, as it was moved up to be a sibling for 10.4.17 testing).
I've tested it in pontoon: stopping heartbeat on the master causes immediate lag to show up for the entire tree.