Base replication lag detection on heartbeat
Closed, ResolvedPublic

Assigned To

Authored By

	Kormat
	Nov 20 2020, 9:31 AM

Description

https://github.com/openark/orchestrator/blob/master/docs/configuration-discovery-classifying.md#replication-lag allows us to use a custom query to detect lag.

Details

	Subject	Repo	Branch	Lines +/-
	orchestrator: Use heartbeat table to detect lag.	operations/puppet	production	+1 -0


Related Objects

Status	Assigned	Task
Resolved	Kormat	T268316 Base replication lag detection on heartbeat
Resolved	Marostegui	T268336 Cleanup heartbeat.heartbeat on all production instances
Resolved	Marostegui	T273593 Clean up heartbeat table on clouddb hosts
Resolved	Marostegui	T281826 Cleanup heartbeat.heartbeat on s2
Resolved	Marostegui	T281827 Cleanup heartbeat.heartbeat on s3
Resolved	Marostegui	T281828 Cleanup heartbeat.heartbeat on s5
Resolved	Marostegui	T281829 Cleanup heartbeat.heartbeat on s6
Resolved	Marostegui	T281830 Cleanup heartbeat.heartbeat on s8

Event Timeline

Kormat created this task. Nov 20 2020, 9:31 AM

Restricted Application added a subscriber: Aklapper. Nov 20 2020, 9:31 AM

LSobanski subscribed. Nov 20 2020, 9:35 AM

Change 642379 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

gerritbot added a project: Patch-For-Review. Nov 20 2020, 11:52 AM

Trying to do this in a reasonable fashion doesn't seem possible without T268336: Cleanup heartbeat.heartbeat on all production instances being done first.

You can check both the replication alerting code and the MediaWiki lag detection as bases for this; both work without any cleanup required (although I'm not opposed to it). I just suggest not reinventing the wheel with a 3rd method :-).

@jcrespo: Orchestrator only supports a single query for all instances. This means we can't supply per-DC/-section/-etc parameters. It also means we can't rely on the section's primary master to be in the $mw_primary DC, as that wasn't true for the misc sections while mw was running in codfw a couple of months ago. So, in short, there's no way to do this using an existing wheel.

@jcrespo: Orchestrator only supports a single query for all instances. This means we can't supply per-DC/-section/-etc parameters.

:'-(

Then indeed we will need a third method, but I think a slight modification of the query in the Perl script may work, ignoring those parameters. Let me know if I can help.

Some hosts may not work at all: multisource hosts like labs/clouddb may not be able to use heartbeat at all, as they need different rows for different replication channels.

For the rest, something like this may be imperfect but could work:

SELECT greatest(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000) AS lag FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;
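
To make the arithmetic of that query concrete, here is a minimal Python sketch of what it appears to compute: take the newest `ts`, measure its age in microseconds, subtract the 500000 µs allowance, and clamp at zero. The function name, the sample timestamps and the fixed `now` are assumptions for illustration only; the reason for the 0.5 s allowance is not stated in this task.

```python
from datetime import datetime, timedelta, timezone

GRACE_US = 500_000  # the 500000 microseconds subtracted in the query above


def lag_microseconds(heartbeat_ts, now):
    """Mimic greatest(0, TIMESTAMPDIFF(MICROSECOND, ts, now) - 500000) on the newest row."""
    newest = max(heartbeat_ts)  # equivalent of ORDER BY ts DESC LIMIT 1
    age_us = (now - newest) // timedelta(microseconds=1)  # whole microseconds
    return max(0, age_us - GRACE_US)  # never report negative lag


# Hypothetical sample data: one very old row plus one row written 2 s ago.
now = datetime(2020, 11, 20, 12, 0, 0, tzinfo=timezone.utc)
with_old_row = [now - timedelta(days=30), now - timedelta(seconds=2)]
fresh_only = [now - timedelta(milliseconds=200)]

print(lag_microseconds(with_old_row, now))  # 1500000
print(lag_microseconds(fresh_only, now))    # 0
```

Because only the newest row is used, old rows do not inflate the result in this variant; the trade-off is that any other row that does matter is ignored as well.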

Considerations:

A non-primary master instance may be running heartbeat (e.g. all parsercache nodes, section masters in backup DC)
- Ignoring any heartbeat entry generated by the local instance causes all primary masters to show an arbitrary amount of lag, as they will only evaluate stale entries in the heartbeat table.
Solution must allow circular replication, like we do leading up to DC switchovers
Solution must not assume primary master is in the MW primary DC.
- This would have broken for the misc sections when MW was in CODFW recently.
Ideally an instance would display lag relative to its DC-local master.
- Otherwise if master in the secondary DC is lagging, _all_ instances in that DC will show lag, which just adds a lot of noise for no information gain.
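
The first consideration above can be illustrated with a small Python sketch. It assumes a simplified heartbeat table of `(server_id, ts)` pairs and a hypothetical helper that skips rows written by the local instance; the server IDs and the 90-day-old leftover row are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def newest_remote_lag(rows, local_server_id, now):
    """Lag in seconds from the newest heartbeat row NOT written by this instance."""
    remote = [ts for server_id, ts in rows if server_id != local_server_id]
    if not remote:
        return None  # nothing left to measure against
    return (now - max(remote)).total_seconds()


now = datetime(2020, 11, 20, 12, 0, 0, tzinfo=timezone.utc)

# Hypothetical table contents: server 1 is the current primary master and
# writes its own heartbeat; server 2 is a former master whose row was never removed.
rows = [(1, now), (2, now - timedelta(days=90))]

on_primary = newest_remote_lag(rows, local_server_id=1, now=now)
on_replica = newest_remote_lag(rows, local_server_id=3, now=now)

print(on_primary)  # 7776000.0 (only the stale row is left to evaluate)
print(on_replica)  # 0.0
```

On the primary, the only rows left after the filter are stale ones, so the reported lag is whatever age the leftover row happens to have.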

If there are no obsolete entries in the heartbeat table, then we can simply do MAX(NOW()-ts) (roughly).

Ways to achieve this:

Clean up the heartbeat table so that it only contains entries that are supposed to be there (T268336: Cleanup heartbeat.heartbeat on all production instances)
- With this, we can use the oldest entry in the heartbeat table to measure the current lag, because all entries are relevant.
- This will work for both the current situation, circular replication, and active-active.
Only run pt-heartbeat on primary masters
- With this, we can use the newest entry in the heartbeat table to measure the current lag, because none of the other entries matter.
- This will work for the current situation, but will not for circular replication/active-active, as we're back to having >1 entry that matters.
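
The difference between the two options can be sketched in Python as follows. This is an illustration under assumptions, not the query that was deployed: the writer names, the timestamps and the dictionary layout are hypothetical, and the scenario is a replica during circular replication where the heartbeat from one master has stopped arriving.

```python
from datetime import datetime, timedelta, timezone


def lag_oldest(rows, now):
    """Option 1: MAX(now - ts). Only meaningful if every row is supposed to be there."""
    return max((now - ts).total_seconds() for ts in rows.values())


def lag_newest(rows, now):
    """Option 2: MIN(now - ts). Only meaningful if a single writer matters."""
    return min((now - ts).total_seconds() for ts in rows.values())


now = datetime(2020, 11, 20, 12, 0, 0, tzinfo=timezone.utc)

# Hypothetical cleaned-up table seen from a replica during circular replication:
# both masters write heartbeats, and the one from codfw is 40 s behind.
clean = {
    "eqiad-master": now - timedelta(seconds=1),
    "codfw-master": now - timedelta(seconds=40),
}
# The same table with a leftover row from a host that no longer writes heartbeats.
dirty = {**clean, "decommissioned-host": now - timedelta(days=365)}

print(lag_oldest(clean, now))  # 40.0 -> the lagging channel is visible
print(lag_newest(clean, now))  # 1.0  -> the lagging channel is hidden
print(lag_oldest(dirty, now))  # 31536000.0 -> the stale row dominates
```

The last line shows why the first option depends on the cleanup in T268336: a single obsolete row is enough to make the oldest-entry measurement meaningless.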

herron triaged this task as Medium priority. Nov 20 2020, 3:08 PM

LSobanski moved this task from Triage to Refine on the DBA board. Nov 24 2020, 12:48 PM

Change 642379 merged by Kormat:
[operations/puppet@production] orchestrator: Use heartbeat table to detect lag.

https://gerrit.wikimedia.org/r/642379

The orchestrator config change has been deployed, and the heartbeat tables for pc{1,2,3} have been cleaned up. Other sections will need similar cleanups before orchestrator can manage them properly.

In T268316#6647820, @Kormat wrote:

The orchestrator config change has been deployed, and the heartbeat tables for pc{1,2,3} have been cleaned up. Other sections will need similar cleanups before orchestrator can manage them properly.

We can probably test it by stopping replication on the intermediate master and seeing how orchestrator detects it (although we'd need to move pc2010 back under pc2007, as it was moved up to be a sibling for 10.4.17 testing).

I've tested it in pontoon — stopping heartbeat on the master causes immediate lag to show up for the entire tree.

\o/

(and re-starting heartbeat makes the lag disappear ~instantly)

Maintenance_bot removed a project: Patch-For-Review. Nov 25 2020, 10:10 AM

Marostegui closed subtask T268336: Cleanup heartbeat.heartbeat on all production instances as Resolved. May 6 2021, 7:27 AM
