
MediaWiki replication lags are inflated by an average of half a second
Open, Needs Triage, Public

Assigned To
None
Authored By
Ladsgroup
Jan 25 2023, 4:40 AM

Description

How pt-heartbeat works:
Once every second (the cadence is important), the service updates a row in the heartbeat.heartbeat table on the master and lets it replicate. Here are two examples of the SQL commands issued:

REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2023-01-25T01:16:36.000820', '171966508', 'db1157-bin.002873', '1933994', NULL, NULL, 's3', 'eqiad')
REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2023-01-25T01:16:37.000960', '171966508', 'db1157-bin.002873', '2268334', NULL, NULL, 's3', 'eqiad')

The timestamp is the timestamp of the SQL command (it does not use NOW(), since that would make the value useless under statement-based replication (SBR)).
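
As an illustration, here is a minimal sketch in Python of what the writer side does. This is not the actual pt-heartbeat code (the real tool is Percona's); conn is an assumed DB-API connection to the master, and only the columns from the examples above are shown:

import time
from datetime import datetime, timezone

def heartbeat_loop(conn, server_id, shard, datacenter):
    # Illustration only; the real pt-heartbeat is Percona's tool.
    while True:
        # The timestamp is generated client-side rather than via NOW(),
        # so the literal value replicates verbatim under SBR and the
        # replica can compare it against its own clock.
        ts = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
        cursor = conn.cursor()
        cursor.execute(
            "REPLACE INTO heartbeat.heartbeat"
            " (ts, server_id, shard, datacenter)"
            " VALUES (%s, %s, %s, %s)",
            (ts, server_id, shard, datacenter),
        )
        conn.commit()
        time.sleep(1)  # the 1s cadence is what creates the 0..1s sampling error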

How MediaWiki calculates the replication lag:
Now, when MediaWiki wants to look up the replication lag, it issues this query:

SELECT TIMESTAMPDIFF(MICROSECOND,ts,UTC_TIMESTAMP(6)) AS us_ago FROM heartbeat.heartbeat WHERE $where ORDER BY ts DESC LIMIT 1
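
In Python terms, that amounts to something like the sketch below. This is only an illustration (the real code is PHP, in MediaWiki's Wikimedia\Rdbms library); cursor is an assumed DB-API cursor on the replica, and the $where clause is omitted:

def get_reported_lag(cursor):
    # Returns the time since the last heartbeat row, in seconds.
    cursor.execute(
        "SELECT TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) AS us_ago"
        " FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1"
    )
    (us_ago,) = cursor.fetchone()
    return us_ago / 1e6  # time since the last heartbeat, not the true lag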

This has so many issues:

  • It measures the time since the last heartbeat, not since the last replicated query. For example, if the real replag is 0.2s (it is usually around that), MediaWiki would report anything from 0.2s to 1.2s depending on where in the one-second heartbeat cycle the check lands. That averages out to half a second of error, which is why our internal DB tools subtract 0.5s from the value (and clamp at zero [1]); see the simulation sketch after this list.
  • We use REPEATABLE READ. If for whatever reason you read the value again within the same transaction, you might get an extremely large replag. I checked and couldn't find CONN_TRX_AUTOCOMMIT being used when fetching the replag (using it would also be quite expensive).
  • The timestamp of the insert is based on the clock of the primary host, while the query compares it against the clock of the replica. No two separate hosts have fully synchronized clocks. We could minimize the skew by buying atomic clocks and running a time server in each of our datacenters (Google does this for some of its distributed systems), but doing that here feels like mowing your lawn with a tank.
  • Given all of the above, microsecond precision is completely unnecessary. Even with the 0.5s subtraction, the replag values carry an error roughly five orders of magnitude larger than microsecond precision.
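
To make the sampling error from the first bullet concrete, here is a small self-contained simulation (the 0.2s true lag is an assumption for illustration, not a measurement):

import random

TRUE_LAG = 0.2      # assumed real replication lag, in seconds
SAMPLES = 100_000

# The heartbeat fires once a second, so the lag check lands at a uniformly
# random point in the cycle: reported = true lag + U[0, 1) seconds.
reported = [TRUE_LAG + random.random() for _ in range(SAMPLES)]
avg_error = sum(r - TRUE_LAG for r in reported) / SAMPLES
print(f"reported range: {min(reported):.3f}s..{max(reported):.3f}s")
print(f"average error: {avg_error:.3f}s")  # converges to ~0.500s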

[1] Clamping at zero, combined with the fact that the replag is below 0.5s most of the time, means that if you want to minimize the error you would have to pick a different offset than 0.5s. The math for that gets complicated and would require calculus of variations. I can try to do it, but I'm not sure it's worth it, especially since any such calculation would require either assuming the replag follows a normal distribution between foo and bar (a big assumption) or collecting data to get a sense of its actual distribution. But again, is it really worth it?

My suggestion to move forward: Reduce the precision to millisecond, subtract 0.5s from the value, and clamp at zero.
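
A sketch of that suggestion (my own illustration, not the actual Rdbms patch):

def corrected_lag(us_ago: int) -> float:
    # Reduce to millisecond precision, subtract the 0.5s average
    # heartbeat offset, and clamp at zero.
    # e.g. corrected_lag(700_820) -> ~0.201
    seconds = round(us_ago / 1e6, 3)
    return max(seconds - 0.5, 0.0)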

Event Timeline

Restricted Application added a subscriber: Aklapper.

I got nerd-sniped into figuring out what the average error would be if the actual replication lag is a constant smaller than the value being subtracted (the suggestion here being 0.5s). Let the replag be X and let Y be the value we subtract from every report. The average error would be something close to this:

grafik.png (61×192 px, 2 KB)

Which is a solution to this general integral (for the expected value):

grafik.png (52×235 px, 2 KB)

Then, assuming X < Y, it turns into:

grafik.png (64×222 px, 3 KB)

When you solve it, you'd get:

grafik.png (78×212 px, 3 KB)

And assuming (Y-X)^2 << 1, you can simplify it and get the first formula.
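
Since the embedded formula images may not be legible, here is one plausible reconstruction of the setup in LaTeX. This is my reading of the derivation, not a transcription of the images: the reported lag is max(X + u - Y, 0), with the heartbeat offset u uniform on [0, 1) and X < Y <= 1:

% Plausible reconstruction (not a transcription of the image formulas).
% Model: reported lag = max(X + u - Y, 0), heartbeat offset u ~ U[0,1),
% constant true lag X, subtracted offset Y, with X < Y <= 1.
\begin{align*}
E(X, Y) &= \int_0^1 \bigl|\max(X + u - Y, 0) - X\bigr|\, du \\
        &= \int_0^{Y-X} X \, du
         + \int_{Y-X}^{Y} (Y - u)\, du
         + \int_{Y}^{1} (u - Y)\, du \\
        &= X(Y - X) + \frac{X^2}{2} + \frac{(1 - Y)^2}{2}
\end{align*}

Under this reading, setting dE/dY = X - (1 - Y) = 0 would put the optimum at Y = 1 - X, which the numeric sketch below can sanity-check.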

Unfortunately, trying to find an optimum for the error by taking the derivative of the above formula with respect to Y and setting it to zero didn't yield any results (I might be missing something; my math is quite rusty and I was notoriously clumsy at calculations in university). We can also solve it numerically if we can get a rough value of X, which shouldn't be too hard; a sketch of that follows.
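
For the numeric route, a small sketch using the reconstructed closed form above (the true lag X = 0.2s is an assumed placeholder, to be replaced with a measured value):

def expected_error(x: float, y: float) -> float:
    # Reconstructed closed form; valid for 0 <= x <= y <= 1.
    return x * (y - x) + x * x / 2 + (1 - y) ** 2 / 2

X = 0.2  # assumed constant true lag; substitute a measured value
candidates = [i / 1000 for i in range(int(X * 1000), 1001)]
best_y = min(candidates, key=lambda y: expected_error(X, y))
print(f"optimal Y ~= {best_y}, expected error {expected_error(X, best_y):.3f}s")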

For context: This came up as part of debugging the root cause of T314975: Rdbms library is too aggressive in changing to read-only.

As I understand it, and correct me if I'm wrong, this half-second offset doesn't directly cause increased latency or visible errors. As @Ladsgroup points out, Rdbms only considers the whole-number component, and more specifically we really only care about whether it's under or over 5 seconds. That is to say, can we expect observable side effects from a pathological worst-case scenario where the code always reports a fixed minimum of 0.5s or even 1.0s?

This is certainly something we should clean up, but it's not clear to me whether it played into the (very real) problem of MW frequently going into read-only mode (unless replication lag of 4.x seconds is common).

> My suggestion to move forward: Reduce the precision to millisecond, subtract 0.5s from the value, and clamp at zero.

Sounds good to me!