
MediaWiki replication lags are inflated by an average of half a second
Open, Needs Triage, Public

Assigned To
None
Authored By
Ladsgroup
Jan 25 2023, 4:40 AM

Description

How pt-heartbeat works:
Once every second (the cadence is important), the service updates a row in the heartbeat.heartbeat table on the master and lets it replicate. Here are two examples of the SQL commands issued:

REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2023-01-25T01:16:36.000820', '171966508', 'db1157-bin.002873', '1933994', NULL, NULL, 's3', 'eqiad')
REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2023-01-25T01:16:37.000960', '171966508', 'db1157-bin.002873', '2268334', NULL, NULL, 's3', 'eqiad')

The timestamp is the timestamp of the SQL command (it does not use NOW(), since that would make the value useless under statement-based replication (SBR)).
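
As an illustration, here is a minimal sketch in Python of what the writer side does. This is not the actual pt-heartbeat code (the real tool is Percona's); conn is an assumed DB-API connection to the master, and only the columns from the examples above are shown:

import time
from datetime import datetime, timezone

def heartbeat_loop(conn, server_id, shard, datacenter):
    # Illustration only; the real pt-heartbeat is Percona's tool.
    while True:
        # The timestamp is generated client-side rather than via NOW(),
        # so the literal value replicates verbatim under SBR and the
        # replica can compare it against its own clock.
        ts = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
        cursor = conn.cursor()
        cursor.execute(
            "REPLACE INTO heartbeat.heartbeat"
            " (ts, server_id, shard, datacenter)"
            " VALUES (%s, %s, %s, %s)",
            (ts, server_id, shard, datacenter),
        )
        conn.commit()
        time.sleep(1)  # the 1s cadence is what creates the 0..1s sampling error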

How MediaWiki calculates the replication lag:
Now, when MediaWiki wants to look up the replication lag, it issues this query:

SELECT TIMESTAMPDIFF(MICROSECOND,ts,UTC_TIMESTAMP(6)) AS us_ago FROM heartbeat.heartbeat WHERE $where ORDER BY ts DESC LIMIT 1
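
In Python terms, that amounts to something like the sketch below. This is only an illustration (the real code is PHP, in MediaWiki's Wikimedia\Rdbms library); cursor is an assumed DB-API cursor on the replica, and the $where clause is omitted:

def get_reported_lag(cursor):
    # Returns the time since the last heartbeat row, in seconds.
    cursor.execute(
        "SELECT TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) AS us_ago"
        " FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1"
    )
    (us_ago,) = cursor.fetchone()
    return us_ago / 1e6  # time since the last heartbeat, not the true lag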

This has so many issues:

  • It measures the time since the last heartbeat, not since the last replicated query. For example, if the real replag is 0.2s (it is usually around that), MediaWiki would report anything from 0.2s to 1.2s depending on where in the one-second heartbeat cycle the check lands. That averages out to half a second of error, which is why our internal DB tools subtract 0.5s from the value (and clamp at zero [1]); see the simulation sketch after this list.
  • We use REPEATABLE READ. If for whatever reason you read the value again within the same transaction, you might get an extremely large replag. I checked and couldn't find CONN_TRX_AUTOCOMMIT being used when fetching the replag (using it would also be quite expensive).
  • The timestamp of the insert is based on the clock of the primary host, while the query compares it against the clock of the replica. No two separate hosts have fully synchronized clocks. We could minimize the skew by buying atomic clocks and running a time server in each of our datacenters (Google does this for some of its distributed systems), but doing that here feels like mowing your lawn with a tank.
  • Given all of the above, microsecond precision is completely unnecessary. Even with the 0.5s subtraction, the replag values carry an error roughly five orders of magnitude larger than microsecond precision.
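
To make the sampling error from the first bullet concrete, here is a small self-contained simulation (the 0.2s true lag is an assumption for illustration, not a measurement):

import random

TRUE_LAG = 0.2      # assumed real replication lag, in seconds
SAMPLES = 100_000

# The heartbeat fires once a second, so the lag check lands at a uniformly
# random point in the cycle: reported = true lag + U[0, 1) seconds.
reported = [TRUE_LAG + random.random() for _ in range(SAMPLES)]
avg_error = sum(r - TRUE_LAG for r in reported) / SAMPLES
print(f"reported range: {min(reported):.3f}s..{max(reported):.3f}s")
print(f"average error: {avg_error:.3f}s")  # converges to ~0.500s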

[1] Clamping at zero, combined with the fact that the replag is below 0.5s most of the time, means that if you want to minimize the error you would have to pick a different offset than 0.5s. The math for that gets complicated and would require calculus of variations. I can try to do it, but I'm not sure it's worth it, especially since any such calculation would require either assuming the replag follows a normal distribution between foo and bar (a big assumption) or collecting data to get a sense of its actual distribution. But again, is it really worth it?

My suggestion to move forward: Reduce the precision to millisecond, subtract 0.5s from the value, and clamp at zero.
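
A sketch of that suggestion (my own illustration, not the actual Rdbms patch):

def corrected_lag(us_ago: int) -> float:
    # Reduce to millisecond precision, subtract the 0.5s average
    # heartbeat offset, and clamp at zero.
    # e.g. corrected_lag(700_820) -> ~0.201
    seconds = round(us_ago / 1e6, 3)
    return max(seconds - 0.5, 0.0)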

Event Timeline

Restricted Application added a subscriber: Aklapper.

I got nerd-sniped into figuring out what the average error would be if the actual replication lag is a constant smaller than the value being subtracted (the suggestion here being 0.5s). Let the replag be X and let Y be the value we subtract from every report. The average error would be something close to this:

grafik.png (61×192 px, 2 KB)

Which is a solution to this general integral (for the expected value):

grafik.png (52×235 px, 2 KB)

Then, assuming X < Y, it turns into:

grafik.png (64×222 px, 3 KB)

When you solve it, you'd get:

grafik.png (78×212 px, 3 KB)

And assuming (Y-X)^2 << 1, you can simplify it and get the first formula.
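
Since the embedded formula images may not be legible, here is one plausible reconstruction of the setup in LaTeX. This is my reading of the derivation, not a transcription of the images: the reported lag is max(X + u - Y, 0), with the heartbeat offset u uniform on [0, 1) and X < Y <= 1:

% Plausible reconstruction (not a transcription of the image formulas).
% Model: reported lag = max(X + u - Y, 0), heartbeat offset u ~ U[0,1),
% constant true lag X, subtracted offset Y, with X < Y <= 1.
\begin{align*}
E(X, Y) &= \int_0^1 \bigl|\max(X + u - Y, 0) - X\bigr|\, du \\
        &= \int_0^{Y-X} X \, du
         + \int_{Y-X}^{Y} (Y - u)\, du
         + \int_{Y}^{1} (u - Y)\, du \\
        &= X(Y - X) + \frac{X^2}{2} + \frac{(1 - Y)^2}{2}
\end{align*}

Under this reading, setting dE/dY = X - (1 - Y) = 0 would put the optimum at Y = 1 - X, which the numeric sketch below can sanity-check.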

Unfortunately, trying to find an optimum for the error by taking the derivative of the above formula with respect to Y and setting it to zero didn't yield any results (I might be missing something; my math is quite rusty and I was notoriously clumsy at calculations in university). We can also solve it numerically if we can get a rough value of X, which shouldn't be too hard; a sketch of that follows.
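
For the numeric route, a small sketch using the reconstructed closed form above (the true lag X = 0.2s is an assumed placeholder, to be replaced with a measured value):

def expected_error(x: float, y: float) -> float:
    # Reconstructed closed form; valid for 0 <= x <= y <= 1.
    return x * (y - x) + x * x / 2 + (1 - y) ** 2 / 2

X = 0.2  # assumed constant true lag; substitute a measured value
candidates = [i / 1000 for i in range(int(X * 1000), 1001)]
best_y = min(candidates, key=lambda y: expected_error(X, y))
print(f"optimal Y ~= {best_y}, expected error {expected_error(X, best_y):.3f}s")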

For context: This came up as part of debugging the root cause of T314975: Rdbms library is too aggressive in changing to read-only.

As I understand it, and correct me if I'm wrong, this half-second offset doesn't directly cause increased latency or visible errors. As @Ladsgroup points out, Rdbms only considers the whole-number component, and more specifically we really only care about whether it's under or over 5 seconds. That is to say, can we expect observable side effects from a pathological worst-case scenario where the code always reports a fixed minimum of 0.5s or even 1.0s?

This is certainly something we should clean up, but it's not clear to me whether it played into the (very real) problem of MW frequently going into read-only mode (unless replication lag of 4.x seconds is common).

> My suggestion to move forward: Reduce the precision to millisecond, subtract 0.5s from the value, and clamp at zero.

Sounds good to me!