Page MenuHomePhabricator

MediaWiki replication lags are inflated with average of half a second
Open, Needs TriagePublic

Assigned To
Authored By
Wed, Jan 25, 4:40 AM
Referenced Files
F36487485: grafik.png
Wed, Jan 25, 8:32 AM
F36487477: grafik.png
Wed, Jan 25, 8:29 AM
F36487470: grafik.png
Wed, Jan 25, 8:29 AM
F36487467: grafik.png
Wed, Jan 25, 8:29 AM
F36487464: grafik.png
Wed, Jan 25, 8:29 AM
"The World Burns" token, awarded by Marostegui.


How pt-heartbeat works:
Once every second (the cadence is important), the service updates a row in heartbeat.heartbeat table in the master and lets it replicate. Here are two examples of SQL commands issued:

REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2023-01-25T01:16:36.000820', '171966508', 'db1157-bin.002873', '1933994', NULL, NULL, 's3', 'eqiad')
REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2023-01-25T01:16:37.000960', '171966508', 'db1157-bin.002873', '2268334', NULL, NULL, 's3', 'eqiad')

The timestamp, is the timestamp of SQL command (it's not using NOW() because then it would make it useless in SBR)

How mediawiki calculates the replication lag:
Now, when mediawiki wants to look up replication lag, it makes this query:


This has so many issues:

  • It's the time from the last heartbeat, not from the last landed query. e.g. if replag is 0.2 second (mostly replag is around this number), mw would report anything from 0.2s to 1.2s at random. That averages to half a second of error and that's why our internal db tools takes off 0.5s from the value (and max it with zero[1])
  • We have repeatable read. If for whatever reason you read the value again, you might get an extremely large replag, I checked and couldn't find CONN_TRX_AUTOCOMMIT being used in getting replag (using that would be quite expensive)
  • The timestamp of insert is based on the clock of the primary host, while the query checks the value is based on the clock of the replica. No two separate hosts have fully synchronized clocks. We could minimize it by buying atomic clocks and having a timeserver in each one of our datacenters (Google does it for some other distributed works) but doing it here feels like mowing your lawn with a tank.
  • Given all the above, microsecond precision is completely unnecessary. Even with 0.5s removal, the replag values has error rate of five orders of magnitude bigger the microsecond precision.

[1] maxing with zero combined with the fact that most of the time replag is below 0.5 means if want to minimize the error in values you'd have to pick a different number than 0.5s, the math for it gets complicated and would require calculus of variation. I can try to do it but not sure if it's worth it specially given that any sort of calculation would require either assuming replag is a normal distribution between foo and bar which is a big assumption or collecting data and get some sense of its distribution but again, is it really worth it?

My suggestion to move forward: Reduce the precision to millisecond and remove 0.5s from the value (and max with zero).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I got nerd sniped into figuring out what would be the average error if the actual replication lag is a constant smaller than value removed (suggestion here being 0.5s). If the replag is X and the value we are reducing from every report would be Y. The average error would be something close to this:

grafik.png (61×192 px, 2 KB)

Which is a solution to this general integration (for expected value):

grafik.png (52×235 px, 2 KB)

then assuming X < Y. It turns into:

grafik.png (64×222 px, 3 KB)

When you solve it, you'd get:

grafik.png (78×212 px, 3 KB)

And assuming (Y-X)^2 would be << 1, you can simplify it and get the first formula.

Unfortunately, trying to find an optima for the error values by taking a derivative of the above formula over Y and make it equal to zero didn't yield any results (I might be missing something, my math is quite rusty and I was notoriously clumsy in doing calculations in university). We can also solve it numerically if we can get a rough value of X which shouldn't be too hard.