Page MenuHomePhabricator

Mysterious replication lag observed by MW in Codfw
Closed, ResolvedPublic

Description

As part of trying to verify whether T218692 is fixed, I found the following unexpected messages in the logs:

2020-03-25T14:15:43 Database is read-only: The master database server is running in read-only mode.
…
2020-03-25T14:14:30 Server {host} has 15.94904589653 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.946383953094 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.943606853485 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.940775871277 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.938050985336 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.935436010361 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.932633876801 seconds of lag (>= 6)
2020-03-25T14:14:30 Server {host} has 15.929592847824 seconds of lag (>= 6)
2020-03-25T14:14:30 LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
2020-03-25T14:14:22 LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode

It's unexpected that when the DB master is read-only, that Rdbms claims we're read-only because of lagged replicas. However, given that the main write query did fail correctly with a "DB is read-only" message, I've marked T218692 as resolved.

The other messages looked like perhaps a regression from T216496, which was about MW wrongly reporting "lagged replica mode" when there was no master DB.

This is not the case either. It is not wrongly reporting "lagged replica mode". Rather, it is in fact reporting real replication lag and it can be seen above that the code tried many different servers in codfw, all reporting replication lag.

But, @Marostegui confirmed there is no real lag. Not between codfw-master and its replicas, and also not relative to eqiad-master. All well under a second.

So the question is:

  • Where are these values coming from?
  • Are they result of a bug? If so, we need to fix it.
  • Are they real but mean something else? Then we need to rephase this message and then also make sure that it doesn't result in replica-hopping if it isn't real lag.

Event Timeline

For the record, confirmed no lag shown on pt-heartbeat table entries or show slave status\G.

there is no real lag

I've already written many dissertations :-) about why this happens, you will find them on many other tickets. This happens on eqiad, too.

I cannot say about the specifics of this case, but I can repeat the underlying architecture misconceptions. Lag is measured in application-server wall clock time, not database time. Databases work in database time (e.g. freezed at the start of a commit). Any differences in both (e.g. mediawiki application overload causing it to "pause/freeze" execution, clock mismatches, or anything similar (that is why I cannot say the specifics on this case), will report nonexistent lag.

We have code in various places that ask for the "lag" of the data they are about to process. For example, to inform a TTL for a cache or long-term storage. In some conditions there, no matter what the reason is, if we cannot be sure it is fresh enough, then we don't want to store it or not as long.

For those use cases, we do need to take into account I think the repeatable-read snapshot . That is "real" lag from the app perspective. It is "real" that the rows it reads may be outdated by upto that much.

However, there are also cases where we use lag to decide which replica to connect to, and to sometimes induce artificial read-only mode if we are "too lagged". For those use cases, we would be incorrect to account for repeatable-read and other reasons why of perceived lag other than actual "real" replication lag.

I'm not sure, but my suspicion is that perhaps we are not treating these as separate concepts and thus causes the confusing log messages we see, as well as as the eratic behaviour of users randomly being denied editing of pages due to some slow request perhaps inducing read-only mode for a short time.

Just to be clear, I know there is a reason to it- but it is something to have into account when logging/taking decisions. Note it is *not only* REPEATABLE READ. Requests cannot guarantee real time processing (or pseudo-real-time), and they sometimes read from the db, get stalled for X reason, and then process the data on the app server with a delay. I don't have a solution for that, but I know it is an issue.

Anomie added a subscriber: Anomie.

It seems unlikely that anyone will be able to debug this unless they catch it actively happening. At the moment, that does not seem to be the case.

anomie@mwmaint2001:~$ for s in /srv/mediawiki/dblists/s?.dblist; do WIKI=$(expanddblist $s | head -1); echo "=== $WIKI ==="; echo 'var_export( MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancer()->getLagTimes() );' | mwscript eval.php --wiki=$WIKI; done
=== enwiki ===
array (
  0 => 0,
  1 => 0.3927140235900879,
  2 => 0.39432597160339355,
  3 => 0.39606308937072754,
  4 => 0.39774394035339355,
  5 => 0.3994600772857666,
  6 => 0.40134215354919434,
  7 => 0.40333104133605957,
  8 => 0.4052579402923584,
)

=== bgwiki ===
array (
  0 => 0,
  1 => 0.8811149597167969,
  2 => 0.8836758136749268,
  3 => 0.8856649398803711,
  4 => 0.8880648612976074,
  5 => 0.8910388946533203,
  6 => 0.8933279514312744,
)

=== aawiki ===
array (
  0 => 0,
  1 => 0.22917795181274414,
  2 => 0.23196697235107422,
  3 => 0.23422503471374512,
)

=== commonswiki ===
array (
  0 => 0,
  1 => 0.5892860889434814,
  2 => 0.5920021533966064,
  3 => 0.5947310924530029,
  4 => 0.5971100330352783,
  5 => 0.5994229316711426,
  6 => 0.6016011238098145,
)

=== cebwiki ===
array (
  0 => 0,
  1 => 0.9192631244659424,
  2 => 0.9220421314239502,
  3 => 0.9246621131896973,
  4 => 0.9277920722961426,
  5 => 0.9305150508880615,
  6 => 0.9327740669250488,
)

=== frwiki ===
array (
  0 => 0,
  1 => 0.28902387619018555,
  2 => 0.2922019958496094,
  3 => 0.29476284980773926,
  4 => 0.2970387935638428,
  5 => 0.2998929023742676,
  6 => 0.30250000953674316,
)

=== arwiki ===
array (
  0 => 0,
  1 => 0.649118185043335,
  2 => 0.6520121097564697,
  3 => 0.6540911197662354,
  4 => 0.6561851501464844,
  5 => 0.658250093460083,
  6 => 0.6603481769561768,
)

=== wikidatawiki ===
array (
  0 => 0,
  1 => 0.5729179382324219,
  2 => 0.574760913848877,
  3 => 0.5767829418182373,
  4 => 0.5786409378051758,
  5 => 0.5804879665374756,
  6 => 0.5824511051177979,
)

As such, I'm going to untag Platform Engineering for now.

Is it possible that there was real lag at 14:14:30, but it had resolved itself by the time @Marostegui checked it?

Another remote possibility would be clock drift between the (PHP) host in question and pt-heartbeat, as it appears the lag is currently calculated by subtracting the pt-heartbeat timestamp from the local server's current time (as of just before the query for pt-heartbeat was sent to the DB).

Is it possible that there was real lag at 14:14:30, but it had resolved itself by the time @Marostegui checked it?

From what I can see db1105:3311 peaked lag (or the graph shows so at around that time https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1105&var-port=13311&fullscreen&panelId=6&from=1585144774413&to=1585146868501

This might or might be not a coincidence. For the hosts we were checking in codfw, we saw no lag at all and we checked almost at the same time it was reported (as we were troubleshooting together on IRC).

We often have reports of all replicas lagging, which is not reflected anywhere: alerts, graphs etc...this is a recent example for today: https://logstash.wikimedia.org/goto/69912fb95be6eec0f4d8528479c4f28c
That shows lag around 08:20 but there is nothing there: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1585119650273&to=1585128615773

Another remote possibility would be clock drift between the (PHP) host in question and pt-heartbeat, as it appears the lag is currently calculated by subtracting the pt-heartbeat timestamp from the local server's current time.

It is improbable that if there's a time drift between those two..it would be just a few ms, and no more I would say - but I haven't checked it.

Removing the DBA tag as there's no actionable for us here for now, but staying subscribed just in case.

Change 657471 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: use server time in DatabaseMysqlBase::getLagFromPtHeartbeat()

https://gerrit.wikimedia.org/r/657471

Krinkle triaged this task as Low priority.
Krinkle added projects: DBA, Performance-Team.

In a recent pairing session @aaron noticed that with MySQL < 5.6 no longer supported, we no longer need the perform fragile comparison of timestamps that originate from two different servers (app server and db server). See above patch which restores what we had originally some years ago, which is to select with TIMESTAMPDIFF() from the heartbeat tables.

@jcrespo The above might be of interest :-)

In a recent pairing session @aaron noticed that with MySQL < 5.6 no longer supported, we no longer need the perform fragile comparison of timestamps that originate from two different servers (app server and db server). See above patch which restores what we had originally some years ago, which is to select with TIMESTAMPDIFF() from the heartbeat tables.

Supported by who?

https://www.mediawiki.org/wiki/Compatibility still says 5.5.8+, and MysqlInstaller concurs :)

	public static $minimumVersion = '5.5.8';

And it looks like 5.5 hasn't been supported by upstream since December 2018...

DatabaseMysqli.php (Gerrit change 657471)
- // Note: this would use "TIMESTAMPDIFF(MICROSECOND,ts,UTC_TIMESTAMP(6))" but the
- // percision field is not supported in MySQL <= 5.5.

I assumed the functionality was added in a late 5.5.x release that we require and/or that we indeed dropped 5.5.x already. Neither is the case, I traced the UTC_TIMESTAMP parameter to MySQL 5.6.4 (Dec 2011).

[…] it looks like 5.5 hasn't been supported by upstream since December 2018...

Filed T273375.

Change 684001 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@REL1_36] rdbms: Use server time in DatabaseMysqlBase::getLagFromPtHeartbeat()

https://gerrit.wikimedia.org/r/684001

Change 657471 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Use server time in DatabaseMysqlBase::getLagFromPtHeartbeat()

https://gerrit.wikimedia.org/r/657471

Change 684001 merged by jenkins-bot:

[mediawiki/core@REL1_36] rdbms: Use server time in DatabaseMysqlBase::getLagFromPtHeartbeat()

https://gerrit.wikimedia.org/r/684001