
MediaWiki should query master instead of replica if replica is too lagged
Closed, Resolved (Public)

Description

I'm not sure if I've understood the problem correctly, so I'll try to provide as much information as possible :)

The wiki:
My wiki runs on two nodes; both have a PHP interpreter as well as a MariaDB database, the first of which is the master and replicates to the second one. The traffic is load-balanced to both PHP interpreters through an Nginx load balancer, and the database traffic is distributed using a db [[ https://github.com/droidwiki/operations-mediawiki-config/blob/master/db.php | configuration with LBFactoryMulti ]] where both MariaDB instances should get equal load.

The state before:
My wiki ran on MediaWiki 1.34.0-wmf.8, and because of some other issues the replication fell behind from time to time, leaving the replica several seconds behind the master (more than the configured 6). This resulted in MediaWiki taking db1002 out of rotation and routing all traffic to the master instance. So far, so good.
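For context, the lag can be checked directly on the replica; a minimal check (hostname and credentials here are only placeholders for my setup) looks roughly like this:

# On the replica (db1002), show how far behind the master it currently is
mysql -h db1002 -u root -p -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'
# MediaWiki compares this lag against the configured maximum (6 seconds in my case)
# and should depool the replica once it is exceeded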

The state now:
After upgrading the wiki to MediaWiki 1.34.0-wmf.11 (a direct upgrade, no intermediate versions), the behavior changed dangerously. Now, when the replication falls behind the configured maximum lag time, MediaWiki no longer seems to take the replica out of rotation. Instead, it seems to try to get a connection to the replica over and over again (at least that's what I read from the logs).

The MediaWiki logger logs the following lines over and over again, hundreds of thousands of times per request:

[2019-07-07 12:17:27] DBConnection.DEBUG: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: Using reader #0: db1001... [] {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}
[2019-07-07 12:17:27] DBReplication.DEBUG: Wikimedia\Rdbms\LoadMonitor::getServerStates: got lag times (global:lag-times:1:db1001:0-1) from local cache [] {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}
[2019-07-07 12:17:27] DBReplication.DEBUG: Wikimedia\Rdbms\LoadMonitor::getServerStates: got lag times (global:lag-times:1:db1001:0-1) from local cache [] {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}
[2019-07-07 12:17:27] DBReplication.DEBUG: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1002 is not replicating? {"host":"db1002"} {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}

Meanwhile, the php-fpm process exhausts the memory limit and gets restarted, resulting in some minor downtime:

[07-Jul-2019 12:34:05] WARNING: [pool www] child 1954 exited on signal 11 (SIGSEGV - core dumped) after 147636.181534 seconds from start
[07-Jul-2019 12:34:05] NOTICE: [pool www] child 29782 started

The Nginx load balancer returns a 500 Internal Server Error to the client after some time, but not directly after php-fpm restarts and cancels the request. This means the user has to wait a long time before an error appears (which could be related to my own configuration, but it makes the problem somewhat worse). It looks like this:

2019/07/07 12:16:52 [error] 20690#20690: *744602 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?title=HTC/Incredible_S&action=history HTTP/1.1", upstream: "fastcgi://172.16.0.2:9000", host: "www.droidwiki.org"
2019/07/07 12:16:59 [error] 20691#20691: *744554 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?title=Datei:Screenshot_Symbol_Samsung.png&action=info HTTP/1.1", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org"
2019/07/07 12:17:05 [error] 20691#20691: *744562 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /wiki/Spezial:Letzte_%C3%84nderungen?hidebots=1&translations=filter&hideWikibase=1&limit=50&days=90&urlversion=2 HTTP/1.0", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org", referrer: "https://www.droidwiki.org/wiki/Spezial:Version"
2019/07/07 12:17:21 [error] 20691#20691: *744577 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?oldid=14353&printable=yes&title=Datei%3ASony_Xperia_Acro_S.jpg&veaction=edit HTTP/1.1", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org"
2019/07/07 12:17:25 [error] 20691#20691: *744581 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57
PHP message: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/exception/MWExceptionHandler.php on line 374" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?action=edit&section=2&title=Samsung/Galaxy_Ace HTTP/1.1", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org"

To be honest, I'm not entirely sure whether this is related to the MediaWiki database component; however, from my investigation it seems like a good starting point for digging deeper into the underlying problem.

I also couldn't find anything in the RELEASE-NOTES that could explain such a problem.

Disclaimer: This whole problem could be caused by my own configuration, which is why I linked it above. However, it would be nice if we could find the change in MediaWiki core that made the problem occur in such a strange way, given that the older wmf version worked fine with the same configuration.

Event Timeline

Krinkle renamed this task from "MediaWiki does not select master if reader lags for viewing pages" to "MediaWiki should query master instead of replica if replica is too lagged". Jul 23 2019, 7:09 PM
Anomie subscribed.

However, it would be nice if we could find the change in the MediaWiki core

Since it seems you can reproduce the problem, you can help by doing this part! You can do this as follows (a condensed shell sketch follows the list):

  1. If your wiki isn't running directly out of a git checkout (e.g. you're using a tarball), arrange for it to run from a git checkout.
  2. Get your replica into the state that reproduces the issue.
  3. In the git checkout, execute git bisect start.
  4. Check out 1.34.0-wmf.8. Verify that the problem is not occurring. Execute git bisect good.
  5. Check out 1.34.0-wmf.11. Verify that the problem is occurring. Execute git bisect bad.
  6. Now git will have automatically checked out a version somewhere in between those two. Test whether the problem occurs in that version, and execute git bisect bad or git bisect good accordingly. If the checked-out version is broken in some way that prevents you from testing it, execute git bisect skip.
  7. Repeat until git identifies the problem commit.
  8. Execute git bisect reset to clean up.
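Condensed into a shell session, assuming the standard wmf/<version> branch layout of the mediawiki/core repository (adjust the refs and paths to however the versions are available locally):

cd /path/to/mediawiki                       # the wiki's git checkout
git bisect start
git checkout origin/wmf/1.34.0-wmf.8        # verify the problem does NOT occur here
git bisect good
git checkout origin/wmf/1.34.0-wmf.11       # verify the problem DOES occur here
git bisect bad
# git now checks out commits in between; test each one and mark it with
# "git bisect good", "git bisect bad" or "git bisect skip" until git names
# the first bad commit, then clean up:
git bisect reset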

If you can do that, it would be really helpful in getting your issue resolved. If not, it'll have to wait until someone else can reproduce it and do the debugging, or until someone just happens to figure out what changed in some other way.

@Anomie You're absolutely right, I did not think about bisecting the problem *thumbsup*

Sorry for the late reply; it was a stressful week and I needed some additional time to get an environment running where I can reproduce the problem without harming my production environment. For future reference (especially for myself), MySQL's CHANGE MASTER TO MASTER_DELAY = seconds; was of great help here to simulate lag on a replica :)
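A minimal sketch of such a simulated lag (hostname, credentials and the delay value are placeholders; note that CHANGE MASTER TO requires replication to be stopped first):

# On the replica, artificially delay replication by 30 seconds to simulate lag
mysql -h db1002 -u root -p <<'SQL'
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 30;
START SLAVE;
SQL
# Set MASTER_DELAY = 0 again afterwards to return to normal replication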

After bisecting a bit, git identified this revision as the first bad one, so it is most likely the cause:
rMW3cbb232c5397: rdbms: merge openConnection into getConnection in LoadBalancer

@aaron The problem seems to be fixed in wmf.14, so I tried to track down the commit that fixed this issue, and git confirmed your guess: the problem was fixed by rMW79d1881eded1: rdbms: avoid recursion in LoadBalancer when the master has non-zero load.