I'm not sure I understand the problem correctly, so I'll try to provide as much information as possible :)
The wiki:
My wiki runs on two nodes. Each node has a PHP interpreter and a MariaDB database. The database on the first node is the master and replicates to the second one. Web traffic is load-balanced across both PHP interpreters by an Nginx load balancer. Database traffic is distributed using a [[ https://github.com/droidwiki/operations-mediawiki-config/blob/master/db.php | configuration with LBFactoryMulti ]] in which both MariaDB instances should get equal load.
The state before:
My wiki ran on MediaWiki 1.34.0-wmf.8. Because of some unrelated issues, replication occasionally fell behind, leaving the replica several seconds behind the master (more than the configured maximum of 6). When that happened, MediaWiki took db1002 out of rotation and routed all traffic to the master instance. So far, so good.
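To make the expected behaviour concrete, here is a rough Python model of what I understand lag-aware reader selection to do. This is a simplified, hypothetical sketch for illustration only, not MediaWiki's actual `LoadBalancer` code. The `Server` class, the `pick_reader` function, the weights, treating an unknown lag as "exclude", and the fallback to the master at index 0 are all my own assumptions.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    load: int             # relative read weight (hypothetical, mirrors a load map)
    lag: Optional[float]  # seconds behind master; None = unknown / not replicating


def pick_reader(servers: List[Server], max_lag: float, rng: random.Random) -> str:
    """Pick a reader by weight, skipping servers whose lag is unknown or too high.

    Assumption: servers[0] is the master (lag 0). It is also the fallback
    when every candidate has been excluded.
    """
    candidates = [
        s for s in servers
        if s.load > 0 and s.lag is not None and s.lag <= max_lag
    ]
    if not candidates:
        # Nothing usable: send reads to the master instead of retrying the replica.
        return servers[0].name
    # Weighted random choice among the remaining servers.
    chosen = rng.choices(candidates, weights=[s.load for s in candidates], k=1)[0]
    return chosen.name


if __name__ == "__main__":
    rng = random.Random(42)
    healthy = [Server("db1001", 1, 0.0), Server("db1002", 1, 2.0)]
    lagging = [Server("db1001", 1, 0.0), Server("db1002", 1, 9.0)]
    print(pick_reader(healthy, max_lag=6, rng=rng))  # either db1001 or db1002
    print(pick_reader(lagging, max_lag=6, rng=rng))  # always db1001
```

In this model, a replica that is too far behind (or whose lag cannot be determined) simply drops out of the candidate list for the request. It is never retried, which is how 1.34.0-wmf.8 appeared to behave on my setup.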
The state now:
After upgrading the wiki to MediaWiki 1.34.0-wmf.11 (a direct upgrade, with no intermediate versions), the behaviour changed in a dangerous way. Now, when replication falls behind the configured maximum lag, MediaWiki no longer seems to take the replica out of rotation. Instead, it seems to try to get a connection to the replica over and over again (at least, that's what I read from the logs).
The MediaWiki logger writes the following lines over and over again. For each request, hundreds of thousands of lines repeat this pattern:
[2019-07-07 12:17:27] DBConnection.DEBUG: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: Using reader #0: db1001... [] {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}
[2019-07-07 12:17:27] DBReplication.DEBUG: Wikimedia\Rdbms\LoadMonitor::getServerStates: got lag times (global:lag-times:1:db1001:0-1) from local cache [] {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}
[2019-07-07 12:17:27] DBReplication.DEBUG: Wikimedia\Rdbms\LoadMonitor::getServerStates: got lag times (global:lag-times:1:db1001:0-1) from local cache [] {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}
[2019-07-07 12:17:27] DBReplication.DEBUG: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1002 is not replicating? {"host":"db1002"} {"url":"/w/index.php?action=info&title=Item_talk%3AQ27","ip":"46.229.168.161","http_method":"GET","server":"data.droidwiki.org","referrer":null,"uid":"6e7715c","process_id":28062,"host":"v22015052656325188","wiki":"datawiki","mwversion":"1.34.0-wmf.11","reqId":"b444e7cf74122be7a40722cf"}
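To quantify how often this repeats per request, a small helper like the following could count the "is not replicating?" lines per `reqId`. This is a hypothetical Python sketch. It assumes the log file has one entry per line and that each entry carries the `reqId` field shown above. The name `count_not_replicating` is my own.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the reqId field in the JSON context appended to each log entry.
REQ_ID = re.compile(r'"reqId":"([^"]+)"')
# Message logged by getRandomNonLagged in the excerpt above.
MARKER = "is not replicating?"


def count_not_replicating(lines: Iterable[str]) -> Counter:
    """Count 'is not replicating?' entries per request ID."""
    counts: Counter = Counter()
    for line in lines:
        if MARKER in line:
            match = REQ_ID.search(line)
            counts[match.group(1) if match else "<no reqId>"] += 1
    return counts
```

For example, `with open("debug.log", encoding="utf-8") as fh: print(count_not_replicating(fh).most_common(10))` would list the ten requests with the most repetitions. The file name `debug.log` is a placeholder for wherever the DBReplication channel is written.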
Meanwhile, the php-fpm process exhausts its memory limit and gets restarted, resulting in some minor downtime:
[07-Jul-2019 12:34:05] WARNING: [pool www] child 1954 exited on signal 11 (SIGSEGV - core dumped) after 147636.181534 seconds from start
[07-Jul-2019 12:34:05] NOTICE: [pool www] child 29782 started
After some time, the Nginx load balancer returns a 500 Internal Server Error to the client. This does not happen immediately after php-fpm restarts and cancels the request, so the user has to wait a long time before an error appears. This could be related to my own configuration, but it makes the problem a bit worse. The Nginx error log looks like this:
2019/07/07 12:16:52 [error] 20690#20690: *744602 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?title=HTC/Incredible_S&action=history HTTP/1.1", upstream: "fastcgi://172.16.0.2:9000", host: "www.droidwiki.org"
2019/07/07 12:16:59 [error] 20691#20691: *744554 FastCGI sent in stderr: "PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?title=Datei:Screenshot_Symbol_Samsung.png&action=info HTTP/1.1", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org"
2019/07/07 12:17:05 [error] 20691#20691: *744562 FastCGI sent in stderr: "PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /wiki/Spezial:Letzte_%C3%84nderungen?hidebots=1&translations=filter&hideWikibase=1&limit=50&days=90&urlversion=2 HTTP/1.0", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org", referrer: "https://www.droidwiki.org/wiki/Spezial:Version"
2019/07/07 12:17:21 [error] 20691#20691: *744577 FastCGI sent in stderr: "PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?oldid=14353&printable=yes&title=Datei%3ASony_Xperia_Acro_S.jpg&veaction=edit HTTP/1.1", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org"
2019/07/07 12:17:25 [error] 20691#20691: *744581 FastCGI sent in stderr: "PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/libs/objectcache/APCUBagOStuff.php on line 57 PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /data/mediawiki/main/includes/exception/MWExceptionHandler.php on line 374" while reading response header from upstream, client: 172.16.0.1, server: www.droidwiki.org, request: "GET /w/index.php?action=edit&section=2&title=Samsung/Galaxy_Ace HTTP/1.1", upstream: "fastcgi://172.16.0.1:9000", host: "www.droidwiki.org"
To be honest, I'm not entirely sure whether this is related to the MediaWiki database component. However, based on my investigation, it seems like a good starting point for digging deeper into the underlying problem.
I also couldn't find anything in the RELEASE-NOTES that could explain such a problem.
Disclaimer: This whole problem could be caused by my own configuration, which is why I linked it above. However, the older wmf version worked fine with the same configuration. It would be nice to find the change in MediaWiki core that makes the problem manifest in such a strange way.