With or without the master having the ability to accept read queries, at least one db averages higher usage. If the master can accept read queries (1:1:1:+), the master is always higher. If it can't (0:1:1:+) one of the replicas is higher than the others and is not equitably distributed between themselves. Here are some screenshots for comparison.
Description
Event Timeline
Here are metrics from where the master does not accept read queries, you can see where the change was made and the uneven load of read query between the two replicas
To investigate this, I would need more details:
- MediaWiki version
- PHP version, and which db driver(s) are installed.
- Which of the MediaWiki configuration variables relating to DB and LB are set, and their values (all except specific hostnames/passwords).
Also, given way state is shared between web requests, it the object cache setup may also play a role here:
- Is APCu installed?
- What are $wgMainCacheType, $wgMainWANCache and $wgMainStash set to. And if their values are not a default key of $wgObjectCaches, then the value of that array as well.
- 1.34.0
- 7.2.24-0ubuntu0.18.04.2 (apache2handler)
- 10.1.34-MariaDB
<?php # Single #require_once( 'LocalSettings.DB-single.php' ); # Replica require_once( 'LocalSettings.DB-replica2.php' );
<?php $wgDBname = "ffxiv_wiki"; $wgDBservers = array( array( 'host' => "", 'dbname' => "", 'user' => "", 'password' => "", 'type' => "mysql", 'flags' => DBO_SSL, 'load' => 0, ), array( 'host' => "", 'dbname' => "", 'user' => "", 'password' => "", 'type' => "mysql", 'flags' => DBO_SSL, 'load' => 1, ), array( 'host' => "", 'dbname' => "", 'user' => "", 'password' => "", 'type' => "mysql", 'flags' => DBO_SSL, 'load' => 1, ), );
Is APCu installed? Yes
<?php $wgMainCacheType = CACHE_MEMCACHED; $wgMainWANCache (not set, using default) $wgMainStash (not set, using default)
There are a few things that might contribute here:
- With 'load' being 1, if anything happens that causes MW to try to scale back it can only scale back to 0. You might try setting them all to 100 instead.
- Servers that are lagged are avoided. When this happens, it looks like it gets logged at the "debug" level to the DBReplication channel.
- I've verified that the selection based on adjusted weights is fair: ArrayUtils::pickRandom() does not show bias.
Further testing on my part will have to wait until I have time to figure out how to set up replication locally to more directly test things. In the mean time, you might try repeatedly using maintenance/eval.php or maintenance/shell.php to execute wfGetDB( DB_REPLICA )->getLBInfo() to more directly measure the selection of connections.
Something something documentation.
https://www.mediawiki.org/wiki/Manual:$wgDBservers doesn't really go into detail how the values for load work other than an example of 0 and 1. So when you suggest something like setting values to 100 it doesn't really convert in my head well.
Yeah, documentation often is lacking. :( And the DB configuration in MediaWiki could probably really use an overhaul, I think there are two or three different systems of configuration thanks to people adding things over the years but never deprecating the old ways.
The real question is whether setting them to 100 makes a difference for you, or if it still behaves the same. That could help point us in the right direction.
I was unable to confirm this behavior.
When I had two "replicas" and the master all with load 1, 5000 trial connections chose each of the three approximately 1/3 of the time. With two "replicas" at load 1 and the master at 0, 5000 trial connections again chose each replica approximately 1/2 of the time.
Note, BTW, that MediaWiki does not attempt to balance open connections or CPU utilization other than to the extent that those affect replication lag.
[1]: Not "new connections"
Marking as Resolved as it is in the Done column. Feel free to reopen if there is remaining work.
@CCicalese_WMF: I don't see how "unable to reproduce" means "resolved" hence changing task status.



