Page MenuHomePhabricator

$wgDBservers does not always equally distribute load
Closed, DeclinedPublic

Assigned To
None
Authored By
charitwo
Mar 7 2020, 2:54 PM
Referenced Files
F31670014: image.png
Mar 7 2020, 5:18 PM
F31670016: image.png
Mar 7 2020, 5:18 PM
F31669888: image.png
Mar 7 2020, 2:54 PM
F31669886: image.png
Mar 7 2020, 2:54 PM

Description

With or without the master having the ability to accept read queries, at least one db averages higher usage. If the master can accept read queries (1:1:1:+), the master is always higher. If it can't (0:1:1:+) one of the replicas is higher than the others and is not equitably distributed between themselves. Here are some screenshots for comparison.

image.png (539×868 px, 132 KB)

image.png (506×883 px, 67 KB)

Event Timeline

Here are metrics from where the master does not accept read queries, you can see where the change was made and the uneven load of read query between the two replicas

image.png (494×877 px, 216 KB)
image.png (515×897 px, 63 KB)

To investigate this, I would need more details:

  • MediaWiki version
  • PHP version, and which db driver(s) are installed.
  • Which of the MediaWiki configuration variables relating to DB and LB are set, and their values (all except specific hostnames/passwords).

Also, given way state is shared between web requests, it the object cache setup may also play a role here:

  • Is APCu installed?
  • What are $wgMainCacheType, $wgMainWANCache and $wgMainStash set to. And if their values are not a default key of $wgObjectCaches, then the value of that array as well.
  • 1.34.0
  • 7.2.24-0ubuntu0.18.04.2 (apache2handler)
  • 10.1.34-MariaDB
<?php
# Single
#require_once( 'LocalSettings.DB-single.php' );
# Replica
require_once( 'LocalSettings.DB-replica2.php' );
<?php

$wgDBname = "ffxiv_wiki";

$wgDBservers = array(
         array(
                 'host' => "",
                 'dbname' => "",
                 'user' => "",
                 'password' => "",
                 'type' => "mysql",
                 'flags' => DBO_SSL,
                 'load' => 0,
         ),
         array(
                 'host' => "",
                 'dbname' => "",
                 'user' => "",
                 'password' => "",
                 'type' => "mysql",
                 'flags' => DBO_SSL,
                 'load' => 1,
         ),
         array(
                 'host' => "",
                 'dbname' => "",
                 'user' => "",
                 'password' => "",
                 'type' => "mysql",
                 'flags' => DBO_SSL,
                 'load' => 1,
         ),

);

Is APCu installed? Yes

<?php
$wgMainCacheType = CACHE_MEMCACHED;
$wgMainWANCache (not set, using default)
$wgMainStash (not set, using default)

Thanks, I'll let CPT triage this further.

There are a few things that might contribute here:

  • With 'load' being 1, if anything happens that causes MW to try to scale back it can only scale back to 0. You might try setting them all to 100 instead.
  • Servers that are lagged are avoided. When this happens, it looks like it gets logged at the "debug" level to the DBReplication channel.
  • I've verified that the selection based on adjusted weights is fair: ArrayUtils::pickRandom() does not show bias.

Further testing on my part will have to wait until I have time to figure out how to set up replication locally to more directly test things. In the mean time, you might try repeatedly using maintenance/eval.php or maintenance/shell.php to execute wfGetDB( DB_REPLICA )->getLBInfo() to more directly measure the selection of connections.

Something something documentation.

https://www.mediawiki.org/wiki/Manual:$wgDBservers doesn't really go into detail how the values for load work other than an example of 0 and 1. So when you suggest something like setting values to 100 it doesn't really convert in my head well.

Yeah, documentation often is lacking. :( And the DB configuration in MediaWiki could probably really use an overhaul, I think there are two or three different systems of configuration thanks to people adding things over the years but never deprecating the old ways.

The real question is whether setting them to 100 makes a difference for you, or if it still behaves the same. That could help point us in the right direction.

Anomie removed Anomie as the assignee of this task.Apr 1 2020, 6:44 PM
Anomie subscribed.

I was unable to confirm this behavior.

When I had two "replicas" and the master all with load 1, 5000 trial connections chose each of the three approximately 1/3 of the time. With two "replicas" at load 1 and the master at 0, 5000 trial connections again chose each replica approximately 1/2 of the time.

Note, BTW, that MediaWiki does not attempt to balance open connections or CPU utilization other than to the extent that those affect replication lag.

[1]: Not "new connections"

CCicalese_WMF claimed this task.
CCicalese_WMF subscribed.

Marking as Resolved as it is in the Done column. Feel free to reopen if there is remaining work.

Aklapper changed the task status from Resolved to Declined.Apr 29 2020, 10:06 AM
Aklapper removed CCicalese_WMF as the assignee of this task.

@CCicalese_WMF: I don't see how "unable to reproduce" means "resolved" hence changing task status.