
Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs
Closed, ResolvedPublic

Description

Follow-up from T239874: MediaWiki: "host db1062 is unreachable" (Connection refused).

Specifically, questions to CPT/Perf from DBAs:

  1. Does MediaWiki as used by WMF (wikimedia/rdbms, LBFactoryMulti) support giving the master db non-zero weight in terms of read queries intended for replicas? If so, what would a configuration for that look like?
  2. Is it meant to be valid to have a pooled replica with weight:0? If so, what is its current behaviour in MediaWiki?
    • Are zero-weight replicas waited upon by waitForReplication?
    • If other replicas have too much replication lag and MW is considering enforcing read-only mode, will it consider a zero-weight replica first?
    • If another replica is unavailable / failed to connect and we pick a random different one, does this sometimes pick the zero-weight replica?
    • Are there other cases, aside from waitFor and "regular" queries, where MW will connect to zero-weight replicas to maybe gather certain state information or metadata etc.?

And to DBAs (after the above):

  1. Should we consider changing any of this?

Event Timeline


I would definitely like to be able to give read traffic to the master. It won't happen often; in 3 years I only recall once or twice where we gave some read traffic to the master (usually during emergencies, or when multiple replicas needed to go down for maintenance), but we should have the possibility.

  1. Does MediaWiki as used by WMF (wikimedia/rdbms, LBFactoryMulti) support giving the master db non-zero weight in terms of read queries intended for replicas?

From looking through the code, it looks like the answer is "Yes", unless I missed something; this code is very complex.

I also note that the master appears to always be considered "non-lagged", so it is possible that all traffic would be directed to the master if the master has non-zero load and all replicas are too lagged.

If so, what would a configuration for that look like?

At the MediaWiki level, that would look like assigning a non-zero value in $wgLBFactoryConf['sectionLoads'] and/or $wgLBFactoryConf['groupLoadsBySection'] and/or $wgLBFactoryConf['groupLoadsByDB']. I don't know how that translates into dbctl.
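As a concrete illustration of that, here is a hedged sketch of what giving a master read weight could look like in `$wgLBFactoryConf`. The IPs, section name, and weights are all invented for this example; the shape of the arrays follows the LBFactoryMulti conventions described above.

```php
// Hypothetical sketch: hosts, section, and weights are made up.
// By convention, the first entry in a section's load array is the master.
// Giving it a non-zero weight means a share of replica-intended reads
// would be routed to it (here, 50 out of 450 total).
'sectionLoads' => [
	's1' => [
		'10.64.0.1' => 50,  // master: receives some read traffic
		'10.64.0.2' => 200, // replica
		'10.64.0.3' => 200, // replica
	],
],
```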

For "external" clusters, I think it would still be done by assigning a non-zero value in $wgLBFactoryConf['externalLoads'].

  1. Is it meant to be valid to have a pooled replica with weight:0? If so, what is its current behaviour in MediaWiki?

I don't see any reason why not. The behavior would be to not direct normal traffic to it,[1] but it might still be considered for some operations that are intended to consider all servers.

[1]: There are a few exceptional cases where it might still be used, including explicit request for a particular replica, but I think those are generally manual or maintenance things.

Are zero-weight replicas waited upon by waitForReplication?

No. It skips replicas that have zero weight in all groups.

If other replicas have too much replication lag and MW is considering enforcing read-only mode, will it consider a zero-weight replica first?
If another replica is unavailable / failed to connect and we pick a random different one, does this sometimes pick the zero-weight replica?

No to both, at least for the normal code path for choosing a replica (LoadBalancer::pickReaderIndex()).

Are there other cases, aside from waitFor and "regular" queries, where MW will connect to zero-weight replicas to maybe gather certain state information or meta data etc.?

I see that zero-weight replicas will be queried for lag. Some code paths ignore the returned lag on zero-weight replicas, but others don't.

There may be other cases that I didn't spot.

Should we consider changing any of this?

Not a DBA, but I think we should throw away the MediaWiki load balancer (just the load balancer, not the rdbms layer, the read-write split, and other db-related things). Not because I think it isn't valuable, quite the opposite, but because a) of its inherited complexity (don't take my word for it; one of the best engineers I know says above that "this code is very complex"), and b) PHP's model of sharing information between requests is impossible to overcome by itself, and only works with external hacks in APC/memcached/etc. I would substitute it with external health monitoring (an example would be HAProxy with a custom, unified health check, in a high-availability setup; that is just how I would name it, I don't have any particular technology in mind) providing 3 "services": active master, 0-lag read replica, and all read replicas (for each load group). Benefits of this:

  • Site operators have greater control over what is being queried, and where
  • Reduced complexity: code maintainers are only exposed to a simple "service"; in the past there have been many bugs related to unintended behaviour on failures
  • Health checks are propagated to all instances at once: if a host is unavailable or lagged, there is no need to continuously check health from every app server, just a centralized service plus redundancy
  • Standardized load balancing: HAProxy is a mature technology that everybody understands (even if it is not perfect), simplifying the onboarding of both SREs and code maintainers
  • Very few people fully understand the current logic, and many people are fearful of even reading it
  • It may, at the same time, solve the refactoring needed for the cross-dc fix of the replication lag monitoring, which would need work anyway
  • It may allow in the future the use of clustering technologies for extra automation/automatic failover
  • It is what most other big-traffic sites are doing (substitute HAProxy with your 3rd-party load balancer of choice)
  • It unifies the usage of HAProxy across the whole cluster, rather than requiring special treatment for MediaWiki
  • Automatic failover becomes possible/easier
  • Extra visibility, and the ability to kill connections still using old config
  • Hot swapping of replicas/masters if using an L7 proxy

Risks:

  • Any refactoring is inherently dangerous and costly
  • Less custom behavior is possible, which may be a worse fit
  • Sunk cost fallacy: we have spent person-years fixing load-balancing bugs; are we going to throw everything away?
  • There are no people available to take care of this (but there aren't any to fix load-balancer bugs/features either!); this is not to undervalue the work of many of you here, but it is the reality across all Technology teams

We can't entirely throw away the loadbalancer/lbfactory code, as it's also used to manage connections to different (conceptual) databases. For example, connecting to the current wiki's main DB versus one of the several ExternalStore DBs (two active, several others read-only) versus Commons or Wikidata's DB (rather than making an internal HTTP request) versus the shared DB for CentralAuth data. Even if haproxy can handle routing requests by the MySQL database being connected to (can it?), we'd still probably lose the ability to reuse the same opened connection for all s3 wikis instead of reconnecting a few hundred times as we cycle through them all.

We could, on the other hand, change the WMF configuration to have MediaWiki connect to haproxy instances (one per cluster/section?) instead of to the individual MySQL servers directly. MediaWiki would see just one "replica". I don't know what (if anything) we'd lose by doing that. The ability to wait for all replicas? The ability to poll for maximum or median lag across all the replicas? Then we could potentially deprecate and remove the parts of the loadbalancer that handle multiple replicas and such.

  1. Does MediaWiki as used by WMF (wikimedia/rdbms, LBFactoryMulti) support giving the master db non-zero weight in terms of read queries intended for replicas?

From looking through the code, it looks like the answer is "Yes", unless I missed something; this code is very complex.

As I said, I do recall giving weight to masters before, but only in a few isolated cases.

I don't see any reason why not. The behavior would be to not direct normal traffic to it,[1] but it might still be considered for some operations that are intended to consider all servers.

[1]: There are a few exceptional cases where it might still be used, including explicit request for a particular replica, but I think those are generally manual or maintenance things.

We used to have the vslow,dump group with weight 0 just in case the dumps' heavy queries could create lag.
I do see a reason to keep replicas with weight 0 for that specific purpose; dumps are a good example. We have hosts that are still pooled for a specific kind of traffic that might be OK with some lag.

Are zero-weight replicas waited upon by waitForReplication?

No. It skips replicas that have zero weight in all groups.

What is the case if the replica has 0 weight for "normal" traffic but is pooled in a given group, e.g. API?

If other replicas have too much replication lag and MW is considering enforcing read-only mode, will it consider a zero-weight replica first?
If another replica is unavailable / failed to connect and we pick a random different one, does this sometimes pick the zero-weight replica?

No to both, at least for the normal code path for choosing a replica (LoadBalancer::pickReaderIndex()).

Are there other cases, aside from waitFor and "regular" queries, where MW will connect to zero-weight replicas to maybe gather certain state information or meta data etc.?

I see that zero-weight replicas will be queried for lag. Some code paths ignore the returned lag on zero-weight replicas, but others don't.

So even if they have weight 0 they'd still get checked for lag, even though they would be ignored if there's lag? Will that throw an error in the logs?
E.g.: what if a host is lagging or has MySQL down? Will that be logged as a fatal?

we'd still probably lose the ability to reuse the same opened connection

We could reuse connections at the middleware side; that is what external solutions are great for: complexity abstraction and thread reuse, even if that means installing a proxy on every app server for optimization (or on every rack/row for performance). Again, don't take my suggestion as a literal "this is the only way to go", but as one of many suggestions to "change the model" and "externalize" some code paths into already-existing pieces of software that we don't maintain. We should be smart about what we develop and what we don't, being so limited in human workforce. I am not married to any particular solution (and I knew my proposal was going to be rejected), but we already have many pain points, and really not that many holistic solutions that don't amount to "let me add more complexity to the already complex code". I am trying to help remove some responsibilities from you so you can focus on the important things ("wiki logic" and architecture, not boring infrastructure code, which I myself find interesting).

I see many complaints along the lines of "we don't have time for this", "this is too complex", "nobody fully understands the implications", etc. I am just trying to suggest something that may (?) help with that. Please suggest a better option; there probably is one (I would be happy to hear it!). Your (fair) criticisms are mostly that "the current model doesn't fit an alternative model", but the current model is the reason for lots of the downsides you yourself conceded in your original comment. And I am not suggesting a "big rewrite". This could be a phased approach: no code change plus new infrastructure (exactly as you suggest in your last comment), and once that is battle-tested, trim the unnecessary logic. Or offloading health checks to a centralized system. Or many other options! :-) Even 3rd-party users would probably be happy to hear "if you need to scale, we are 100% compatible with X piece of software" (whatever X is). Proxies and other open source solutions didn't exist 20 or even 10 years ago. MySQL cluster technology was in its infancy back then. Things like Galera Cluster are now standard solutions for deploying databases on Kubernetes. I even hear from some of you "we don't like MySQL for this task because X alternative does things transparently". I am suggesting we start looking around (not necessarily committing to anything in particular) for that externalization of some (very specific) business logic that is not core to what MediaWiki does. People like you, TimS, Krinkle, and Aaron are superheroes for the work you are able to do with these very complex issues, but I think that doesn't scale (just my opinion); I would like to help share that burden with a radical approach.

We used to have the vslow,dump group with weight 0 just in case the dumps' heavy queries could create lag.
I do see a reason to keep replicas with weight 0 for that specific purpose; dumps are a good example. We have hosts that are still pooled for a specific kind of traffic that might be OK with some lag.

That's a little different from what I was talking about. We can have a replica that has zero weight for "general" queries, but non-zero for one of the special groups like 'vslow' or 'dump'.
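To illustrate that distinction, here is a hedged sketch (hosts, section, and weights are invented) of a replica with zero weight for general reads that still carries non-zero weight in the 'vslow' and 'dump' groups:

```php
// Hypothetical sketch: a host that serves no general read traffic but
// stays pooled for the 'vslow' and 'dump' query groups.
'sectionLoads' => [
	's1' => [
		'10.64.0.1' => 0,   // master
		'10.64.0.2' => 200, // general-traffic replica
		'10.64.0.9' => 0,   // zero weight for general reads
	],
],
'groupLoadsBySection' => [
	's1' => [
		'vslow' => [ '10.64.0.9' => 1 ],
		'dump'  => [ '10.64.0.9' => 1 ],
	],
],
```

Per the answer below, such a host has non-zero weight in some group, so it is still waited upon by waitForReplication.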

Are zero-weight replicas waited upon by waitForReplication?

No. It skips replicas that have zero weight in all groups.

What is the case if the replica has 0 weight for "normal" traffic but is pooled in a given group, e.g. API?

It is waited for, because it has non-zero weight for some group.

So even if they have weight 0 they'd still get checked for lag, even though they would be ignored if there's lag? Will that throw an error in the logs?
E.g.: what if a host is lagging or has MySQL down? Will that be logged as a fatal?

It will log an error-level message just as it would for a non-zero-weight replica. Similarly, a warning-level message will be logged if it exceeds the reporting threshold.

As far as uses:

  • LoadBalancer::getMaxLag() ignores hosts with zero weight in all groups when calculating the max lag.
  • LoadBalancer::getRandomNonLagged() (part of the logic for selecting a replica to read from) will log some debug-level messages if the zero-weight replica is excessively lagged.
  • The logic for making the Action API read-only for bot accounts (453d886) if a majority of replicas are too lagged does consider zero-weight replicas.
  • The logic behind https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= reports all replicas regardless of weight.
  • maintenance/lag.php and maintenance/getLagTimes.php print all replicas regardless of weight.

At the MediaWiki level, that would look like assigning a non-zero value in $wgLBFactoryConf['sectionLoads'] and/or $wgLBFactoryConf['groupLoadsBySection'] and/or $wgLBFactoryConf['groupLoadsByDB']. I don't know how that translates into dbctl.

In dbctl terms, this is fairly simple. By design, dbctl's output schema closely matches that of the MediaWiki data structures: see https://wikitech.wikimedia.org/wiki/Dbctl#Schema

To set nonzero weight on a master, one could simply do (for example) dbctl instance db1083 set-weight 100 --section s1
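For completeness, a hedged sketch of what the full operator sequence might look like; the commit step and its flags are from memory, so double-check them against the Dbctl page on Wikitech before use.

```shell
# Hypothetical sequence; verify command names and flags against
# https://wikitech.wikimedia.org/wiki/Dbctl before running.
dbctl instance db1083 set-weight 100 --section s1   # give the s1 master read weight
dbctl config commit -m "Temporarily give read weight to the s1 master"
```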

For "external" clusters, I think it would still be done by assigning a non-zero value in $wgLBFactoryConf['externalLoads'].

I have a question about this, as we're configured this way in production right now. From db-eqiad.php:

$wmgOldExtTemplate = [
	'10.64.0.7'    => 1, # es1012, A2 11TB 128GB
	'10.64.32.185' => 1, # es1016, C2 11TB 128GB
	'10.64.48.115' => 1, # es1018, D1 11TB 128GB
];
'externalLoads' => [
	# Recompressed stores
	'rc1' => $wmgOldExtTemplate,
	# Former Ubuntu dual-purpose stores
	'cluster3' => $wmgOldExtTemplate,
	'cluster4' => $wmgOldExtTemplate,
...

In the case of $wmgOldExtTemplate aka es1, there's no master in MySQL terms -- AIUI they're independent, read-only instances that happen to contain the same data. But -- and please correct me if I am wrong! -- from MediaWiki's perspective, this says that 10.64.0.7/es1012 is the master of this section, and also happens to receive read weight.

So, in this case, does Mediawiki attempt any replication lag checking on the replicas? If so, what happens?

In the case of $wmgOldExtTemplate aka es1, there's no master in MySQL terms -- AIUI they're independent, read-only instances that happen to contain the same data. But -- and please correct me if I am wrong! -- from MediaWiki's perspective, this says that 10.64.0.7/es1012 is the master of this section, and also happens to receive read weight.

There's also this:

'templateOverridesByCluster' => [
    'rc1'       => [ 'is static' => true ],
    'cluster1'  => [ 'blobs table' => 'blobs_cluster1', 'is static' => true ],
    'cluster2'  => [ 'blobs table' => 'blobs_cluster2', 'is static' => true ],
    'cluster3'  => [ 'blobs table' => 'blobs_cluster3', 'is static' => true ],
    'cluster4'  => [ 'blobs table' => 'blobs_cluster4', 'is static' => true ],
    'cluster5'  => [ 'blobs table' => 'blobs_cluster5', 'is static' => true ],
    'cluster6'  => [ 'blobs table' => 'blobs_cluster6', 'is static' => true ],
    'cluster7'  => [ 'blobs table' => 'blobs_cluster7', 'is static' => true ],
    'cluster8'  => [ 'blobs table' => 'blobs_cluster8', 'is static' => true ],
    'cluster9'  => [ 'blobs table' => 'blobs_cluster9', 'is static' => true ],
    'cluster10' => [ 'blobs table' => 'blobs_cluster10', 'is static' => true ],
    'cluster20' => [ 'blobs table' => 'blobs_cluster20', 'is static' => true ],
    'cluster21' => [ 'blobs table' => 'blobs_cluster21', 'is static' => true ],
    'cluster22' => [ 'blobs table' => 'blobs_cluster22', 'is static' => true ],
    'cluster23' => [ 'blobs table' => 'blobs_cluster23', 'is static' => true ],
    'cluster24' => [ 'blobs table' => 'blobs_cluster24' ],
    'cluster25' => [ 'blobs table' => 'blobs_cluster25' ],
],

That "is static" flag tells MediaWiki that the whole DB is a read-only clone rather than an actual replica.

So, in this case, does Mediawiki attempt any replication lag checking on the replicas? If so, what happens?

Thanks to the "is static" flag, no. Or more correctly, instead of actually fetching the lag it just assumes the lag is 0 for all the servers in that cluster.

It's not clear what happens if something tries to wait for the "master" on such a clone. It looks like, in the worst case, it will fetch the position from the clone at index 0 and then time out waiting for the position on the other clones to change.

Ideally I would love to see old, read-only hosts (es1, es2) not being treated as a replication topology by dbctl, but rather as independent hosts (as they sort of are), so we can operate on the master without having to do a "master switchover" in dbctl. (@CDanis: I don't know how feasible that is; it is certainly not urgent. Do you want me to create a task to track it?)

Also, part of the discussion has confirmed that we can indeed set weight on the master if needed.
As for pooled replicas with zero weight, from the comments here it looks like that is allowed and we have some reasons to keep that behaviour.

Anything left on this task?
Thanks everyone!

Ideally I would love to see old, read-only hosts (es1, es2) not being treated as a replication topology by dbctl, but rather as independent hosts (as they sort of are), so we can operate on the master without having to do a "master switchover" in dbctl. (@CDanis: I don't know how feasible that is; it is certainly not urgent. Do you want me to create a task to track it?)

Also, part of the discussion has confirmed that we can indeed set weight on the master if needed.
As for pooled replicas with zero weight, from the comments here it looks like that is allowed and we have some reasons to keep that behaviour.

Anything left on this task?
Thanks everyone!

I have created T245239: dbctl: treat read only ES hosts as standalone hosts to track the above, and I am closing this task.
If someone feels the discussion isn't over, please re-open it.

Thanks everyone for the time and comments!