Page MenuHomePhabricator

External LBs should not be exposed to developers
Closed, ResolvedPublic

Description

Handling of external databases (vs. core databases) in mw needs rethinking and redesign:

  • What does external in here means? Local db of another wiki can be considered external. In here the code means x1/x2 and esX only which is confusing to devs.
  • It's an infrastructure-specific design (and a good one to scale wikis) that developers don't need to care (beside avoiding joining with core tables)
  • This has led to many extensions dealing with external storage through configuration (each extension has a dedicated configuration for it) and "creative solutions": T314908
  • Its testablity is questionable at best: T314908
  • It can't be properly wired in the updater/installer
  • We are planning to migrate tables out of core tables (like globalimageslinks) and for new usages force devs to use extension clusters as much as possible (=to allow better scalability) and making this easier is important in reaching such goal.

Proposed solution:

  • For tables that are in x1 but in wiki's db (e.g. they are in "arzwiki" database in x1). Introduce concept of "virtual domain" (h/t Tgr in T314908). An attribute to be added to extension.json (like "DatabaseVirtualDomains")
  • Add a new core config: Something like:
$wgVirtualDomainsMapping => [
  'urlshortener' => [ 'cluster' => 'extension1', 'db' => 'wikishared' ]
  'growthexperiments' => [ 'cluster' => 'extension1' ],
  'echo-shared' => [ 'cluster' => 'extension1', 'db' => 'wikishared' ],
  'echo-local' => [ 'cluster' => 'extension1' ],
  // ...
];

Which would mean:

  • LBF::getPrimaryDatabase( 'urlshortener' ) would internally translates into $this->getExternalLB( 'extension1' )->getConnection( DB_PRIMARY, [], 'wikishared' )
  • LBF::getPrimaryDatabase( 'growthexperiments' ) would internally translates into $this->getExternalLB( 'extension1' )->getConnection( DB_PRIMARY, [], false )
  • If the virtual domain is not in $wgVirtualDomainsMapping keys, it goes to MainLB. Think of localhost.

This solution at least in paper allows us to get rid of a lot code duplications (just search for "getExternalLB" in extensions), a lot of configurations, enables better testablaity (possibly marking some domains as external in tests and giving them a different db breaking any attempt to join), and possibly wiring it up properly in DatabaseUpdater.

Open questions:

  • The default behavior would be if the domain doesn't exist in externalDomains, it translates to being a database name (e.g. commonswiki) in mainLB. That is generally fine except for case of "virtual domains" like growthexperiments and echo if it's not set up in config (local setups, tests, 3rd parties, etc.).
  • Suggested solution: configuration of 'externalDomains' => [ 'growthexperiments' => [] ] would mean LBF::getPrimaryDatabase( 'growthexperiments' )will do mainLB but without $domain being passed.
  • or have a dedicated parameter like 'virtualDomains' => [ 'growthexperiments', 'echo_per_wiki' ] that would hint how they need to be handled. I like this one better
  • Regardless of which direction, extensions need to have a to alter that default value (in prod it's not important, this is for tests and local setups and third parties, etc.). I guess a hook would work here? A new attribute in extension.json seems like an overkill.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

This wouldn't handle the case when something needs to read a virtual domain, but for a wiki different from the current one. Not sure if that's a real use case, probably not. There are some examples of doing that for the main domain (e.g. cross-wiki SpamBlacklist lists), an extension needing to do it for its own tables seems unlikely.

Requiring the virtual domains to be declared in some way seems like a good approach. That way, typos can be detected early, instead of causing surprise in production. Everything declared but not configured could still go to the main DB, which is probably what we want for tests. An extension.json attribute seems reasonable to me, both simpler and faster than a hook.

Probably something to be done at a later stage, but how would this interact with the installer? Should I be able to configure in the installer UI where I want my Echo tables to go, for example? (The updater would just work as it is, I think.)

Krinkle renamed this task from External LBs should not be exposed to develoeprs to External LBs should not be exposed to developers.Feb 27 2023, 6:41 PM

I wonder if we should introduce a general new attribute in extension.json like "Database" and then have keys such as "VirtualDomains" and "SchemaUpdates" (T237839#8268173) and such.

I support making external clusters implicit. It feels intuitive and is effectively the same as what we do already with cross-wiki connections, in that getPrimaryDatabase('enwiki') is also transparently routed LBFactoryMulti to the relevant one of the s1-s8 clusters. It fits to do the same for databases that happen to be hosted on a dedicated cluster ("external") such as the case for centralauth and urlshortener.

I'm hestitant to do the same for individual tables that exist on a per-wiki basis databases, such as with growthexperiments and echo. These destinations are ambigious and imho not analogous to the same concept. It seems counter-intuitive for getPrimaryDatabase( 'growthexperiments' ) to not have a single destination (as the name "primary" would imply) but rather route to one of 900 possible destination databases based on the "current" wiki. On top of that, in the particular case of echo we actually happen to utilize both of these two concepts at the same time. There exists a central x1.echo database, but there exist also x1.<wiki_id>.echo_* tables. We could introduce virtual name as intermediary for one of these two, but it would remain a leaky abstraction. We can't hide that these are two different concepts. I suggest recognising these in some way.

The "external" concept was introduced for ExternalStore at first, which uses per-wiki database, and later applied to Echo, which does as well. Perhaps we can phase out use of "external" terminlogy for the case of a single central database, and use a method like getExternalDatabase($virtualCluster, $domain = false) as analogy to LBFactory::getExternalLB for these cases.

See also: T316299: Clarify/merge the concept of external clusters and main section clusters in LBFactory/LoadBalancer

@aaron wrote at T316299:

The difference is still not well explained in comments and I suspect that a lot of code could be merged/simplified, possibly with the restriction the section and external cluster names in config must be collectively unique (seems reasonable).

Ladsgroup added a project: DBA.

First okr of this Q

Change 963291 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Introduce concept of virtual domains and mapping to ext cluster

https://gerrit.wikimedia.org/r/963291

Change 963293 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/UrlShortener@master] USe the new virtual domain mapping system

https://gerrit.wikimedia.org/r/963293

Change 963291 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Introduce concept of virtual domains and mapping to ext cluster

https://gerrit.wikimedia.org/r/963291

Change 963293 merged by jenkins-bot:

[mediawiki/extensions/UrlShortener@master] Use the new virtual domain mapping system

https://gerrit.wikimedia.org/r/963293

Change 963837 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Set virtual domain mapping for url shortener

https://gerrit.wikimedia.org/r/963837

Change 963837 merged by jenkins-bot:

[operations/mediawiki-config@master] Set virtual domain mapping for url shortener

https://gerrit.wikimedia.org/r/963837

Mentioned in SAL (#wikimedia-operations) [2023-10-09T09:55:06Z] <ladsgroup@deploy2002> Started scap: Backport for [[gerrit:963837|Set virtual domain mapping for url shortener (T330590)]]

Mentioned in SAL (#wikimedia-operations) [2023-10-09T10:03:55Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:963837|Set virtual domain mapping for url shortener (T330590)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-10-09T10:10:42Z] <ladsgroup@deploy2002> Finished scap: Backport for [[gerrit:963837|Set virtual domain mapping for url shortener (T330590)]] (duration: 15m 35s)

Ladsgroup moved this task from In progress to Done on the DBA board.

A bit late to think of this but should we require some special prefix for virtual domains (say @urlshortener) to ensure they are non-overlapping with normal domains?

A bit late to think of this but should we require some special prefix for virtual domains (say @urlshortener) to ensure they are non-overlapping with normal domains?

I was thinking of the same but instead of implicit prefixes making it hard what does it mean, I suggest going with "virtual-" instead, like virtual-urlshortener

OTOH, the chance of collision is quite low and we could slowly migrate any non-wiki db to virtual domain (like centralauth, in the mapping you simply don't set the cluster) and that allows everything to be virtual domain without the need for prefix (unless you suspect you will have a wiki named centralauth).

I prefer the latter tbh.

Maybe not centralauth, but for example having a wiki called flow or echo doesn't seem far-fetched. The virtual domain name is controlled by the extension developer, the actual domain names are controlled by the site administrator, so without some kind of explicit namespacing, not much conflict avoidance can be done.

I think a character prefix is more readable and getPrimaryDatabase / getReplicaDatabase would be a natural place to document it; but a text prefix like virtual- would be fine too.

Maybe not centralauth, but for example having a wiki called flow or echo doesn't seem far-fetched. The virtual domain name is controlled by the extension developer, the actual domain names are controlled by the site administrator, so without some kind of explicit namespacing, not much conflict avoidance can be done.

I think a character prefix is more readable and getPrimaryDatabase / getReplicaDatabase would be a natural place to document it; but a text prefix like virtual- would be fine too.

I like this idea. Though, maybe extension developers could have a bit of choice in terms of what prefix used, for example, either "local-", "global-", or "dataset-". Seeing "global-centralauth" and "local-flow" would help to suggest the the developer intended scope of the dataset relative to the current wiki and the wiki farm (e.g. per-wiki vs shared-by-wikis). Some like "dataset-" would be for cases where either make sense.

Change 966633 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: document virtualDomains/virtualDomainsMapping in ILBFactory

https://gerrit.wikimedia.org/r/966633

I like this idea. Though, maybe extension developers could have a bit of choice in terms of what prefix used, for example, either "local-", "global-", or "dataset-". Seeing "global-centralauth" and "local-flow" would help to suggest the the developer intended scope of the dataset relative to the current wiki and the wiki farm (e.g. per-wiki vs shared-by-wikis). Some like "dataset-" would be for cases where either make sense.

A naming convention that differentiates shared and non-shared tables would definitely be very helpful.

Maybe a dedicated ticket? It's a bit of bikeshed honestly on what would be the standard and I don't care much as long as there is a standard.

Change 966633 merged by jenkins-bot:

[mediawiki/core@master] rdbms: document virtualDomains/virtualDomainsMapping in ILBFactory

https://gerrit.wikimedia.org/r/966633