To make Wikibase more resilient, particularly on Wikidata.org, we need to investigate more ways to split up our SQL DB traffic load.
Background: Over the last years, every now and again something goes wrong, either on clients or on the wikidata.org repo, that overloads the replicas and affects areas that do not themselves have any issues.
For example, a deployment on Wikidata.org ships a bug in the Wikidata editing API that does excessive lookups on replicas, overloading them. This makes all lookups to replicas for Wikidata across the cluster (including en.wikipedia.org) start failing, which in turn leads to fatal errors across all sites.
Any group that our code attaches to its DB calls can be used by DBAs to separate traffic across the DB servers.
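To make the idea of groups concrete, here is a minimal conceptual sketch of group-based replica selection. It is not MediaWiki's LoadBalancer; the replica names, the group names, and the fallback to a default pool when a group has no dedicated replicas are all assumptions made for illustration.

```python
import random

# Hypothetical replica names and group assignments; not the real s8 topology.
GROUP_POOLS = {
    "default": ["db-replica-1", "db-replica-2", "db-replica-3"],
    "dump": ["db-replica-4"],
    "api": [],  # group is known but currently has no dedicated replicas
}


def pick_replica(groups=(), pools=GROUP_POOLS, rng=random):
    """Return (group, server) for the first requested group that has replicas.

    Falls back to the default pool when none of the requested groups has
    any replicas assigned (an assumption of this sketch).
    """
    for group in groups:
        servers = pools.get(group, [])
        if servers:
            return group, rng.choice(servers)
    return "default", rng.choice(pools["default"])
```

In this model the code only labels its queries; which replicas serve a label is a configuration decision, and that is the part DBAs would control.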
Potential directions
Client queries separated from Wikidata repo queries
All Wikidata clients make DB calls to s8 (the database shard in the WMF database cluster that contains the Wikidata database) and share the same pool of replicas as the wikidata.org repo code.
Having some separation between the actual DBs that these calls go to could be a good thing and should be considered.
This would also allow some pool of replicas to concentrate on caching and optimizing for queries that mainly come from clients, as opposed to queries that are always made from both client and repo.
Adding even more groups here, for example separating high-value sites such as en.wikipedia.org, could also be evaluated.
API queries vs non-API queries
Is this already done in Wikibase code? Does MediaWiki make this happen automatically? This needs to be checked.
Special:EntityData is an example of a page that is in some ways an API, but that might use the regular db group anyway?
"terms" related queries vs non "terms" queries
@Ladsgroup claims that most terms queries come from client wikis AND are heavily cacheable, so this distinction could yield promising findings, and the group could be split further if needed.
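The three directions above (client vs repo, API vs non-API, terms vs non-terms) could be combined into labels attached to each query. The sketch below is only an illustration: the function and every group name in it are hypothetical and do not exist in Wikibase or MediaWiki.

```python
def query_groups(is_client: bool, is_api: bool, is_terms: bool) -> list:
    """Return candidate group names for a query, most specific first.

    All group names are hypothetical placeholders for this ticket's ideas.
    """
    side = "client" if is_client else "repo"
    groups = []
    if is_terms:
        groups.append(f"wikibase-{side}-terms")
    if is_api:
        groups.append(f"wikibase-{side}-api")
    # Least specific label last, so it can act as the catch-all for this side.
    groups.append(f"wikibase-{side}")
    return groups
```

For example, `query_groups(True, False, True)` describes a terms lookup from a client wiki and returns `["wikibase-client-terms", "wikibase-client"]`. Ordering from most to least specific lets the finest split be tried first while a coarser group remains available.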
Other notes
Reading
- Example usage of groups in the MW db abstraction https://github.com/wikimedia/mediawiki/blob/5b5a4ff2acdac5997a966f937c485cd45f485ed7/includes/libs/rdbms/loadbalancer/LoadBalancer.php#L447
- One of the few example usages currently in Wikibase code:
- Some setting https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/172925414d408417184a36bdc55e37fffa7c6b94/repo/config/Wikibase.default.php#L287-L288
- Accessed here https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/2d2338312414c0d643823c4dd7d4d5fe6a16e284/repo/maintenance/DumpEntities.php#L243-L265
- Used at the db layer because of https://www.mediawiki.org/wiki/Manual:$wgDBDefaultGroup
- A more common usage pattern would be passing a group to the getConnection method. Example in core: https://github.com/wikimedia/mediawiki/blob/e2b8d05850d85d860cf118d162ae5ec5c61419d9/maintenance/refreshLinks.php#L108
Related previous tickets
- T138208: Connections to all db servers for wikidata as wikiadmin from snapshot, terbium
- T138381: Allow DB group used by ChangeDispatcher to be configured
Expected outcomes & Acceptance Criteria
- Ideas listed in the ticket have been analysed for feasibility
- Other possible ways of dividing load have been considered
- The analysis of each idea is documented on this task
- How to use the groups in all needed areas of code at the right time has been investigated and documented
- We can move forward from the results of this investigation and implement some db groups that will help the DBAs to split load in a way that will help prevent incidents.
- The investigation is to be performed by at least 2 developers. Together they will allocate up to 20 person-hours in total for investigating and documenting the findings on this ticket
