Investigate a different db load groups for wikidata / wikibase
Closed, Resolved · Public

Description

In order to make Wikibase more resilient, particularly on Wikidata.org, we need to investigate more ways to split up our SQL DB traffic load to allow things to be more resilient.

Background: Over the last few years, every now and again something goes wrong either on the clients or on the wikidata.org repo that overloads the replicas and affects areas that do not necessarily have any issues of their own.

For example, a deployment on Wikidata.org introduces a bug in the wikidata editing API that does excessive lookups on the replicas, overloading them. This in turn makes all lookups to the wikidata replicas across the cluster (including from en.wikipedia.org) start failing, which in turn leads to fatal errors across all sites.

Any and all groups that are added to the DB calls via our code can be used by DBAs to separate traffic to the DB servers.
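
To make the mechanism concrete, here is a minimal sketch (not existing Wikibase code; the group name 'wikibase-items-per-site' is invented purely for illustration) of how a call site asks MediaWiki's load balancer for a replica connection belonging to a named query group. If the DBAs have not mapped the group to any hosts, MediaWiki falls back to the generic replica pool, so adding groups in code is cheap.

use MediaWiki\MediaWikiServices;

// Ask the load balancer for a replica from the (hypothetical) group;
// falls back to the normal replica pool if the group is not configured.
$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
$dbr = $lb->getConnectionRef( DB_REPLICA, [ 'wikibase-items-per-site' ] );

// Example read that could be tagged with such a group (sitelink lookup).
$sitePage = $dbr->selectField(
    'wb_items_per_site',
    'ips_site_page',
    [ 'ips_item_id' => 42, 'ips_site_id' => 'enwiki' ],
    __METHOD__
);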

Potential directions

Client queries separated from wikidata repo queries

All wikidata clients make db calls to s8 (the database shard in the WMF database cluster that contains the Wikidata database) and share the same pool of replicas as the wikidata.org repo code.
Having some separation between the actual dbs that these calls go to could be a good thing and should be considered.
This would also allow one pool of replicas to concentrate on caching and optimizing for queries that are mainly run from clients, as opposed to requests that are made from both client and repo.

The possibility of adding even more groups here, such as separating out high-value sites like en.wikipedia.org, could also be evaluated.

API queries vs non-API queries

Is this already done in Wikibase code? Does MediaWiki already do this automatically? This needs to be checked.

Special:EntityData is an example of a page that is in some ways an API, but that might use the regular db group anyway?
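
For reference, a hedged sketch of the existing convention in MediaWiki core as I understand it: API query modules obtain their replica connection with the 'api' query group, which the DBAs can already route separately. Whether Wikibase's own API modules and Special:EntityData end up on that group is exactly the thing to check.

// The 'api' query-group convention (my reading of core; to be verified as
// part of this investigation before relying on it).
$dbr = wfGetDB( DB_REPLICA, 'api' );
$row = $dbr->selectRow(
    'page',
    [ 'page_id', 'page_latest' ],
    [ 'page_namespace' => 0, 'page_title' => 'Q42' ],
    __METHOD__
);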

"terms" related queries vs non "terms" queries

@Ladsgroup claims that most terms queries come from client wikis AND are heavily cacheable, so this distinction could yield some promising findings, and could be split further if needed.

Other notes

Reading

Related previous tickets

Expected outcomes & Acceptance Criteria

  • Ideas listed in the ticket have been analysed for feasibility
  • Other possible ways of dividing load have been considered
  • The analysis of each idea is documented on this task
  • The ways to apply the groups in all the needed areas of code at the right time have been investigated and documented
  • We can move forward from the results of this investigation and implement some db groups that will help the DBAs to split load in a way that will help prevent incidents.
  • The investigation is to be performed by at least 2 developers. Together they will allocate up to 20 person-hours (in total) for investigating and documenting the findings on this ticket.

Event Timeline

Addshore renamed this task from Investigate a different db load group for wikidata client queries compared with wikidata repo queries. to Investigate a different db load groups for wikidata. — Feb 28 2020, 8:23 AM
Addshore renamed this task from Investigate a different db load groups for wikidata to Investigate a different db load groups for wikidata / wikibase.
Addshore added a project: DBA.
Addshore updated the task description.
Addshore added a project: User-Addshore.

I just checked with the DBAs to confirm that there are currently "no per-domain logs or stats of db queries - those would be on the application side, dbs don't know the request urls".
That would likely have to be done in the application. At the db level this can only really be separated per db, user or query pattern.

The "Investigate" part of this ticket title was indeed to try to make some data driven decisions rather than plow on with adding groups that might make no difference.

Thus I am also reaching out to the Performance-Team on IRC about this topic to see what they would advise.
Will report back when I have something.

Addshore triaged this task as Medium priority.

Change 593335 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] WIP DNM DRAFT, potential way to track the DB usage for wikibase

https://gerrit.wikimedia.org/r/593335

Marostegui added a subscriber: Marostegui.

Removing the DBA tag as there is nothing for us at the moment. Staying subscribed to the task, in case I am needed.
Feel free to add the tag back when needed.

Change 593335 abandoned by Addshore:
[mediawiki/extensions/Wikibase@master] WIP DNM DRAFT, potential way to track the DB usage for wikibase

Reason:

https://gerrit.wikimedia.org/r/593335

@Addshore and @Ladsgroup to share pointers to the existing ways of getting the data possibly relevant for deciding on the desired approach

What we have done here:
We went through the performance schema of the replicas in codfw since the switchover (September 2, 17 days ago) to see what the nature of the queries going to the replicas is. The total latency of selects on the s8 replicas is 57,712,491 seconds (roughly 100 weeks). The load, broken down by the nature of the query, is:

term_store: 56.38%
rest_of_wikibase: 6.63%
rc: 0.24%
revision: 22.36%
The rest of mediawiki: 9.08%
outside mediawiki: 5.31%

This can be broken down further; let us know if you want more, like what the top resource-consuming queries are or what "the rest of mediawiki" consists of, ...

We also put up a patch that adds statsd metrics for the term store, to see how much of the pressure on s8 is coming from clients or replicas.

s8 is coming from clients or replicas.

clients or replicas? I don't quite follow this bit?

s8 is coming from clients or replicas.

clients or replicas? I don't quite follow this bit?

s/replicas/repos/

I wrote replicas so many times my brain got fried. Sorry.
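
(For illustration only, not the actual patch mentioned above: a minimal sketch of what per-wiki statsd counting of term store reads could look like. The metric name is made up.)

use MediaWiki\MediaWikiServices;

// Hypothetical metric: count term store selects per originating wiki, so a
// dashboard can show how much of the s8 term store pressure comes from
// client wikis versus the repo.
$stats = MediaWikiServices::getInstance()->getStatsdDataFactory();
$stats->increment( 'wikibase.term_store.select.' . WikiMap::getCurrentWikiId() );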

Top queries that put more than 1% of the total load on the replicas:

Termbox:
23690016.0 (41.05%) SELECT `wbtl_id` , `wbtl_type_id` , `wbxl_language` , `wbx_text` , `wbit_item_id` FROM `wbt_term_in_lang` JOIN `wbt_text_in_lang` ON ( ( `wbtl_text_in_lang_id` = `wbxl_id` ) ) JOIN `wbt_text` ON ( ( `wbxl_text_id` = `wbx_id` ) ) JOIN `wbt_item_terms` ON ( ( `wbit_term_in_lang_id` = `wbtl_id` ) ) WHERE `wbit_item_id` IN (...) AND `wbtl_type_id` = ? AND `wbxl_language` = ? 
6320160.0 (10.95%) SELECT `wbtl_id` , `wbtl_type_id` , `wbxl_language` , `wbx_text` , `wbit_item_id` FROM `wbt_term_in_lang` JOIN `wbt_text_in_lang` ON ( ( `wbtl_text_in_lang_id` = `wbxl_id` ) ) JOIN `wbt_text` ON ( ( `wbxl_text_id` = `wbx_id` ) ) JOIN `wbt_item_terms` ON ( ( `wbit_term_in_lang_id` = `wbtl_id` ) ) WHERE `wbit_item_id` = ? AND `wbtl_type_id` = ? AND `wbxl_language` = ? 
2288736.0 (3.966%) SELECT `wbtl_id` , `wbtl_type_id` , `wbxl_language` , `wbx_text` , `wbit_item_id` FROM `wbt_term_in_lang` JOIN `wbt_text_in_lang` ON ( ( `wbtl_text_in_lang_id` = `wbxl_id` ) ) JOIN `wbt_text` ON ( ( `wbxl_text_id` = `wbx_id` ) ) JOIN `wbt_item_terms` ON ( ( `wbit_term_in_lang_id` = `wbtl_id` ) ) WHERE `wbit_item_id` = ? AND `wbtl_type_id` = ? AND `wbxl_language` IN (...)

Rest of wikibase:
2736288.0 (4.741%) SELECT `ips_site_id` , `ips_site_page` , `ips_item_id` FROM `wb_items_per_site` WHERE `ips_item_id` = ? AND `ips_site_id` = ?

Queries involving revision table:
5337792.0 (9.249%) SELECT `rev_id` , `rev_page` , `rev_timestamp` , `rev_minor_edit` , `rev_deleted` , `rev_len` , `rev_parent_id` , `rev_sha1` , `comment_rev_comment` . `comment_text` AS `rev_comment_text` , `comment_rev_comment` . `comment_data` AS `rev_comment_data` , `comment_rev_comment` . `comment_id` AS `rev_comment_cid` , `actor_rev_user` . `actor_user` AS `rev_user` , `actor_rev_user` . `actor_name` AS `rev_user_text` , `temp_rev_user` . `revactor_actor` AS `rev_actor` , `page_namespace` , `page_title` , `page_id` , `page_latest` , `page_is_redirect` , `page_len` , `user_name` FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ( ( `temp_rev_comment` . `revcomment_rev` = `rev_id` ) ) JOIN `comment` `comment_rev_comment` ON ( ( `comment_rev_comment` . `comment_id` = `temp_rev_comment` . `revcomment_comment_id` ) ) JOIN `revision_actor_temp` `temp_rev_user` ON ( ( `temp_rev_user` . `revactor_rev` = `rev_id` ) ) JOIN `actor` `actor_rev_user` ON ( ( `actor_rev_user` . `actor_id` = 
5085504.0 (8.812%) SELECT `rev_id` , `rev_timestamp` , `page_latest` , `page_is_redirect` , `page_title` FROM `page` INNER JOIN `revision` ON ( ( `page_latest` = `rev_id` ) ) WHERE ( `page_title` = ? AND `page_namespace` = ? ) 
1920672.0 (3.328%) SELECT `page_namespace` , `page_title` , `page_id` , `page_latest` , `page_is_redirect` , `page_len` FROM `revision` JOIN `page` ON ( ( `page_id` = `rev_page` ) ) WHERE `rev_id` = ? LIMIT ?

Rest of core:
3577824.0 (6.199%) SELECT `slot_revision_id` , `slot_content_id` , `slot_origin` , `slot_role_id` , `content_size` , `content_sha1` , `content_address` , `content_model` FROM `slots` JOIN `content` ON ( ( `slot_content_id` = `content_id` ) ) WHERE `slot_revision_id` = ?

Misc:
2718288.0 (4.71%) SELECT `ts` FROM `heartbeat` . `heartbeat` WHERE `shard` = ? AND `datacenter` = ? ORDER BY `ts` DESC LIMIT ?

These queries in total are responsible for 90% of the load on the replicas.

Some logical conclusions so far:

  • Splitting based on rc table doesn't make sense, let's just drop that.
  • 9% of the total load on the db is just figuring out what the most recent revision id of an entity is. I assume we can cache this better (out of scope of this ticket, just interesting)
  • 56% of the total load is the term store (and it's heavily cached: what reaches the database is only about 1% of the actual read volume, which is scary — see the cache-layering sketch below this list).
  • I'm working on finding out where the term store load is coming from; it's very likely the clients, but let's wait until we have the actual data
  • wb_items_per_site is being read pretty heavily, I assume by clients when getting the sitelinks of an item. To be confirmed with statsd
  • 10% is Special:Contributions, that's weird. To be checked in more depth
  • 6% of the load is just trying to figure out how to load a revision (so it can connect to ES and load it)
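
A sketch of the cache layering referred to in the second bullet above: the generic MediaWiki cache-aside pattern with WANObjectCache (backed by memcached, possibly with APCu in front). This is not the exact Wikibase term store code path; key names and values are invented.

use MediaWiki\MediaWikiServices;

$services = MediaWikiServices::getInstance();
$cache = $services->getMainWANObjectCache();

// The cache absorbs repeat reads; the callback below only runs on a cache
// miss, i.e. the ~1% of term reads that actually reaches the replicas.
$label = $cache->getWithSetCallback(
    $cache->makeKey( 'wikibase-term', 'Q42', 'label', 'en' ), // hypothetical key
    86400, // one day
    static function () use ( $services ) {
        $dbr = $services->getDBLoadBalancer()->getConnectionRef( DB_REPLICA );
        return $dbr->selectField(
            [ 'wbt_item_terms', 'wbt_term_in_lang', 'wbt_text_in_lang', 'wbt_text' ],
            'wbx_text',
            [ 'wbit_item_id' => 42, 'wbtl_type_id' => 1, 'wbxl_language' => 'en' ],
            __METHOD__,
            [],
            [
                'wbt_term_in_lang' => [ 'JOIN', 'wbit_term_in_lang_id = wbtl_id' ],
                'wbt_text_in_lang' => [ 'JOIN', 'wbtl_text_in_lang_id = wbxl_id' ],
                'wbt_text' => [ 'JOIN', 'wbxl_text_id = wbx_id' ],
            ]
        );
    }
);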

While I'm here, I did a quick check on the write pressure on the master. And boy, I am not disappointed:

  • 6% of all of the write pressure is just this:
95904.00000000001 (6.519%) UPDATE `user` SET `user_editcount` = `user_editcount` + ? WHERE `user_id` = ? AND ( `user_editcount` IS NOT NULL )

In other words, updating the user edit count takes 6% of all writes on the master. This is twice the write pressure of the term store!!! I will file a bug for that.

  • Change dispatching costs us 5% of the write pressure; maybe we can optimize it a bit:
76608.0 (5.208%) UPDATE `wb_changes_dispatch` SET `chd_site` = ? , `chd_db` = ? , `chd_seen` = ? , `chd_touched` = ? , `chd_lock` = ? WHERE `chd_site` = ?

As for read pressure on the master: the term store is 2% (nice), 11% is wb_changes_subscription (maybe needs some improvements), and the rest is core trying to find things in order to save an edit. Reducing read pressure on the master is important because it's not scalable (you can't buy more masters).

I'm sorry for going a bit off-topic, this is extremely interesting.

Thank you for all this work Amir, this is really interesting. I am curious about a few things, and if you'd have the time, can you explain further?

Splitting based on rc table doesn't make sense, let's just drop that.

Why doesn't it make sense?

56% of the total load is the term store (and it's heavily cached: what reaches the database is only about 1% of the actual read volume, which is scary).
I'm working on finding out where the term store load is coming from; it's very likely the clients, but let's wait until we have the actual data

I guess those two are related, but what do you mean by the client?

6% of all of the write pressure is just this:
95904.00000000001 (6.519%) UPDATE user SET user_editcount = user_editcount + ? WHERE user_id = ? AND ( user_editcount IS NOT NULL )
In other words, updating the user edit count takes 6% of all writes on the master. This is twice the write pressure of the term store!!! I will file a bug for that.

This is quite crazy indeed. Can't this be done thru a queue or something?

Thank you for this interesting work!

Thank you for all this work Amir, this is really interesting. I am curious about a few things, and if you'd have the time, can you explain further?

Splitting based on rc table doesn't make sense, let's just drop that.

Why doesn't it make sense?

I mean mostly in the context of having a dedicated node for it (as it's only 1% of the read load, it would make sense if we had 100 replicas) but maybe redirecting most queries to one node (that would also handle lots of other things) would make sense too.

56% of the total load is the term store (and it's heavily cached: what reaches the database is only about 1% of the actual read volume, which is scary).
I'm working on finding out where the term store load is coming from; it's very likely the clients, but let's wait until we have the actual data

I guess those two are related, but what do you mean by the client?

Client is any wiki that's not the repo (wikidata/commons), like English Wikipedia. Actually, during the migration of the term store, we found out most of the reads on s8 come from clients and not wikidata itself (as they read wikidata's database directly).

6% of all of the write pressure is just this:
95904.00000000001 (6.519%) UPDATE user SET user_editcount = user_editcount + ? WHERE user_id = ? AND ( user_editcount IS NOT NULL )
In other words, updating the user edit count takes 6% of all writes on the master. This is twice the write pressure of the term store!!! I will file a bug for that.

This is quite crazy indeed. Can't this be done thru a queue or something?

Yup, we have a similar thing for updating the site_stats table. I assume we can reuse that.

Thank you for all this work Amir, this is really interesting. I am curious about a few things, and if you'd have the time, can you explain further?

Splitting based on rc table doesn't make sense, let's just drop that.

Why doesn't it make sense?

I mean mostly in the context of having a dedicated node for it (as it's only 1% of the read load, it would make sense if we had 100 replicas) but maybe redirecting most queries to one node (that would also handle lots of other things) would make sense too.

Gotcha. We also have to keep in mind that 1 host in reality means 2 (for HA), and in both DCs.

56% of the total load is the term store (and it's heavily cached: what reaches the database is only about 1% of the actual read volume, which is scary).
I'm working on finding out where the term store load is coming from; it's very likely the clients, but let's wait until we have the actual data

I guess those two are related, but what do you mean by the client?

Client is any wiki that's not the repo (wikidata/commons), like English Wikipedia. Actually, during the migration of the term store, we found out most of the reads on s8 come from clients and not wikidata itself (as they read wikidata's database directly).

Ah, understood! Thank you :)

6% of all of the write pressure is just this:
95904.00000000001 (6.519%) UPDATE user SET user_editcount = user_editcount + ? WHERE user_id = ? AND ( user_editcount IS NOT NULL )
In other words, updating the user edit count takes 6% of all writes on the master. This is twice the write pressure of the term store!!! I will file a bug for that.

This is quite crazy indeed. Can't this be done thru a queue or something?

Yup, we have a similar thing for updating the site_stats table. I assume we can reuse that.

That'd be nice, +1

Thank you again
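
(A minimal sketch of the pattern being agreed on above, assuming the same idea as the site_stats mechanism: move the user_editcount bump out of the main save transaction into a deferred update. This is not an existing core change; a real fix would also batch/aggregate increments.)

use MediaWiki\MediaWikiServices;

$userId = 12345; // hypothetical user id

// Run the counter bump after the response has been sent, off the critical
// path of the edit; aggregation across edits would reduce writes further.
DeferredUpdates::addCallableUpdate( static function () use ( $userId ) {
    $dbw = MediaWikiServices::getInstance()->getDBLoadBalancer()
        ->getConnectionRef( DB_MASTER );
    $dbw->update(
        'user',
        [ 'user_editcount = user_editcount + 1' ], // raw SQL increment
        [ 'user_id' => $userId, 'user_editcount IS NOT NULL' ],
        __METHOD__
    );
} );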

On average 60% of term store queries come from client wikis:

image.png (652×1 px, 193 KB)

Okay, let me give an overview of what has been worked on so far; this is basically a summary of the previous comments, reshaped:

  • Queries to term store are basically half of the load.
    • Given that only about 1% of term store queries actually reach the db (the rest are absorbed by memcached and APCu), I repeat, this is scary.
  • At least half of our load is from client wikis but probably no more than 70% (getting the exact number is impossible)
  • Finding out how many of the queries are coming from the API appservers (e.g. the ratio of open connections) can be done by the DBAs, I assume; I can't find any other way to determine the API/non-API context of queries.
    • Also Special:EntityData doesn't matter in that regard because most of the heavy lifting is to ES and not s8
  • There's a trade-off: if we want to achieve resiliency (e.g. by separating portions of the queries to a dedicated node), it would decrease the scalability and flexibility of the infra. I recommend getting the DBAs' opinion on the decision too.
  • Load should not be the only factor in building groups; the type of queries and the tables they touch matter too. For example, recentchanges is only 1% of the load on s8 (measured as total latency, not number of queries), so it doesn't make sense to have a dedicated node (per DC) for it, but localizing those queries to one node makes sense so they can take advantage of MySQL caching (of course, that node should not be handling only rc queries, otherwise it would be a huge waste of resources).

My personal preference here is grouping the term store, because it will be cached better in MySQL, its load is not negligible, and it's pretty straightforward to implement (a client/repo split, by contrast, is not easy to implement, given that most of these queries hit lib/ instead, or actually come from mediawiki core, where we have no control over the destination of the queries). My other thought is that in mid-term planning (like in 5 years), all of the term store could be moved to its own dedicated section, which would drastically increase the scalability of wikidata's database in terms of read/write/storage; such a thing can't be achieved via a repo/client split. OTOH, the resilience that a repo/client split could bring won't be achieved, because client and repo queries won't be separated and issues originating from one would affect the other. OTOOH (I have three hands), most of the load on s8 comes from mw core and lib code (not repo/client code), so bugs and issues in lib will bring down both anyway even if we separate them.
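
For illustration, this is roughly what the DBA-side counterpart of a term store group could look like, assuming WMF's LBFactoryMulti configuration and its groupLoadsBySection option; the group name, host names and weights are all invented:

// Hypothetical fragment of $wgLBFactoryConf: route a 'wikibase-terms'
// query group on s8 to two dedicated replicas (two hosts per DC for HA,
// as the DBAs note elsewhere on this task).
$wgLBFactoryConf['groupLoadsBySection']['s8']['wikibase-terms'] = [
    'db1XXX' => 100,
    'db1YYY' => 100,
];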

I'm looking into this, trying to understand it and get an idea whether I'm able to contribute something and thus should pick it up. But I'm currently still confused and have some questions:

lots of them actually coming from mediawiki core and we have no control over destination of queries originating from there

But that means that this is not coming from within Wikibase code, right? Is it referring to requests that originate from the Mediawiki core instance that runs Wikidata.org, like displaying the main page or running a normal wiki search?

we found out most of the reads on s8 come from clients and not wikidata itself (as they read wikidata's database directly).

I thought s8 is the replica of the "wikidata's database"?

e.g. client/repo is not easy to implement given that most of them hit lib/ instead

That doesn't look like a feature to me. Shouldn't code in lib/ be handed a db connection to use instead of creating its own? Though that might be a refactoring task that exceeds the scope of this ticket. But maybe not? 🤔

about term store:
If 56% of the load is term store in total and that load is split roughly 60/40 between clients and repo, then that means that ca. 33% of load comes from client term store requests and 23% from repo term store requests. How do term store requests from commons fit into this?

Can client wikis on one hand and repo/wikidata.org on the other hand function without the term store data, maybe by gracefully falling back to QIDs everywhere? If yes, then groups splitting termstore/not termstore seem like they would make sense. But if everything goes down if the term store is down, then a split clients/repo would seem more useful. Does that make sense?

about other types of requests:
Also, there seem to be some requests that maybe don't make up much load but still (sadly) lead to fatal errors if they don't succeed, like sitelinks. (Judging from everything going down when that table was deleted.)
We should probably make sure they are grouped into the "critical cluster" as well?

Hope these answers help.

lots of them actually coming from mediawiki core and we have no control over destination of queries originating from there

But that means that this is not coming from within Wikibase code, right? Is it referring to requests that originate from the Mediawiki core instance that runs Wikidata.org, like displaying the main page or running a normal wiki search?

Take for example the code that loads an entity to look up some data: it has to hit the revision lookup, and the revision lookup in core, depending on cache hit or miss, might query the database (or not). Also, normal core operations like checking recentchanges are included in the core queries.

we found out most of the reads on s8 come from clients and not wikidata itself (as they read wikidata's database directly).

I thought s8 is the replica of the "wikidata's database"?

It is the replica of wikidata's database, but it is mostly read by requests going to client wikis. For example, if you need to reparse the article on Alan Turing in Basque Wikipedia (because you made an edit in it or the parser cache expired), that actually hits s8 quite heavily.

e.g. client/repo is not easy to implement given that most of them hit lib/ instead

That doesn't look like a feature to me. Shouldn't code in lib/ be handed a db connection to use instead of creating its own? Though that might be a refactoring task that exceeds the scope of this ticket. But maybe not? 🤔

That's a good approach but it would require (I assume) rather large refactorings.
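
(To make the refactoring idea above more concrete, a purely hypothetical sketch of such an abstraction: lib/ services get handed a small connection provider that already knows which query group to use, so the client vs repo decision is made once at wiring time. None of these class or group names exist in Wikibase today.)

use MediaWiki\MediaWikiServices;
use Wikimedia\Rdbms\IDatabase;
use Wikimedia\Rdbms\ILoadBalancer;

class WikibaseReplicaDbProvider {
    /** @var ILoadBalancer */
    private $loadBalancer;
    /** @var string[] */
    private $groups;

    public function __construct( ILoadBalancer $loadBalancer, array $groups ) {
        $this->loadBalancer = $loadBalancer;
        $this->groups = $groups;
    }

    public function getReplica(): IDatabase {
        // The group decides which replica pool the DBAs route this to.
        return $this->loadBalancer->getConnection( DB_REPLICA, $this->groups );
    }
}

// Wiring sketch: client wiring passes a 'wikibase-client' group, repo wiring
// a 'wikibase-repo' group (both names hypothetical); lib/ code just calls
// getReplica() and never decides the group itself.
$clientDb = new WikibaseReplicaDbProvider(
    MediaWikiServices::getInstance()->getDBLoadBalancer(),
    [ 'wikibase-client' ]
);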

about term store:
If 56% of the load is term store in total and that load is split roughly 60/40 between clients and repo, then that means that ca. 33% of load comes from client term store requests and 23% from repo term store requests. How do term store requests from commons fit into this?

Commons is considered part of the repo; even client requests in wikidata are considered part of the repo, since there is no easy way for lib/ code to determine whether the request is coming from client code or repo code. So I assume the actual client read share is much higher (but OTOH, it doesn't matter that much, since issues in wikidata client reads affect the repo behavior too, as they are the same wiki).

Can client wikis on one hand and repo/wikidata.org on the other hand function without the term store data, maybe by gracefully falling back to QIDs everywhere? If yes, then groups splitting termstore/not termstore seem like they would make sense. But if everything goes down if the term store is down, then a split clients/repo would seem more useful. Does that make sense?

In some cases yes, but in other cases they fall back to reading from ES (= loading the whole entity), which in turn would bring down everything; fixing this IMO should not be too big a task.

about other types of requests:
Also, there seem to be some requests that maybe don't make up much load but still (sadly) lead to fatal errors if they don't succeed, like sitelinks. (Judging from everything going down when that table was deleted.)
We should probably make sure they are grouped into the "critical cluster" as well?

Hmm, good point. I think its load was around 6%, which isn't big but isn't small either. The table you're mentioning is wb_items_per_site; it is really similar to the term store but just not big enough to have its own section. I assume we can group it with the term store and call it the "wikidata secondary data store" or something like that.

  • Finding out how many of the queries are coming from the API appservers (e.g. the ratio of open connections) can be done by the DBAs, I assume; I can't find any other way to determine the API/non-API context of queries.

This is not super easy to do though; some things can be tried, but it is not straightforward.
I.e. we can remove the DBs that are serving the API from the main traffic section and see how many QPS they do.

  • There's a trade-off: if we want to achieve resiliency (e.g. by separating portions of the queries to a dedicated node), it would decrease the scalability and flexibility of the infra. I recommend getting the DBAs' opinion on the decision too.

Agreed, we need to think carefully about the tradeoffs. These are some of the things that having a new group implies:

  • We need at least 2 hosts per DC (for HA)
  • We definitely cannot have different schema on those
  • We need to be able to serve their load with the rest of the hosts (in case both hosts go down, or in case we need to switch them around for maintenance). We cannot make them snowflakes
  • Load should not be the only factor in building groups; the type of queries and the tables they touch matter too. For example, recentchanges is only 1% of the load on s8 (measured as total latency, not number of queries), so it doesn't make sense to have a dedicated node (per DC) for it, but localizing those queries to one node makes sense so they can take advantage of MySQL caching (of course, that node should not be handling only rc queries, otherwise it would be a huge waste of resources).

+1 indeed!

So, to take a baby step towards moving this ticket forward: based on the discussion so far I can see two possible ways we might split things:

Option 1: One extra load group for Term requests only
Reasoning

Term requests are prone to spikes if there are caching issues or other problems. This was discussed above: over 50% of our total db load comes from term requests, and that is only the roughly 1% of term requests that did not hit the cache. Other sources of problems could be term requests that are still entirely uncached, like the issue in T263999. Given this high potential for problems, it would be nice to isolate them and reduce the likelihood of everything coming down if anything goes awry for terms.

What it would look like
  • term requests in one load group
  • everything else in a second default load group
still to do / find out
  • ensure we do not fall back to ES or at least add some smart way (load shedding?) of not overloading the main replica db with term requests failing over
  • ensure all consumers querying the db for terms have proper fallback behavior if the database is not available (not caching bad responses, not throwing a fatal error)
Option 2: Split out both repo and client requests from the default requests
Reasoning

Trying to shield the Wikipedias/clients from repo-specific problems on Wikidata. In order to work, requests coming from the client may need to make db requests that are not attributable to the client, such as loading an entity revision through some core service. The same might be true for requests coming from the repo. The split proposed here could improve stability, as a bug that affects only direct repo queries would not necessarily impact the default and client load groups, only its own.

What it would look like
  • known client requests in one load group
  • known repo requests in another load group
  • everything else, incl. core requests in a third load group
still to do / find out
  • share of term requests that are to be attributed to client/repo (see discussion in T262923)
  • What would this mean for "client" requests coming from Wikidata or Commons?
  • Is it possible to make the load group dynamic based on which client wiki the request is coming from? I.e. can we have a load group just for client requests from enwiki?
  • the result of T262924 seems to indicate that we should be able to differentiate repo/client requests in lib/
Option 3: Something something API Requests?

I still understand them and their impact, origin and consequences way too little to make a case for splitting along these lines.


Do the above make any sense? Should we continue investigating them further and flesh them out more or are we still missing crucial high-level considerations?

Please note that I'm not a database/Ops expert, not even close. So it is entirely possible, even very likely that I got things wrong and lacked awareness of important aspects. Please review the ideas above carefully and skeptically.

I feel this now needs more feedback again before we can think about further steps.

Looking at the ACs I think we have some open questions still

  • How feasible is it to introduce a "client" and "repo" db load group and have it used in the right places in our code? / how could that be tackled.
  • How feasible is it to introduce a "terms" db load group and have it used in the right places in our code? / how could that be tackled.
  • Is it possible to do these simultaneously?

These are my opinions only, if anyone wants to double check, that'd be great.

Looking at the ACs I think we have some open questions still

  • How feasible is it to introduce a "client" and "repo" db load group and have it used in the right places in our code? / how could that be tackled.

That's not easily feasible, as lots of queries come from lib/, data-access/ or core itself (e.g. when looking up entity info using a page info lookup), and it's pretty hard to find out in what context this code is being run.

  • How feasible is it to introduce a "terms" db load group and have it used in the right places in our code? / how could that be tackled.

This one is pretty easy, since all ->select calls are contained in the terms codebase

  • Is it possible to do these simultaneously?

There are several ways:

  • Having three groups: client, repo, terms. The first two are complicated, but not too much (if we don't want to count the cases involving core), while the third is easy. The biggest concern would be that if we split out the term store reads, there's not much left to split further, especially if we don't count queries originating from core; it'll basically be wb_items_per_site reads (for interwiki links)
  • Having four groups: client, repo, repo-terms, client-terms. Not much harder than the above, but I don't see the point; it wouldn't take advantage of cache locality.

Looking at the ACs I think we have some open questions still

  • How feasible is it to introduce a "client" and "repo" db load group and have it used in the right places in our code? / how could that be tackled.

That's not easily feasible, as lots of queries come from lib/, data-access/ or core itself (e.g. when looking up entity info using a page info lookup), and it's pretty hard to find out in what context this code is being run.

I guess all this needs though is an abstraction to the DB layer that deals with this and adds the groups where needed?
Even lib services ultimately are created by either client or repo?

  • Having four groups: client, repo, repo-terms, client-terms. Not much harder than the above, but I don't see the point; it wouldn't take advantage of cache locality.

I don't think we would benefit from having groups such as repo-terms, or 2 groups joined up, as this removes flexibility in where the DBAs can end up routing these? Perhaps?
Might be worth checking with them?

The final remaining thing to figure out here would be the AC that reads:

  • We can move forward from the results of this investigation and implement some db groups that will help the DBAs to split load in a way that will help prevent incidents.

So we need to answer which direction we want to go in. Seemingly either terms, or client vs repo, or both?
CC @Ladsgroup @Michael @WMDE-leszek Should this move back to todo? Or do we want to try to come up with this final answer not at the camp? (But I guess here in stalled/waiting nothing will happen.)

Looking at the ACs I think we have some open questions still

  • How feasible is it to introduce a "client" and "repo" db load group and have it used in the right places in our code? / how could that be tackled.

That's not easily feasible, as lots of queries come from lib/, data-access/ or core itself (e.g. when looking up entity info using a page info lookup), and it's pretty hard to find out in what context this code is being run.

I guess all this needs though is an abstraction to the DB layer that deals with this and adds the groups where needed?

Yup, but LB/LBFactory is being used directly in lots of places, and if we want to properly inject the abstraction, it is going to be really complicated (and pretty big). Some of the core classes don't even accept an LB/LBFactory. I'm not saying it's not doable; I'm saying it'll be too much work for too little gain IMHO.

Even lib services ultimately are created by either client or repo?

  • Having four groups: client, repo, repo-terms, client-terms. Not much harder than the above, but I don't see the point; it wouldn't take advantage of cache locality.

I don't think we would benefit from having groups such as repo-terms, or 2 groups joined up, as this removes flexibility in where the DBAs can end up routing these? Perhaps?
Might be worth checking with them?

Yeah, I'm not sure splitting by both groups would be a good idea (unless we grow really really big.)

The final remaining thing to figure out here would be the AC that reads:

  • We can move forward from the results of this investigation and implement some db groups that will help the DBAs to split load in a way that will help prevent incidents.

So we need to answer which direction we want to go in. Seemingly either terms, or client vs repo, or both?

I'd say we should collectively decide this. Maybe a meeting would be good?

CC @Ladsgroup @Michael @WMDE-leszek Should this move back to todo? Or do we want to try to come up with this final answer not at the camp? (But I guess here in stalled/waiting nothing will happen.)

Addshore claimed this task.

Marking this as resolved, as the investigation part is done and some implementation to make an initial split has happened in T281453.