FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster
Open, Medium, Public

Description

This is part of the annual plan for FY 25-26. The Commons database has grown to an extremely large size (2TB) and we have no choice but to split it. The most cost-effective way is to move the links tables to a dedicated cluster (x4), which would cut its size in half.

These tables must be moved to x4:

  • linktarget
  • externallinks
  • pagelinks
  • templatelinks
  • categorylinks
  • collation
  • imagelinks
  • globalimagelinks
  • iwlinks
  • existencelinks

Maybe:

  • langlinks

This can be done by introducing a new core virtual domain (virtual-links) and then migrating queries. Any query that joins with page or other tables must be split into two queries hitting different connections. If that is not possible, it should be done in a different way (e.g. T309738: Move MediaWiki QueryPages computation to Hadoop) or be disabled on Commons.
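As a rough illustration of splitting a join into two queries on different connections, here is a minimal sketch in Python using sqlite3. It is not MediaWiki code: the two in-memory databases are stand-ins for the core and virtual-links connections, the schemas are heavily simplified, and `make_connections` and `titles_linking_to` are hypothetical names used only for this example.

```python
import sqlite3


def make_connections():
    """Create two in-memory databases standing in for the two clusters."""
    core = sqlite3.connect(":memory:")   # stand-in for the core (s4) connection
    links = sqlite3.connect(":memory:")  # stand-in for the virtual-links (x4) connection

    # Simplified page table, kept on the core connection.
    core.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
    core.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [(1, "Main_Page"), (2, "Sandbox"), (3, "Help")],
    )

    # Simplified pagelinks table, kept on the links connection.
    links.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_target_id INTEGER)")
    links.executemany(
        "INSERT INTO pagelinks VALUES (?, ?)",
        [(1, 10), (3, 10), (2, 11)],
    )
    return core, links


def titles_linking_to(core, links, target_id):
    """Replace 'pagelinks JOIN page' with two queries on separate connections."""
    # Query 1 (links connection): IDs of the pages that link to the target.
    rows = links.execute(
        "SELECT DISTINCT pl_from FROM pagelinks WHERE pl_target_id = ?",
        (target_id,),
    ).fetchall()
    page_ids = [row[0] for row in rows]
    if not page_ids:
        return []

    # Query 2 (core connection): resolve those IDs to titles.
    placeholders = ",".join("?" * len(page_ids))
    rows = core.execute(
        f"SELECT page_title FROM page WHERE page_id IN ({placeholders}) "
        "ORDER BY page_title",
        page_ids,
    ).fetchall()
    return [row[0] for row in rows]
```

The cost of this pattern is that the intermediate ID list travels through the application, so a real migration would likely need to batch large ID sets and would lose the ability to sort or filter on columns from both tables in a single query.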

We need to set up backups and wikireplicas too.

For now, we should assume the page and redirect tables will exist in both databases. Depending on how many queries join directly against them and are not easy to split, we could consider having those tables shared.

Event Timeline

Ladsgroup triaged this task as Medium priority. Jul 4 2025, 12:05 PM
Ladsgroup moved this task from Inbox to Epic - Database on the Data-Persistence board.
Aklapper renamed this task from WE 6.4.1: Move links tables of commons to a dedicated cluster to FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster. Jul 4 2025, 1:35 PM
  • For globalimagelinks we may consider making it functionally separate (using a second virtual domain, not using core linktarget, etc.) given T241053#10001833.
  • We may also consider moving the links tables in Wikidata. pagelinks is currently the second largest table after terms being moved to a dedicated cluster.

Doesn't this require communication towards the community, similar to how the x3 change was announced for the Wikidata terms tables?

Yup, but we are at least several months away from the split actually happening. There are a lot of complex issues that need to be solved, e.g. how to replicate the page table from s4 to x4 (and the consolidation mechanism), and so on.

I noticed in a few commits that there is not just virtual-links but also a narrower virtual-categorylinks. Is there more detail somewhere on how that is meant to work exactly? The current plan describes a dedicated cluster for the link tables, which is easy to reason about, and the task lists which tables are available to query/join there (with page+redirect earmarked as special TBD).

A domain like virtual-categorylinks suggests more decoupling, but I couldn't find any discussion or docs on what it is meant to be used for or which tables we can assume to exist there. Some of the queries using this domain join against linktarget, for example.

The idea was mostly to ease the migration of read queries and make sure they are moved correctly (especially for the test setup). Once we are done, everything will be consolidated into virtual-links.