Page MenuHomePhabricator

RFC: Normalize MediaWiki link tables
Closed, ResolvedPublic

Description

  • Affected components: MediaWiki core:
    • Database schema..
    • ParserOutput, LinksUpdate, RefreshLinks job internals.
    • Special:WhatLinksHere, API:Backlinks etc.
  • Engineer for initial implementation: @Ladsgroup.
  • Code steward: Platform Engineering.

Motivation

This RFC is about these tables to be precise: imagelinks, pagelinks, templatelinks, categorylinks).

These link tables are among largest tables in any MediaWiki database. Here's an example from commonswiki:

wikiadmin@10.64.16.102(commonswiki)> SELECT table_schema, table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024 / 1024), 2) `Size in GB`  FROM information_schema.TABLES where table_name like '%links' order by (data_length + index_length) desc;
+-----------------+------------------+------------+
| table_schema    | Table            | Size in GB |
+-----------------+------------------+------------+
| commonswiki     | templatelinks    |     301.45 |
| commonswiki     | externallinks    |     179.16 |
| commonswiki     | categorylinks    |     175.10 |
| commonswiki     | globalimagelinks |     136.36 |
| commonswiki     | pagelinks        |      84.96 |
| commonswiki     | imagelinks       |      55.47 |
| commonswiki     | iwlinks          |      42.81 |
| commonswiki     | langlinks        |       2.74 |
+-----------------+------------------+------------+

In total they are responsible for 0.8TB of data in Wikimedia Commons. Most of these data is redundant because target fields are repeated over and over again, for example "License_template_tag‏‎" as value of templatelinks.tl_target has been repeated 53 million times. This can be easily normalized through page table but the problem is that the actual content in mediawiki binds to the text. You use {{License template tag}} and not the page id of Template:License template tag in commonswiki.

That has several consequences:

  1. You can link to non-existing page and store those links (which empowers special pages of "Most wanted categories" and "Most wanted pages").
  2. If a page moves, the links stays to the old target which either gets redirected, repurposed to a completely different page or fixed by a bot. This can be a hassle, we have bots to fix moved categories but fixing that is outside of scope of this RFC.
Requirements
  • Any individual table ideally remains under 50GB. Right now some of these are over 300GB in size. (DBA)
  • The total size of an entire wiki's db tables must remain well under 4TB.

This means tables must either be reduced in size, or split.


Exploration

Proposed solution
I propose to add a new table with this structure:

CREATE TABLE /*_*/title (
    title_id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    title_namespace INT NOT NULL,
    title_title VARCHAR(255) BINARY NOT NULL, -- Naming is hard
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/title_name_title ON /*_*/title (title_namespace, title_title);

(ts prefix used to be used in tag_summary table but it doesn't exist anymore, I killed it with my bare hands)

Then a lookup service would be introduced similar to NameTableStore but with a different caching strategy (caching only the most read lookups and not all in memcached) to provide a lookup from string to id and vice versa.

Then for each of the mentioned *links table:

  • We add a new column to the table
  • Turn on the migration stage on "write both, read old"
  • Run the maintenance script to populate the new column
  • Turn the migration to "write both, read new"
  • After a period, reading only from the normalized column and dropping the non-normalized column.

(Repeating the whole procedure for each table, I did something similar but more complex with change_tags table)

That would save a lot of space and addressed significant scalability issues for Wikimedia Commons and English Wikipedia (these tables are around 60 GB in wikidatawiki, rather big but not the most pressing storage issue there).

Disclaimer: I do this on my volunteer capacity, this is not owned by WMDE or WMF. I would be very happy if any team would like to help with this.

Related Objects

Mentioned In
T300222: Implement normalizing MediaWiki link tables
T297754: CVE-2022-28204: Whatlinkshere of heavily used properties in wikidata can be easily utilized as a DDoS vector
T188730: Enable statement usage tracking on Commons and Co
T245965: Image links from #ifexist:Media:... are not being registered properly on tawiktionary
T287380: Decide on the future of DPL
T42626: Standardise type of timestamp database fields (MySQL)
T283461: Change datatype of cl_timestamp from timestamp to binary(14)
T278973: Add a way to differentiate transcluding a redirect and transcluding a redirect and its target
T278236: Add *_direct_link to imagelinks and templatelinks
T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata
T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys)
T258098: Purge unused watchlist rows
T263437: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table)
T196950: Pages do not have stable identifiers
T248244: SUPPORT: wikibase instance space consumption
T241053: Normalize globalimagelinks table
T224348: Global rename of Fiona B. → Fiona*: supervision needed
T161066: Merge pagelinks, templatelinks and imagelinks tables
T221449: Redesign querycache* tables
Mentioned Here
T297633: <Tech Initiative> Improving Databases in MediaWiki
T300222: Implement normalizing MediaWiki link tables
T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys)
T20493: RFC: Unify the various deletion systems
T196950: Pages do not have stable identifiers
T191231: RFC: Abstract schemas and schema changes
T161671: Compacting the revision table
T166733: Deploy refactored comment storage
T188327: Deploy refactored actor storage
T212972: Remove reference to text fields replaced by the comment table from WMCS views
T215466: Remove revision_comment_temp and revision_actor_temp
T198312: Set the WMF cluster to use the new MCR-only schema
T208776: RFC: Introduce PageIdentity to be used instead of WikiPage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

For MCR, we are quite close (weeks) to T198312: Set the WMF cluster to use the new MCR-only schema, which will allow us to drop some columns from the revision and archive tables, namely:

  • revision.rev_text_id
  • revision.rev_content_model
  • revision.rev_content_format
  • archive.ar_text_id
  • archive.ar_content_model
  • archive.ar_content_format

For the progress on the actor and comment store migrations, you'll have to ask @Anomie. IIRC, these would render the following columns unused:

  • revision.rev_user
  • revision.rev_user_text
  • revision.rev_comment
  • archive.ar_user
  • archive.ar_user_text
  • archive.ar_comment

Not sure whether the plan is to introduce new columns for actor_id and comment_id, or whether the association will live in a separate table. IIRC, the code supports both options. Anomie will know.

I mentioned in TechCom that DB normalisation can sometimes cause performance problems with range queries. For example, if a page has a very large number of outbound links, and you want to page through them, sorted by title. In general, since an index can only aggregate fields in a single table, splitting a table may require the loss of such indexes, which can cause performance degradation. I'm not aware of any actual instances of this query pattern in the links tables, but Daniel asked me to make a note of this concern.

Is imagelinks local?

imagelinks records the local image link, so I'd say yes.

Specifically, if your wiki sentence-cases titles and you use a shared file repo which does not, would [[File:foo.png]] point to foo.png or Foo.png on the remote repo? I have no idea.

Probably Foo.png, as FileRepo deals with (local) Title objects so the uppercasing would have already happened before the shared repo even sees it.

  • So far these tables are up for normalization through the title table: pagelinks, templatelinks, imagelinks, categorylinks, querycache, qeurycachetwo for other tables, since there's not much duplication, I would not spend much time fixing them for now.

How does it sound?

If there are a significant number of queries that are joining xxxlinks→page on namespace+title, it may be worthwhile to add page.page_title_id as well so they don't have to go xxxlinks→title→page.

Eventually we probably should do the rest of the tables too. Lower priority is fine, but I'd not want to skip them completely.

Yeah, sorry, WhatLinksHere is the other direction, this would affect the links / images / templates / categories APIs (and probably index.php's action=info and edit previews). I guess all items coming from the contents of a single page is enough size limitation that this is not really concerning?

Action API links and templates can do the target namespace filtering on the PHP side in miser mode, as we already do for some other cases like namespace filtering in allrevisions or categorymembers. images and categories always return File- and Category-namespace pages, respectively, so they don't even have a target namespace filter.

Or, if we really want to preserve target namespace filtering, we could keep the namespace denormalized in the relevant links tables as we already do for e.g. pagelinks.pl_from_namespace (which denormalizes the namespace corresponding to pl_from).

Sure thing. Just, do you know timeline of changes for MCR and actor table refactorings so I make sure I don't step on any database's toes?

Probably @daniel and @Anomie can give some estimations on that

For actor, T188327 is the tracking task. All wikis are on write-both/read-new now. Assuming no more slow queries or other problems get reported, a rough guess is that we'd put write-new/read-new on test wikis on May 15, group 0 on May 20, group 1 on May 27 28, and group 2 on June 3. After that we'll have WMCS replica view changes (similar to T212972) and production schema changes to do.

BTW, the comment migration T166733: Deploy refactored comment storage has one more set of schema changes pending (I need to verify that the blocker in T212972#5057586 is now fixed).

Then there's also T215466: Remove revision_comment_temp and revision_actor_temp, which is blocked on T161671: Compacting the revision table, which is blocked on (most of) the above actor stuff plus the MCR stuff.

Since space was mentioned as a concern, I note that we can blank some no-longer-unused xx_comment fields from February–October 2018 that weren't already blanked during the comment migration (although hopefully we can just drop all except rev_comment soon). We can do the same for xx_user_text fields once we get to the point of making the WMCS changes for actor.

For the progress on the actor and comment store migrations, you'll have to ask @Anomie. IIRC, these would render the following columns unused:

More than that, there are similar fields becoming unused in several other tables as well.

Not sure whether the plan is to introduce new columns for actor_id and comment_id, or whether the association will live in a separate table. IIRC, the code supports both options. Anomie will know.

For all tables except revision, the new columns referencing actor_id and comment_id are already there and populated. For revision we currently have separate tables. The new referencing columns will be added to revision as part of T161671, then T215466 tracks migrating the data from the separate tables.

I mentioned in TechCom that DB normalisation can sometimes cause performance problems with range queries. For example, if a page has a very large number of outbound links, and you want to page through them, sorted by title.

On the other hand, if you're fine with sorting by the title ID rather than by namespace+title text there's no problem added. We already have special pages and such that sort by page_id for similar reasons.

Filtering by namespace or title-prefix could similarly be affected by normalization. Some of that was already discussed above.

One quick update. I worked a little bit on this in the hackathon. A POC is almost ready, I'm trying to finish it soon.
Also, by checking the code, it seems "templatelinks" table is the best option to start for two reasons: 1- the target column is only referenced around 20 times in the code, making the migration at code level easier than other tables (categorylinks is referenced around 80 times) 2- it's the biggest links table. (It's basically lowest and heaviest hanging fruit.)

Change 516059 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] [WIP/POC] Start TitleStore and TitleStoreMigration

https://gerrit.wikimedia.org/r/516059

I started the patch, It's way far from being ready.

I think it's ready to move forward, I don't see any open questions so far (correct me if I'm wrong)

@Ladsgroup we discussed this at the TechCom meeting. This proposal seems sensible at the database level, but there are two aspects that are not really clear: need and cost.

As for the need, more compact tables are certainly desirable, but how much of the problem is the current situation, really? Is there a pressing need to fox this now, do we expect this to become a problem within a year, two years? What drives this? Is the need the same for all the tables on all the projects? How do they related to each other? Perhaps @Marostegui or @jcrespo have answers top that.

The other aspect is cost. There is a lot of code touching these tables directly, and all of the code would need to be updated. Ideally, it would be updated in a way that would it make easier to change the schema later - that is, be introducing an abstraction layer. What would such an abstraction layer look like? How hard would it be to build? Would it be one service that covers all links tables, or one instance of the same class per links table, or a different service for each links table? How many places in the code will have to be changed to implement this proposal, for each links table?

You propsed to implement this for templatelinks first, since that table (on commonswiki) is the larges, with the most redundancy, and the cost is relatively small, since the table isn't accessed directly in too many places. That's probably a good plan, but we should have a good idea of what that would gain us, and how we'd want to go forward from there.

Manuel is on vacations ATM, I am glad to answer any questions, although DBAs need more concrete questions (e.g. we can answer how much space and iops would be saved for a particular wiki or table) as costs such as development time would be better calculated by the people involved on the Wikimedia-Rdbms code bits.

Personally, if you allow me to overstep beyond my duties and give my opinion as a very special kind of Mediawiki user, I would like to see this eventually done (in the way proposed or in a different way) as the current system is inefficient and would delay [as a good thing] the need for more drastic rearchitecturing like sharding or other. I of course understand it is not a simple change and has risks like queries becoming inefficient due to not being able to use covering indexes and other particularities, as well as other issues. However, I can also see (personal opinion) this is lower priority over other refactorings (like the ones related to the revision/slots tables and user); and for WMF infrastructure I would definitely like to see those finished first to understand if the main performance bottleneck would be here or somewhere else with higher priority.

If you allow me a humble suggestion, I would suggest, if @Ladsgroup wants to keep leading this, to make a POC (with no code changes except maybe a mockup of a single function/Special page and an alter of a single table of a large wiki on real data) to be able to answer @daniel's questions reliably. I can offer hw and a setup for that, I think, as well as support for metrics on performance optimization. It must be noted that I predict direct improvements (lower latency) may not be as many as other indirect one (lower disk footprint, faster backup and recovery, lower lag, and more resources available for other queries with the same hardware-consolidation) for this kind of optimization.

As for the need, more compact tables are certainly desirable, but how much of the problem is the current situation, really? Is there a pressing need to fox this now, do we expect this to become a problem within a year, two years? What drives this? Is the need the same for all the tables on all the projects? How do they related to each other? Perhaps @Marostegui or @jcrespo have answers top that.

Thanks for asking, as I mentioned before these tables have the worst effects on commonswiki where they almost take around 1TB of data. I can't give you an exact number on how much it will be reduced since I don't have a test node to test this on but my rough estimation is somewhere between half to one tenth. This indeed has lower priority than comment and actor migrations (which hopefully will be done soon) but I don't see anything database-related with higher priority so far (correct me if I'm wrong, wb_terms is also being tackled). Definitely @jcrespo can tell better. Abstract schema changes (T191231: RFC: Abstract schemas and schema changes) has higher priority for me too as they make doing this easier.

The other aspect is cost. There is a lot of code touching these tables directly, and all of the code would need to be updated. Ideally, it would be updated in a way that would it make easier to change the schema later - that is, be introducing an abstraction layer. What would such an abstraction layer look like? How hard would it be to build? Would it be one service that covers all links tables, or one instance of the same class per links table, or a different service for each links table? How many places in the code will have to be changed to implement this proposal, for each links table?

I made a POC that's very very basic and I'm pretty sure is not working but there should be a TitleStore class with some similarities to NameTableStore (plus a different way of caching) and one class like LinksTableStore (that takes name of the table and prefix as argument) and then we slowly clean places that hard-code querying these tables, We also already have LinkBatch class that I can take advantage of.

You propsed to implement this for templatelinks first, since that table (on commonswiki) is the larges, with the most redundancy, and the cost is relatively small, since the table isn't accessed directly in too many places. That's probably a good plan, but we should have a good idea of what that would gain us, and how we'd want to go forward from there.

If I can hw to test and give you some numbers, that would be great.

Krinkle renamed this task from Normalizing *links tables to Normalizing MediaWiki link tables.Jul 16 2019, 11:46 PM
Krinkle renamed this task from Normalizing MediaWiki link tables to Normalize MediaWiki link tables.
Krinkle added a project: Schema-change.

If I can hw to test and give you some numbers, that would be great.

I believe you have access to the test hosts we have? We can import tables and alter them as required if that'd help you with the tests

If I can hw to test and give you some numbers, that would be great.

I believe you have access to the test hosts we have? We can import tables and alter them as required if that'd help you with the tests

I don't have access to that node AFAIK, also this needs commons instead of wikidata.

If I can hw to test and give you some numbers, that would be great.

I believe you have access to the test hosts we have? We can import tables and alter them as required if that'd help you with the tests

I don't have access to that node AFAIK, also this needs commons instead of wikidata.

It also has commons there :-)
https://wikitech.wikimedia.org/wiki/MariaDB#Testing_servers

This may resolve T196950, as the title_id could serve as the "stable identifier" mentioned there.

@TK-999 Interesting correlation. Having to maintain fewer concepts would certainly be a win.

I am not sure, however, if that can be done safely in this case. When a page is renamed, should the page ID should follow the rename? Or should a title ID be repurposed?

This may resolve T196950, as the title_id could serve as the "stable identifier" mentioned there.

T196950 should probably be resolved by doing T20493, so the page_id is more stable.

When a page is renamed, should the page ID should follow the rename? Or should a title ID be repurposed?

The page ID follows the rename as it does now. If the title ID doesn't remain with the old title, we'd have to update all the links table entries that correspond to wikitext that still points to the old title.

This may resolve T196950, as the title_id could serve as the "stable identifier" mentioned there.

I'm afraid having title IDs doesn't resolve the problem of re-assigning page IDs during undeletion. The title ID is the same "identity" as the title itself, it's just more compact. It's not conceptually different. The crux of T196950 is partial undeletion, effectively resulting in a split of a page's identity. The only way to fix it is to manage deleted revisions by page ID instead of title. This means that there can at the same time be multiple deleted pages with the same title. But we should have this discussion on the other ticket :)

@jcrespo What would you say is the biggest or most important problem with our current links-tables design? E.g. is it the size it consumes on disk? If so, it would help if we could somehow quantify that with some numbers, thresholds or goal to aim for.

This information would in turn help TechCom present the RFC neutrally to a wider audience when inviting feedback on the proposed solution and/or other ideas that might solve the problem and/or reach the goal.

the size it consumes on disk

Not that: the size that creates an unnecessary large amount of iops, memory and cpu cycles, causing performance issues in terms of latency, locking and maintenance operations. The aim is not to reach 4TB for the total wiki disk space any time in the future, or the wikis will stop working completely. The aim is also to have a 15 minute recovery time for a database on a 1Gbit network. Also to have smaller tables (e.g. <50GB) that can be handled faster/in parallel when doing refactorings so we can serve developer's request faster.

If there was other way to split data per wiki and per table, this would be low priority.

Krinkle moved this task from Under discussion to P3: Explore on the TechCom-RFC board.

Thanks @jcrespo, I've updated the task to reflect these parameters.

Krinkle renamed this task from Normalize MediaWiki link tables to RFC: Normalize MediaWiki link tables.Apr 4 2020, 2:33 AM
Krinkle triaged this task as Medium priority.Jul 2 2020, 11:32 PM
Krinkle updated the task description. (Show Details)

@TK-999 Hi, I wonder if Fandom have perhaps done something in this area?

@TK-999 Hi, I wonder if Fandom have perhaps done something in this area?

@Krinkle Thanks for reaching out! Unfortunately we have not touched this area of MediaWiki (neither on 1.19 nor on 1.33). I am definitely following this conversation with interest, though :)

I just want to say that with the recent changes with revision table (MCR work). Now the biggest table of enwiki is pagelinks not revision table.

I just want to say that with the recent changes with revision table (MCR work). Now the biggest table of enwiki is pagelinks not revision table.

Oh, that is nice to know!

I am happy for the above, don't get me wrong!

But I may suggest a direction of priority of optimization, from a backup's perspective (and disaster recovery)- while I want this ticket happening at some point, I am currently more worried about the image table on commons:

enwiki (after compression):
-rw-rw---- 1 mysql mysql  62G Jan 12 08:59 externallinks.ibd
-rw-rw---- 1 mysql mysql  89G Jan 12 08:59 revision_actor_temp.ibd
-rw-rw---- 1 mysql mysql  99G Jan 12 08:59 templatelinks.ibd
-rw-rw---- 1 mysql mysql 155G Jan 12 08:59 pagelinks.ibd
-rw-rw---- 1 mysql mysql 157G Jan 12 08:59 revision.ibd

commonswiki (after compression):
-rw-rw---- 1 mysql mysql  90G Jan 12 08:58 globalimagelinks.ibd
-rw-rw---- 1 mysql mysql 135G Jan 12 08:58 categorylinks.ibd
-rw-rw---- 1 mysql mysql 179G Jan 12 08:58 externallinks.ibd
-rw-rw---- 1 mysql mysql 201G Jan 12 08:58 templatelinks.ibd
-rw-rw---- 1 mysql mysql 325G Jan 12 08:58 image.ibd

It is currently taking a lot to backup (and recover time-more than any other wiki in its entirety, including enwiki) as it is difficult to iterate over/split (string-based PK). Losing *link tables would be bad and there is room for optimization, but image-related tables have "canonical" data, so I would like to encourage people's next efforts towards pushing T28741. Even without changing anything on the actual file backend, there is lots of optimization opportunities, and lots of bugs coming from metadata loss in the last years.

Not voting against this, ofc, just providing my personal view of "what's the next big things/what's currently on fire" 0:-).

Is there an overarching task or a place folks can be kept in the loop as tables get close to being "too large" so that some sort of refactor will be necessary? Since usually the refactor itself will not fall to the dbas but they are the ones who will likely notice large tables or increases in growth the earliest.

Having just size of tables can lead to Goodhart's law. For example ES tables are massive (and being read a lot) but not a big scalability concern or image table is not that big but not having PK is contributing to scalability issues. It can have different aspects as well, like "taking backups is hard" to "we can't do any alter tables on this one anymore" to "it doesn't fit in InnoDB buffer pool" to "it's causing replication lag" and each require a different solution and some being contradictory (and require trade-off).

I'm definitely pro having better visibility of our databases but it's really to hard to build properly.

Ladsgroup claimed this task.

I'm boldly closing this as the discussion for the normalization has happened and it is now officially in roadmap of data persistence team (and other teams to tackle T297633: <Tech Initiative> Improving Databases in MediaWiki). There wasn't any objections against it in the first place since its creation (just discussing the details which got hammered throughout the years).

We are tracking the implementation in T300222: Implement normalizing MediaWiki link tables