
Normalizing *links tables
Open, Needs Triage · Public

Description

This is an RFC for normalizing the target columns of the *links tables (these tables, to be precise: imagelinks, pagelinks, templatelinks, categorylinks).

Problem
These *links tables are among the largest tables in MediaWiki installations. Here's an example from commonswiki:

wikiadmin@10.64.16.102(commonswiki)> SELECT table_schema, table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024 / 1024), 2) `Size in GB`  FROM information_schema.TABLES where table_name like '%links' order by (data_length + index_length) desc;
+-----------------+------------------+------------+
| table_schema    | Table            | Size in GB |
+-----------------+------------------+------------+
| commonswiki     | templatelinks    |     301.45 |
| commonswiki     | externallinks    |     179.16 |
| commonswiki     | categorylinks    |     175.10 |
| commonswiki     | globalimagelinks |     136.36 |
| commonswiki     | pagelinks        |      84.96 |
| commonswiki     | imagelinks       |      55.47 |
| commonswiki     | iwlinks          |      42.81 |
| commonswiki     | langlinks        |       2.74 |
| testcommonswiki | pagelinks        |       0.00 |
| testcommonswiki | templatelinks    |       0.00 |
| testcommonswiki | externallinks    |       0.00 |
| testcommonswiki | iwlinks          |       0.00 |
| testcommonswiki | imagelinks       |       0.00 |
| testcommonswiki | categorylinks    |       0.00 |
| testcommonswiki | globalimagelinks |       0.00 |
| testcommonswiki | langlinks        |       0.00 |
+-----------------+------------------+------------+
16 rows in set (0.01 sec)

In total they are responsible for 0.8 TB of data in Wikimedia Commons. Most of this data is redundant because the target fields are repeated over and over again; for example, "License_template_tag" as a value of templatelinks.tl_target has been repeated 53 million times. This could easily be normalized through the page table, but the problem is that the actual content in MediaWiki binds to the text: you use {{License template tag}} and not the page id of Template:License template tag in commonswiki. That has several consequences:

  • You can link to a non-existing page and store those links (which powers the "Most wanted categories" and "Most wanted pages" special pages).
  • If a page moves, the links stay on the old target, which either gets redirected, repurposed to a completely different page, or fixed by a bot. This can be a hassle; we have bots to fix moved categories, but fixing that is outside the scope of this RFC.
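As an aside, the scale of the duplication can be measured directly on the current schema. A minimal sketch, assuming the existing pre-normalization target columns tl_namespace and tl_title:

-- Count how often each target is repeated as a full string in templatelinks.
SELECT tl_namespace, tl_title, COUNT(*) AS links
FROM templatelinks
GROUP BY tl_namespace, tl_title
ORDER BY links DESC
LIMIT 10;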

Proposed solution
I propose to add a new table with this structure:

CREATE TABLE /*_*/title (
    title_id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    title_namespace INT NOT NULL,
    title_title VARCHAR(255) BINARY NOT NULL -- Naming is hard
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/title_title ON /*_*/title (title_namespace, title_title);

(The ts prefix used to be used by the tag_summary table, but that table doesn't exist anymore; I killed it with my bare hands.)

Then a lookup service would be introduced, similar to NameTableStore but with a different caching strategy (caching only the most frequently read lookups in memcached, not everything), to provide a lookup from string to id and vice versa.
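At the SQL level, the lookup would boil down to something like the sketch below. The real service would live in PHP and handle caching and insert-if-missing, so this is only illustrative:

-- String -> id lookup (cache-miss path); namespace 10 is Template:.
SELECT title_id
FROM title
WHERE title_namespace = 10
  AND title_title = 'License_template_tag';

-- If no row exists yet, the service inserts one and returns the new id.
INSERT INTO title (title_namespace, title_title)
VALUES (10, 'License_template_tag');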

Then, for each of the mentioned *links tables:

  • Add a new column to the table
  • Turn the migration stage to "write both, read old"
  • Run a maintenance script to populate the new column
  • Turn the migration stage to "write both, read new"
  • After a period, read only from the normalized column and drop the non-normalized column

(The whole procedure is repeated for each table; I did something similar, but more complex, with the change_tags table. A rough SQL sketch of the first and third steps follows.)
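For templatelinks this might look like the following (tl_target_id is an illustrative column name, not a final one; the real backfill would run in batches from a PHP maintenance script):

-- Step 1: add the normalized column, nullable during the migration.
ALTER TABLE templatelinks
    ADD COLUMN tl_target_id BIGINT UNSIGNED DEFAULT NULL;

-- Step 3: backfill by resolving each (namespace, title) pair to its title_id.
-- A production script would batch this by primary key range.
UPDATE templatelinks tl
JOIN title t
    ON t.title_namespace = tl.tl_namespace
   AND t.title_title = tl.tl_title
SET tl.tl_target_id = t.title_id
WHERE tl.tl_target_id IS NULL;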

That would save a lot of space and address significant scalability issues for Wikimedia Commons and English Wikipedia (these tables are around 60 GB in wikidatawiki: rather big, but not the most pressing storage issue there).

Disclaimer: I am doing this in my volunteer capacity; this work is not owned by WMDE or WMF. I would be very happy if any team would like to help with this.

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper. · Apr 30 2019, 8:32 PM
Tgr awarded a token. · Apr 30 2019, 11:49 PM

Keep it simple and call it title?

There is some related discussion in T208776: Introduce PageIdentity to be used instead of Title.

Tgr added a comment. · Apr 30 2019, 11:58 PM

One thing maybe worth considering is that pagelinks and templatelinks always refer to local titles (for now), imagelinks IIRC can refer to titles in the shared image repo (Commons), and langlinks always refers to other wikis. This means links cannot be reliably normalized (is Foo the same title as foo?) and you have "titles" that are actually namespace + title, which locally might be an invalid title.
(How much are langlinks used these days, anyway? The sitelinks tracked via Wikidata are not duplicated there, right?)

Anomie added a comment. · May 1 2019, 4:20 PM

@jcrespo and I have been discussing this same sort of idea, in a "we should do that someday once all the revision schema changing is over" sort of way. We don't seem to have ever gotten around to filing a Phab task for it though. Thanks for filing one!

The general proposal I had in mind was more like

CREATE TABLE /*_*/title (
    title_id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,

    title_namespace INT NOT NULL,
    title_title VARCHAR(255) BINARY NOT NULL -- Naming is hard
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/title_title ON /*_*/title (title_namespace, title_title);

The inclusion of the namespace in there makes it map a bit more directly to the concept of "titles" in MediaWiki. There wouldn't be too much additional duplication of title-strings since the only really guaranteed duplicates are the subject-space and talk-space versions of a page.

We'd then use that in (doing a quick search in core) page (but would probably have to keep the denormalized page_namespace and page_title for query sorting), archive, pagelinks, templatelinks, possibly imagelinks, possibly categorylinks, recentchanges, watchlist, querycache, logging, job, redirect, querycachetwo, and protected_titles.

Migration would likely go much as you stated, although we might have to reconsider certain tables if they turn out to be too big to even add the extra column in a sane amount of time (as revision is currently, for example).

As @Tgr noted, this wouldn't work too well for externallinks, langlinks, or iwlinks as those aren't storing local titles.

Tgr added a comment. · May 1 2019, 8:53 PM

Is imagelinks local? Specifically, if your wiki sentence-cases titles and you use a shared file repo which does not, would [[File:foo.png]] point to foo.png or Foo.png on the remote repo? I have no idea.

+1 to add namespace to the "title" table.

Advantages:

  • This will reduce space on page, *links and others, increasing caching opportunities and reducing write size

Disadvantages:

  • Predicted bad query performance: not necessarily inherent to normalization, but it may require query changes or additional columns (e.g. page_exist, collation, etc.).

The titles need to be immutable (insert only) and include titles for pages that do not exist but are referenced (e.g. red links). Sadly this wouldn't allow for trivial category renames.

externallinks, langlinks, or iwlinks could be normalized in a different way (e.g. we could standardize on lang id + title id, needed by several extensions), but that would be out of scope here. I know of some extensions referencing cross-database tables (x1, commons, wikibase) that would love to have an id instead of a full title reference.
There are, I believe, also extensions that ambiguously use pages when they should use titles and/or the other way round.

would probably have to keep the denormalized page_namespace and page_title for query sorting

We definitely would need to keep the ns; the title could be joined, but the issue, as usual, is the join + sort, which usually bites us. However, the main concern here is the *links tables, not page. Every time a page, template or category is updated, sometimes gigabytes of text have to be rewritten. The number of rows changed would stay the same, but the number of bytes per record would not: think of changing an id from 32456 to 32457 (8 bytes) instead of rewriting 'Annex: Films created or produced by Belgian-Russian filmmakers in the 17th century by toe size' (95 bytes). The difference may not seem huge at first, but remember that the same title can be linked from anywhere between zero and millions of pages.

Is imagelinks local?

Probably not, but it doesn't matter much: "locally used images" is still a local concept (I believe), even if they are retrieved from Commons. Not everything has to migrate at the same time, though; I would start with one or a few tables at a time to avoid large duplication. We could even start by creating the title table without migrating any other table, just for querycache.

  • I'm convinced to use namespace. I can change the RFC to reflect the new schema. The only thing that bothers me is the title_title column. Can we bikeshed a little for a better name, like title_value?
  • I consider any *links table that uses a non-title as its target (externallinks, etc.) as outside the scope of this RFC. The issue with langlinks and iwlinks is that their values are very unlikely to be duplicated, so there's no point in normalizing them.
    • So far these tables are up for normalization through the title table: pagelinks, templatelinks, imagelinks, categorylinks, querycache, querycachetwo. For other tables, since there's not much duplication, I would not spend much time fixing them for now.

How does it sound?

How does it sound?

Looks great. A couple of requests:

  • Let's spend some extra time on the planning, with as many iterations as we need, so that the implementation is later as fast as possible
  • For WMF deployment: let's try to avoid overlapping with other refactorings. The actor and MCR ones, I think, are still in flight, and wb_terms wants to start soon too. These duplications are quite costly in space and iops, and we are running lower and lower per server with each additional one. This makes us (DBAs) very nervous, as all these were mostly intended as space-saving measures. I 100% understand they are not precisely easy or fast, but having so many refactorings in the implementation phase is uncomfortable from a resource perspective.
Tgr added a comment. · May 2 2019, 9:18 PM

+1 to add namespace to the "title" table.

If we do that, how would namespace filters in things like Special:WhatLinksHere work? Wouldn't we end up having to denormalize them in pretty much every table that uses titles?

I know some extensions referencing cross-databases (x1, commons, wikibase) that may love to have an id instead of full title reference.

Would be nice to have a conversation at some point about the future of lists. It would be great for a number of use cases to have a title table that's shared between all wikis (just like we sort of have for users), but that would mean not being able to join on it. In general it feels like we are pushing against the limits of MySQL with lists / usage tracking and should consider using something else. (But yeah, out of scope here.)

There are, I believe, also extensions that ambiguously use pages when they should use titles and/or the other way round.

There are a few things that use page ids as a means of getting short URLs (/w/index.php?curid=12345) when arguably they should be referencing titles instead (which would mean slightly different handling of moves, deletion etc) but there is no concise way of referencing those. E.g. image description page links generated in MediaViewer.
Then again, those use cases are now probably obviated by UrlShortener.

Probably not, but it doesn't matter much: "locally used images" is still a local concept (I believe), even if they are retrieved from Commons.

It matters in that the wiki knows how to normalize local titles, e.g. foo and Foo are the same title, unless you are on a wiktionary. With a foreign title that assumption might not hold.

If we do that, how would namespace filters in things like Special:WhatLinksHere work? Wouldn't we end up having to denormalize them in pretty much every table that uses titles?

Maybe I'm missing something, but the plan is to normalize the target (and not the source) in the pagelinks table. That wouldn't have any effect on Special:WhatLinksHere namespace filtering, as this special page is a traversal of the pagelinks table (you are getting the sources and you already have the target).
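In sketch form (column names illustrative), the special page resolves the target once and then scans for sources:

-- WhatLinksHere for [[Foo]]: resolve the target to an id once...
SELECT title_id FROM title
WHERE title_namespace = 0 AND title_title = 'Foo';

-- ...then fetch the sources. Source-namespace filtering uses pl_from
-- (and the denormalized pl_from_namespace), which target normalization
-- doesn't touch.
SELECT pl_from FROM pagelinks
WHERE pl_target_id = 42; -- id from the previous query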

Tgr added a comment. · May 3 2019, 7:32 PM

Yeah, sorry, WhatLinksHere is the other direction; this would affect the links / images / templates / categories APIs (and probably index.php's action=info and edit previews). I guess all items coming from the contents of a single page is enough of a size limit that this is not really concerning?

Yeah, sorry, WhatLinksHere is the other direction; this would affect the links / images / templates / categories APIs (and probably index.php's action=info and edit previews). I guess all items coming from the contents of a single page is enough of a size limit that this is not really concerning?

If there's a proper index in place, it should not be a concern (famous last words).

  • For WMF deployment: let's try to avoid overlapping with other refactorings. The actor and MCR ones, I think, are still in flight, and wb_terms wants to start soon too. These duplications are quite costly in space and iops, and we are running lower and lower per server with each additional one. This makes us (DBAs) very nervous, as all these were mostly intended as space-saving measures. I 100% understand they are not precisely easy or fast, but having so many refactorings in the implementation phase is uncomfortable from a resource perspective.

Sure thing. Just, do you know the timeline of changes for the MCR and actor table refactorings, so I make sure I don't step on any database's toes?

Ladsgroup updated the task description. · May 6 2019, 4:26 PM
  • For WMF deployment: let's try to avoid overlapping with other refactorings. The actor and MCR ones, I think, are still in flight, and wb_terms wants to start soon too. These duplications are quite costly in space and iops, and we are running lower and lower per server with each additional one. This makes us (DBAs) very nervous, as all these were mostly intended as space-saving measures. I 100% understand they are not precisely easy or fast, but having so many refactorings in the implementation phase is uncomfortable from a resource perspective.

Sure thing. Just, do you know the timeline of changes for the MCR and actor table refactorings, so I make sure I don't step on any database's toes?

Probably @daniel and @Anomie can give some estimations on that

For MCR, we are quite close (weeks) to T198312: Set the WMF cluster to use the new MCR-only schema, which will allow us to drop some columns from the revision and archive tables, namely:

  • revision.rev_text_id
  • revision.rev_content_model
  • revision.rev_content_format
  • archive.ar_text_id
  • archive.ar_content_model
  • archive.ar_content_format

For the progress on the actor and comment store migrations, you'll have to ask @Anomie. IIRC, these would render the following columns unused:

  • revision.rev_user
  • revision.rev_user_text
  • revision.rev_comment
  • archive.ar_user
  • archive.ar_user_text
  • archive.ar_comment

Not sure whether the plan is to introduce new columns for actor_id and comment_id, or whether the association will live in a separate table. IIRC, the code supports both options. Anomie will know.

daniel moved this task from Inbox to Under discussion on the TechCom-RFC board. · May 9 2019, 5:39 AM

I mentioned in TechCom that DB normalisation can sometimes cause performance problems with range queries. For example, if a page has a very large number of outbound links, and you want to page through them, sorted by title. In general, since an index can only aggregate fields in a single table, splitting a table may require the loss of such indexes, which can cause performance degradation. I'm not aware of any actual instances of this query pattern in the links tables, but Daniel asked me to make a note of this concern.
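A sketch of the query shape in question, with illustrative column names: after normalization, sorting a page's outbound links by title text needs a join to the title table, so no single-table index can cover both the filter and the sort:

-- Page through one page's outbound links, sorted by title text.
SELECT t.title_namespace, t.title_title
FROM pagelinks pl
JOIN title t ON t.title_id = pl.pl_target_id
WHERE pl.pl_from = 12345
ORDER BY t.title_namespace, t.title_title
LIMIT 50;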

Agabi10 added a subscriber: Agabi10. · May 9 2019, 8:35 AM

Is imagelinks local?

imagelinks records the local image link, so I'd say yes.

Specifically, if your wiki sentence-cases titles and you use a shared file repo which does not, would [[File:foo.png]] point to foo.png or Foo.png on the remote repo? I have no idea.

Probably Foo.png, as FileRepo deals with (local) Title objects so the uppercasing would have already happened before the shared repo even sees it.

  • So far these tables are up for normalization through the title table: pagelinks, templatelinks, imagelinks, categorylinks, querycache, querycachetwo. For other tables, since there's not much duplication, I would not spend much time fixing them for now. How does it sound?

If there are a significant number of queries that are joining xxxlinks→page on namespace+title, it may be worthwhile to add page.page_title_id as well so they don't have to go xxxlinks→title→page.
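To illustrate the suggestion (tl_target_id and page_title_id are hypothetical columns), page_title_id would collapse the two-hop join into one:

-- Without page.page_title_id: xxxlinks -> title -> page (two joins).
SELECT p.page_id
FROM templatelinks tl
JOIN title t ON t.title_id = tl.tl_target_id
JOIN page p ON p.page_namespace = t.title_namespace
           AND p.page_title = t.title_title
WHERE tl.tl_from = 12345;

-- With page.page_title_id: xxxlinks -> page directly (one join).
SELECT p.page_id
FROM templatelinks tl
JOIN page p ON p.page_title_id = tl.tl_target_id
WHERE tl.tl_from = 12345;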

Eventually we probably should do the rest of the tables too. Lower priority is fine, but I'd not want to skip them completely.

Yeah, sorry, WhatLinksHere is the other direction, this would affect the links / images / templates / categories APIs (and probably index.php's action=info and edit previews). I guess all items coming from the contents of a single page is enough size limitation that this is not really concerning?

Action API links and templates can do the target namespace filtering on the PHP side in miser mode, as we already do for some other cases like namespace filtering in allrevisions or categorymembers. images and categories always return File- and Category-namespace pages, respectively, so they don't even have a target namespace filter.

Or, if we really want to preserve target namespace filtering, we could keep the namespace denormalized in the relevant links tables as we already do for e.g. pagelinks.pl_from_namespace (which denormalizes the namespace corresponding to pl_from).
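A sketch of that option, by analogy with pl_from_namespace (pl_target_id and pl_target_namespace are hypothetical columns):

-- The links-API case: all links from one page whose targets are in a
-- given namespace, answered without joining the title table.
SELECT pl_target_id
FROM pagelinks
WHERE pl_from = 12345
  AND pl_target_namespace = 10;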

Sure thing. Just, do you know timeline of changes for MCR and actor table refactorings so I make sure I don't step on any database's toes?

Probably @daniel and @Anomie can give some estimations on that

For actor, T188327 is the tracking task. All wikis are on write-both/read-new now. Assuming no more slow queries or other problems get reported, a rough guess is that we'd put write-new/read-new on test wikis on May 15, group 0 on May 20, group 1 on May 28, and group 2 on June 3. After that we'll have WMCS replica view changes (similar to T212972) and production schema changes to do.

BTW, the comment migration T166733: Deploy refactored comment storage has one more set of schema changes pending (I need to verify that the blocker in T212972#5057586 is now fixed).

Then there's also T215466: Remove revision_comment_temp and revision_actor_temp, which is blocked on T161671: Compacting the revision table, which is blocked on (most of) the above actor stuff plus the MCR stuff.

Since space was mentioned as a concern, I note that we can blank some no-longer-used xx_comment fields from February–October 2018 that weren't already blanked during the comment migration (although hopefully we can just drop all except rev_comment soon). We can do the same for xx_user_text fields once we get to the point of making the WMCS changes for actor.

For the progress on the actor and comment store migrations, you'll have to ask @Anomie. IIRC, these would render the following columns unused:

More than that, there are similar fields becoming unused in several other tables as well.

Not sure whether the plan is to introduce new columns for actor_id and comment_id, or whether the association will live in a separate table. IIRC, the code supports both options. Anomie will know.

For all tables except revision, the new columns referencing actor_id and comment_id are already there and populated. For revision we currently have separate tables. The new referencing columns will be added to revision as part of T161671, then T215466 tracks migrating the data from the separate tables.

I mentioned in TechCom that DB normalisation can sometimes cause performance problems with range queries. For example, if a page has a very large number of outbound links, and you want to page through them, sorted by title.

On the other hand, if you're fine with sorting by the title ID rather than by namespace+title text there's no problem added. We already have special pages and such that sort by page_id for similar reasons.
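In sketch form (pl_target_id hypothetical), paging by the id stays a single-table index scan, e.g. on (pl_from, pl_target_id):

SELECT pl_target_id
FROM pagelinks
WHERE pl_from = 12345
  AND pl_target_id > 42 -- continue after the last id seen
ORDER BY pl_target_id
LIMIT 50;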

Filtering by namespace or title-prefix could similarly be affected by normalization. Some of that was already discussed above.

One quick update: I worked a little bit on this at the hackathon. A POC is almost ready; I'm trying to finish it soon.
Also, by checking the code, it seems the templatelinks table is the best option to start with, for two reasons: 1) its target column is only referenced around 20 times in the code, making the migration at the code level easier than for other tables (categorylinks is referenced around 80 times); 2) it's the biggest links table. (It's basically the lowest- and heaviest-hanging fruit.)

Change 516059 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] [WIP/POC] Start TitleStore and TitleStoreMigration

https://gerrit.wikimedia.org/r/516059

I started the patch; it's still far from ready.