
Compacting the revision table
Open, Medium, Public

Description

This is the task for the schema change project documented on the wiki at User:Brion VIBBER/Compacting the revision table round 2. Part of the description is copied below.

Per ongoing discussion in ArchCom and at WikiDev17 about performance, future requirements, and future-proofing for table size, it's proposed to do a major overhaul of the revision table, combining the following improvements:

  • Normalization of frequently duplicated data into separate tables, reducing the duplicated strings to integer keys
  • Separation of content-specific from general-revision metadata to support:
    • Multi-content revisions, allowing multiple content blobs to be stored per revision -- not related to compaction, but of great interest for the structured data additions planned for multimedia and articles
  • General reduction in revision table width / on-disk size, which will make schema changes easier in future
  • Trying to avoid inconsistencies in live index deployments
    • Ideally all indexes should fit on all servers, making it easier to switch database backends around in production

The specific changes and associated Wikimedia production tasks involved here are (a rough combined schema sketch follows the list):

  • Dropping rev_comment, adding rev_comment_id. (T166733, T215466)
    • Ready to go!
  • Dropping rev_user and rev_user_text, adding rev_actor. (T188327, T215466)
    • Ready to go!
  • Dropping rev_text_id, rev_content_model, and rev_content_format. (T238958, T238966)
    • Ready to go!
  • Fixing the type of rev_timestamp on old wikis to match tables.sql. (P8433)
    • Ready to go!
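Taken together, a rough sketch of what the combined change amounts to for the revision table (a sketch only; column types and defaults here are assumptions loosely based on tables.sql, and the real change is split across the separate tasks above):

  -- Illustrative combined ALTER; in production each step is its own migration.
  ALTER TABLE revision
    DROP COLUMN rev_comment,          -- replaced by the comment table (T166733)
    ADD COLUMN rev_comment_id BIGINT UNSIGNED NOT NULL DEFAULT 0,
    DROP COLUMN rev_user,             -- replaced by the actor table (T188327)
    DROP COLUMN rev_user_text,
    ADD COLUMN rev_actor BIGINT UNSIGNED NOT NULL DEFAULT 0,
    DROP COLUMN rev_text_id,          -- content now addressed via the MCR tables (T238958)
    DROP COLUMN rev_content_model,
    DROP COLUMN rev_content_format,
    MODIFY COLUMN rev_timestamp BINARY(14) NOT NULL;  -- align old wikis with tables.sql (P8433)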


Event Timeline


One thing that we discussed briefly again during wmhack was partitioning/sharding of database tables.
We may want to put the page ID into the slot table, and perhaps also into the content table, to allow them to be sharded based on the page ID.

The assumption is that if multiple revisions are needed in a single request, they typically belong to the same page.
Note that adding the page ID to the content table prevents content rows from being re-used across pages. Such sharing, while theoretically possible, was never planned, since it is expected to yield little benefit for the expected use cases: the content of different pages will rarely be exactly the same. This assumption may fail, however, for very "narrow" content slots, e.g. ones tracking quality assessments, since many pages could share the same assessment, e.g. "stub".
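A minimal sketch of what this could look like for the slots table, assuming MySQL hash partitioning and a denormalized slot_page column (the column name, types, and partition count are illustrative, not a committed design):

  -- Hypothetical: MCR slots table with a denormalized page ID so it can be
  -- partitioned/sharded by page. MySQL requires the partition column to be
  -- part of every unique key, hence the extended primary key.
  CREATE TABLE slots (
    slot_page BIGINT UNSIGNED NOT NULL,       -- proposed addition
    slot_revision_id BIGINT UNSIGNED NOT NULL,
    slot_role_id SMALLINT UNSIGNED NOT NULL,
    slot_content_id BIGINT UNSIGNED NOT NULL,
    slot_origin BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (slot_page, slot_revision_id, slot_role_id)
  ) PARTITION BY HASH (slot_page) PARTITIONS 32;
  -- The content table would need an analogous content_page column, which is
  -- what rules out sharing content rows across pages.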

The assumption is that if multiple revisions are needed in a single request, they typically belong to the same page.

The only really common case I can think of where slots/content for multiple revisions are needed would be diffs, where both the new and the old content need to be fetched.

I don't know how common it is for anything else to either dive into a page's history, fetching content for multiple revisions at once (where page_id sharding would be useful), or to fetch recent changes with content for patrolling or other analysis (where page_id sharding would be in the way).
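For illustration, a diff would fetch the slots of both revisions in one go, and both normally belong to the same page, so the query would touch a single page-based shard/partition (a made-up query shape using the hypothetical slot_page column sketched above, not an actual MediaWiki query):

  SELECT slot_revision_id, slot_role_id, slot_content_id
  FROM slots
  WHERE slot_page = 42                        -- the page being diffed
    AND slot_revision_id IN (1001, 1002);     -- old and new revision of the diff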

Note that adding the page ID to the content table prevents content rows to be re-used across pages.

It would also mean that content table rows would need to be updated when someone does Special:MergeHistory.

The assumption is that if multiple revisions are needed in a single request, they typically belong to the same page.

That is not entirely correct. You ask for many revisions when:

  • you ask for all revisions of a page (s/ALL/501)
  • you ask for all revisions of a user (s/... you get the idea)
  • you ask for all revisions happening after a particular timestamp
  • If that were true, revision would not have 6 indexes, because all accesses would be by PK (or a single key: user, page, etc.)
  • More examples: filtering revisions, even if you get only 1, normally requires reading a large number of non-returned revisions, as used to happen when users clicked on <-- previous revision. Range scans like that are very common, and they are the main reason why revision is a huge pain in query time (an example of such a range scan is sketched below the list)
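As an illustration of such range scans (a made-up example, not a specific production query), a contributions-style listing walks a secondary index over a potentially huge range instead of hitting the primary key:

  -- Range scan on an (actor, timestamp) secondary index; extra filters can
  -- force many non-returned rows to be read before the LIMIT is satisfied.
  SELECT rev_id, rev_page, rev_timestamp
  FROM revision
  WHERE rev_actor = 12345                 -- rev_actor per T188327; rev_user on the older schema
    AND rev_timestamp < '20170601000000'
  ORDER BY rev_timestamp DESC
  LIMIT 501;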

Note that I am not saying we should not shard, or that other methods should not happen; I am just commenting on the current reality of things, which makes it not as simple as just deploying partitioning. I have expressed many times the problem with large range scans on revision querying millions of rows. Unlike revision, tables such as text or external storage, despite being very large, are very efficient because they are always accessed through a primary key.

One thing that many people may not know is that we already have partitioning deployed on the special slaves (my user), to help with the performance of contributions counting and similar queries. It works, but in many cases it makes other kinds of queries slower.

Again, I am not pushing back on this, and I am happy to help with any kind of experimentation in this regard, although it may be easier for newer features like the MCR tables, where it can be designed from scratch. I was just thinking that the direction was to avoid sharding by trying other things first.

daniel added a comment. Jun 1 2017, 3:34 PM

@jcrespo We can try other things first, but if we want to shard by page later, we'll have to add a column to the tallest of the tables. I thought we should rather do that right away. Having the page ID there may prove convenient for other things too.

Also note that my current idea is to shard the slot and maybe content tables. These would not be used when listing user contributions, only when loading actual page content.

Sharding the revision table is a different can of worms. But the revision table already has the page ID, so it's not something we need to discuss in this context.

MaxSem added a subscriber: MaxSem. Jul 10 2017, 10:49 PM
tstarling removed brion as the assignee of this task. Oct 17 2017, 1:07 AM
tstarling added a subscriber: brion.
daniel moved this task from Inbox to Epic on the Multi-Content-Revisions board.

What ever happened to this project? Is it on ice?

jcrespo added a comment (edited). Jan 10 2018, 11:51 AM

@kaldari I think most of the efforts are currently going towards comment separation/normalization on T6715/T166733 (deployment in progress) and user normalization on T167246. This is such a generic, epic ticket that most work is done on those subtickets. It is not fast because it requires a lot of code changes, technical debt cleanup (old data in a bad format) and long-running schema changes, but as far as I can see it is going well.

There is also work on MCR T174043, which should split the revision table into 2.

I think once those 3 refactorings are done, a reevaluation will happen and we will see how compact revision is, improving at the same time how fast newer schema changes can be done.

More specifically,


(Added that last one as T184615.)

After those three components are complete, the idea is to declare this task "done", and if there is scope for further discussion at that point it should be its own task?

Change 350097 abandoned by Brion VIBBER:
WIP - provisional revision table restructure

Reason:
Abandoning this old experimental patch.

https://gerrit.wikimedia.org/r/350097

Nirmos added a subscriber: Nirmos. Apr 19 2018, 3:54 PM
daniel moved this task from Epic to Watching on the Multi-Content-Revisions board. May 7 2018, 10:44 AM

Changes that affect the revision table will usually also affect the archive table. But there is also something we could do with just the archive table. Perhaps we could create a page_archive table with columns named pa_id, pa_namespace, pa_title, pa_page_id, and pa_rev_count. The ar_namespace, ar_title, and ar_page_id fields would then be migrated to the new table, and we could add an ar_pa_id column to the archive table that points to a row in the page_archive table. When a page is deleted, all the deleted revisions would share a single pa_id, and the total number of revisions the page had prior to the deletion would become the pa_rev_count. Finally, undeletion would delete row(s) from the page_archive table, or lower the pa_rev_count field(s) if necessary.
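A minimal sketch of that idea, using the column names from the comment above (types, defaults, and indexes are assumptions):

  -- Hypothetical page_archive table: one row per deletion of a page.
  CREATE TABLE page_archive (
    pa_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    pa_namespace INT NOT NULL,
    pa_title VARBINARY(255) NOT NULL,
    pa_page_id INT UNSIGNED NOT NULL,
    pa_rev_count INT UNSIGNED NOT NULL,   -- revisions the page had before deletion
    PRIMARY KEY (pa_id)
  );

  -- archive rows would then point at a page_archive row instead of carrying
  -- ar_namespace / ar_title / ar_page_id themselves.
  ALTER TABLE archive
    ADD COLUMN ar_pa_id BIGINT UNSIGNED NOT NULL;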

Anomie updated the task description. May 17 2019, 3:36 PM
Anomie updated the task description. Sep 20 2019, 2:04 PM
Iflorez added a subscriber: Iflorez. Oct 3 2019, 7:04 PM
Anomie updated the task description. Nov 21 2019, 9:24 PM

Change 552339 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Alter revision for actor, comment, and MCR

https://gerrit.wikimedia.org/r/552339

Status: This task is now unblocked! I've uploaded a patch for the database part of it; however, there should probably be one before it that drops $wgMultiContentRevisionSchemaMigrationStage, or at least makes MediaWiki throw early if it's set to anything other than SCHEMA_COMPAT_NEW.

Anomie updated the task description. Dec 6 2019, 3:00 PM
Anomie updated the task description.