Compacting the revision table
Open, Normal, Public

Description

This is the task for the schema change project documented on the wiki at User:Brion VIBBER/Compacting the revision table round 2. Part of the description is copied below.

Per ongoing discussion in ArchCom and at WikiDev17 about performance, future requirements, and future-proofing for table size, it's proposed to do a major overhaul of the revision table, combining the following improvements:

  • Normalization of frequently duplicated data into separate tables, reducing the duplicated strings to integer keys (sketched below)
  • Separation of content-specific from general revision metadata to support:
    • Multi-content revisions, allowing multiple content blobs to be stored per revision -- not related to compaction, but of great interest for the structured data additions planned for multimedia and articles
  • General reduction in revision table width / on-disk size, which will make schema changes easier in future
  • Avoiding inconsistencies in live index deployments
    • Ideally all indexes should fit on all servers, making it easier to switch database backends around in production
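
As a rough illustration of what the normalization means (the table and column names below are placeholders for this sketch, not a committed schema): the wide, frequently repeated string columns on revision are replaced by narrow integer keys into small, deduplicated lookup tables.

  -- Sketch only: the kind of lookup table the duplicated strings move into.
  CREATE TABLE actor (
    actor_id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    actor_user INT UNSIGNED NULL,           -- user id, NULL for anonymous edits
    actor_name VARBINARY(255) NOT NULL,     -- user name or IP address, stored once
    UNIQUE KEY actor_name (actor_name)
  );

  -- revision would then replace the wide, repeated rev_user_text / rev_comment
  -- strings with narrow integer keys such as rev_actor and rev_comment_id.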

Related Objects


Change 350097 had a related patch set uploaded (by Brion VIBBER):
[mediawiki/core@master] WIP - provisional revision table restructure

https://gerrit.wikimedia.org/r/350097

Reedy moved this task from Unsorted to Cleanup on the Schema-change board. Apr 26 2017, 3:35 PM

@brion I asked some questions about including a revert detection flag and various other bits of patrolling/edit review data on the RfC talk page, but they were never responded to. Now that this project is picking up steam, could you take a look at that? I feel like if we're ever going to do tasks like T19237 that propose modifying the schema of the revision table (or, alternatively, adding an auxiliary table with more information about revisions), now is the time, since this massive schema change provides a "free" opportunity to do other related/adjacent schema changes. If we decide that we're not going to do these things, that's fine, but let's make a decision before it's too late rather than after.

daniel added a subscriber: daniel. May 5 2017, 3:15 PM

This RFC is (tentatively) scheduled for a public discussion on #wikimedia-office on May 10th, 2pm PDT, 21:00 UTC, 23:00 CEST. @brion, will you be around?

> This RFC is (tentatively) scheduled for a public discussion on #wikimedia-office on May 10th, 2pm PDT, 21:00 UTC, 23:00 CEST. @brion, will you be around?

Another good question is whether @jcrespo and/or @Marostegui would be able to attend.

It depends on whether I will have been working for 13 hours that same day before the meeting starts, or whether something is down at the time :-)

Krinkle moved this task from Backlog to Schema on the MediaWiki-Database board. May 8 2017, 1:01 AM

I am also not sure if I will be around that late. Can't it be moved a bit earlier so it is not that late for Europeans?

Anomie added a comment. May 8 2017, 1:48 PM

> can't it be moved a bit earlier so it is not that late for Europeans?

I'd think that's largely up to Tim, since for him the meeting is at 7am. Timezones make it difficult to have everyone in the same meeting.

Oh, I forgot he is based in Australia!

jcrespo added a comment (edited). May 8 2017, 2:20 PM

We can have them at 8 CEST.

daniel added a comment. May 8 2017, 2:42 PM

@jcrespo Not if you want to have people from California; that's 11pm for them: https://www.timeanddate.com/worldclock/converter.html?iso=20170510T060000&p1=5671&p2=240&p3=224

Well, I guess we can ask @brion if he would be ok with a late night call...

Also, for the record, I'm personally not happy about calls at 8am.

We all live around the globe! And if we have to decide, I think your work and input is way more important than my unintelligible rants :-) I can still respond to questions asynchronously (in advance or after the fact).

Pages 6-11 summarize my thoughts on this:

(and probably I am being too naive) and I think @brion is going in a good direction, so I do not really have much else to say except to answer further questions.

brion added a comment. May 8 2017, 4:21 PM

@daniel yes, I will be around. Apologies, my schedule's been a bit bumpy lately! I'm ok with funky meeting times to accommodate the others, just warn me ahead so I can make sure everything's in my calendar.

@jcrespo I've got some general concerns which I'd love to hear more from you on:

  • Is there a danger to data locality / speed / caching etc. when we split things up into multiple tables? (This is the main warning feedback I get from domas.) I feel like the counter-force here is "with good table design the indexes will be in memory and the fetches are probably cheap anyway"; does that sound right? (We're already fetching user_name in a lot of places to cover user_text, as I understand it, so we're already on this path I think.)
  • Migration will require adding a few fields to revision first, before we can migrate data and then drop the old fields. Do we feel safe enough in terms of available disk space etc.? (I feel like if we were that afraid it would break immediately I'd have heard this already, but I want to triple-check ;)

@Catrope I'd be inclined to use a separate table for stuff like revert tracking, on the basis that it's tracking revert _events_ which happen _to_ revisions. Will take a peek and make a recommendation.

In the meantime I'll get back to cleaning up the work patch; got sidetracked on Revision internal API evolution for a bit.

Anomie added a comment. May 8 2017, 5:09 PM

> I can still respond to questions asynchronously (in advance or after the fact).

Probably the biggest issue last time was that no one knew whether adding a new column to revision would be slow (later answered at T153333#3234049, "yes"), so time was spent discussing that.

And we hadn't thought to ask you beforehand so we didn't have your view on the "64-bit hash stored as bigint" idea, although that option was mostly not discussed in the meeting anyway.

One question that may come up in this meeting (and/or in a meeting I have tomorrow at 20:00 UTC) is the filtering of rows on Labs, in the context of the comment table, and of the actor and content tables as well. In T153333#3239160 you didn't seem to think that having Labs filter comment row visibility based on revision-deletion status would be a problem; does that hold true for actors (which would be largely pointless if not deduplicated) and content (which will probably be deduplicated for some easy cases but not necessarily for more difficult cases) too?

I think the questions you both ask are good, and I do not have a perfect answer :-/. After all, I can only give you my opinion based on the work I do and things that I experience but you may not be doing every day (like schema changes). To clarify, the answer to most questions asking "Can we?" is "technically yes": we can alter the revision table as it is now and add a new column. Now, if you ask me whether we can do it in less than one year while not taking most of our time bandwidth, that is a different story... On other things I can be too conservative based on previous experiences, but I could be wrong. That is why I do not like to weigh in on things I have not tested myself, like I did with https://phabricator.wikimedia.org/T162138#3159663. I also have some personal biases: I like simple designs and iterative cycles (but you as developers of code may not want to create so many small releases).

Re:

> is there a danger to data locality / speed / caching etc when we split things up into multiple tables?

Of course, but that is why the design has to be *done around that*. I commented on the good thing about recentchanges: we duplicate some of the data, but in exchange we only keep a smaller subset of it, the freshest part.

That is one of the main issues against hashes as primary keys: monotonically increasing ids have that (locality) advantage, while hashes are cryptographically random (and remember that rows on InnoDB are stored in PK order). Even if we do not yet think about normalization, the latest indexes will be accessed more frequently, getting that advantage, except divided into smaller chunks. Do most of the revision accesses require comments? Don't take my opinion for granted; do tests like the one I did above! :-) Regarding actor, that is a more difficult question. My arguments in favor of it, given the small size of user_text, are more about rename operations than about revision optimization. Also, purely egoistically, smaller tables == faster schema changes == I can keep up with your needs better in the future. This is why I am adding primary keys now: even if I am wrong, changing it in the future is much easier than without them.

Re:

> do we feel safe enough in terms of available disk space etc?

We are in the middle of renewing servers, and after that we should have one TB free available on every server. The revision table is the largest one on some servers, and while in some cases there is the possibility of doing updates in place, you should keep in mind that, in order to do an alter on a table, you need as much free space as the table size plus some buffer. But you shouldn't worry about that at all. Disk is cheap; when I talk about sizes, I mean size thinking about memory and the buffer pool. Getting the comments aside would be great for me (aka the servers), because the alter table would take half the time and the space used temporarily would be half of it (allowing most of it to fit into memory, which means much faster, etc. etc.).

However, I am afraid of telling you to do X or Y, because in the past I have said so and you took my words too seriously. I also do not have a good metric of "how much time does it take to code X" vs. "how much does it take to do X maintenance", or how much available time you have to do so vs. how much I have. I should be serving your needs, not you mine, so I should only be giving suggestions on what's easier; development should not be compromised by maintenance unless it is a recurring thing (e.g. renaming users being a burden, inefficient storage taking too much memory).

A common misconception is that joins are slow. They are not, and in particular they are "faster" if, instead of a monolithic text-based table (cough, *links tables), we have several (big)int-filled in-memory tables.
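
Purely for illustration, here is the kind of query shape that stays cheap after normalization, assuming the hypothetical comment table and rev_comment_id column from the sketch earlier in this task:

  -- Illustrative only: after comment normalization, fetching recent history
  -- with edit summaries costs one extra join, but it is resolved through
  -- integer keys that sit comfortably in the buffer pool.
  SELECT rev_id, rev_timestamp, comment_text
  FROM revision
  JOIN comment ON comment_id = rev_comment_id
  WHERE rev_page = 12345
  ORDER BY rev_timestamp DESC
  LIMIT 50;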

Re:

> you didn't seem to think that having Labs filter comments row visibility based on revision-deletion

I don't, because there are similar problems on (I think it is) pageprops or user props. How difficult or easy it actually is would depend on the actual structure. If there was something that required looking up a separate table, we could put triggers in place to add the information to the same table to simplify this. Alternatively, we could have jobs generating deferred summaries of tables, like we currently have for watchlist: we cannot share the watchlist table, but there is a process that summarizes it and adds only stats, and people in general understood why raw access was not possible, while being thankful for having a summary pre-done instead of a view doing a GROUP BY, which is very slow.

My biggest issue in the past is that there was very little participation from developers on which columns were public, which ones were grey ("sometimes public, sometimes not, sometimes depending on the wiki"), and which were completely private. We on infrastructure do not have much visibility into that, especially for random obscure extensions. Knowing that you are concerned about that already gives me a positive evaluation of the state of things. I think the biggest problem with Labs will be "breaking" some tools with schema changes, especially ahead of a public release, so we should loop in the tools community soon.
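
Purely as an illustration of one of the options mentioned above (filtering at the replica-view level rather than with triggers or summary jobs), here is a sketch of a replica-side view. It assumes the hypothetical rev_comment_id column from the earlier sketch and the existing rev_deleted bit for suppressed comments:

  -- Sketch only: a Labs/replica view that hides the pointer to a comment
  -- whenever the comment has been revision-deleted (rev_deleted bit 2),
  -- so the underlying comment row is not reachable from that revision.
  CREATE VIEW revision_public AS
  SELECT rev_id, rev_page, rev_timestamp,
         IF(rev_deleted & 2, NULL, rev_comment_id) AS rev_comment_id
  FROM revision;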

Please do not give me too much attention; I believe you are all on the right track and know much more about MediaWiki than I do (you have 15 more years of experience than I do, you know every single quirk, and you know how to develop, unlike me!). I just handle the hardware! :-) I think design-by-committee sometimes overthinks things, and we need more wrong tests that are discarded quickly, and smaller incremental improvements that motivate both the users and the developers who do them. But again, that is my opinion.

Sorry for the wall of text.

This RFC was discussed in a public IRC meeting on May 10.

Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-05-10-21.02.html
Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-05-10-21.02.log.html

It seems the effort is on track, but various details still need thought.

brion added a comment (edited). May 20 2017, 1:30 PM

Per discussion at wmhack 2017 we're planning to split this into separate pieces we can work on in parallel and potentially deploy separately:

  • comment table (T153333) plus a temporary rev<->comment association table (see the SQL sketch after these lists)
    • consider also starting on use of comment in other tables, since it'll be more tractable
      • revision, archive, logging, recentchanges, image (img_description), oldimage (oi_description), filearchive (fa_description, fa_deleted_reason), ipblocks (ipb_reason), protected_titles (pt_reason)
      • extension tables: cu_log.cul_reason, cu_changes.cuc_comment, flow_revision.rev_mod_reason, globalblocks.gb_reason, global_block_whitelist.gbw_reason
    • some tables may need an association table added as well, others may be small enough
  • actor table plus a temporary rev<->actor association table
    • consider also starting on use of actor in other tables, since it'll be more tractable
      • revision, archive, logging, recentchanges, image, oldimage, filearchive, ipblocks (ipb_by)
    • some tables may need an association table added as well
  • content plus slot association table

(Each will have its own transition mode.)

and then:

  • dropping old fields and merging the comment & actor associations into revision/etc.
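
As a rough illustration of the transitional shape for the comment piece (table and column names here are placeholders, not a committed design): the new comment table holds each distinct summary once, and a temporary association table carries the rev<->comment link so that revision itself is not altered until the final "dropping old fields" step.

  -- Sketch only: transitional tables for the comment piece.
  CREATE TABLE comment (
    comment_id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    comment_text BLOB NOT NULL                       -- each distinct summary stored once
  );

  -- Written alongside revision during the transition; later merged into
  -- revision as an integer column and then dropped.
  CREATE TABLE revision_comment_temp (
    revcomment_rev        INT UNSIGNED NOT NULL,     -- -> revision.rev_id
    revcomment_comment_id BIGINT UNSIGNED NOT NULL,  -- -> comment.comment_id
    PRIMARY KEY (revcomment_rev, revcomment_comment_id)
  );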

MUST CONFIRM WITH DBAs WHICH TABLES ARE REASONABLE TO ALTER IMMEDIATELY AND WHICH ARE TOO BIG AND NEED EXTRA TRANSITION LOVE!

We hope to have core code workable in transitional state by Wikimania. :) Aggressive but we think doable. 'Long tail' of fixes may take longer, as will full deployments.

Need to get a better handle on what the extension landscape is: some must be updated for use of core tables, others will be potential candidates for making use of comment & actor.

Need to think about what to do about other DB backends.

One thing that we discussed briefly again during wmhack was partitioning/sharding of database tables.
We may want to put the page ID into the slot table, and perhaps also into the content table, to allow them to be sharded based on the page ID.

The assumption is that if multiple revisions are needed in a single request, they typically belong to the same page.
Note that adding the page ID to the content table prevents content rows from being re-used across pages. Such sharing, while theoretically possible, was never planned, however, since it is expected to yield little benefit for the expected use cases: the content of different pages will rarely be exactly the same. This assumption may fail for very "narrow" content slots, e.g. for tracking quality assessments, since many pages would potentially share the same quality assessment, e.g. "stub".
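
For concreteness, here is a sketch of what the MCR slot and content tables could look like with the page ID carried along as a sharding key; the table and column names are placeholders, not the agreed design.

  -- Sketch only: hypothetical MCR tables keyed by page so both can be
  -- sharded/partitioned by page.
  CREATE TABLE slots (
    slot_page     INT UNSIGNED NOT NULL,      -- page the revision belongs to
    slot_revision INT UNSIGNED NOT NULL,      -- -> revision.rev_id
    slot_role     SMALLINT UNSIGNED NOT NULL, -- e.g. main slot vs. auxiliary slot
    slot_content  BIGINT UNSIGNED NOT NULL,   -- -> content.content_id
    PRIMARY KEY (slot_page, slot_revision, slot_role)
  );

  CREATE TABLE content (
    content_page    INT UNSIGNED NOT NULL,    -- ties each blob to one page,
                                              -- which precludes cross-page reuse
    content_id      BIGINT UNSIGNED NOT NULL,
    content_address VARBINARY(255) NOT NULL,  -- pointer into blob storage
    PRIMARY KEY (content_page, content_id)
  );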

> The assumption is that if multiple revisions are needed in a single request, they typically belong to the same page.

The only really common case where slots/content for multiple revisions are needed that I can think of would be diffs, where the new and old content needs to be fetched.

I don't know how common it is for anything else to either dive into a page's history with content fetching for multiple revisions at once (where a page_id sharding would be useful) or to fetch recent changes with content for patrolling or other analysis (where a page_id sharding would be in the way).

> Note that adding the page ID to the content table prevents content rows from being re-used across pages.

It would also mean that content table rows would need to be updated when someone does Special:MergeHistory.

> The assumption is that if multiple revisions are needed in a single request, they typically belong to the same page.

That is not entirely correct. You ask for many revisions when (example query shapes follow this list):

  • you ask for all revisions of a page (s/ALL/501)
  • you ask for all revisions of a user (s/... you get the idea)
  • you ask for all revisions happening after a particular timestamp
  • If the assumption were true, revision would not have 6 indexes, because all accesses would be by PK (or a single key: user, page, etc.)
  • More examples: filtering revisions, even if you get only 1, normally requires reading a large number of non-returned revisions, as used to happen when users clicked on "<-- previous revision". Range scans like that are very common, and they are the main reason why revision is a huge pain in query time
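
To make those access patterns concrete, here are illustrative query shapes (using columns from the 2017 revision schema); only the first one lines up with sharding by page:

  -- History of one page: benefits from locality / sharding by page.
  SELECT rev_id, rev_timestamp
  FROM revision
  WHERE rev_page = 12345
  ORDER BY rev_timestamp DESC
  LIMIT 501;

  -- Contributions of one user: touches many pages, so a page-based
  -- shard key does not help here.
  SELECT rev_id, rev_page, rev_timestamp
  FROM revision
  WHERE rev_user = 42
  ORDER BY rev_timestamp DESC
  LIMIT 501;

  -- Everything edited after a point in time, across all pages.
  SELECT rev_id
  FROM revision
  WHERE rev_timestamp > '20170501000000'
  ORDER BY rev_timestamp
  LIMIT 501;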

Note that I am not saying we should not shard, or that other methods should not happen; I am just commenting on the current reality of things, which makes it not as simple as just deploying partitioning. I have expressed many times the problem with large range scans on revision querying millions of rows. Text or external storage, by contrast, despite being very large tables, are very efficient because they are always accessed through a primary key.

One thing that many people may not know is that we already have partitioning (by user) deployed on the special slaves, to help with performance on contributions counting and similar. It works, but in many cases it makes other kinds of queries slower.

Again, I am not pushing this back, I am happy to help in any kind of experimentation in this regard- although it may be easier for newer features like MCR tables, where it is designed from 0. I was just thinking that the direction was to avoid sharding by trying other things first.

daniel added a comment.Jun 1 2017, 3:34 PM

@jcrespo We can try other things first, but if we want to shard by page later, we'll have to add a column to the tallest of the tables. I thought we should rather do that right away. Having the page ID there may prove convenient for other things too.

Also note that my current idea is to shard the slot and maybe content tables. These would not be used when listing user contributions, only when loading actual page content.

Sharding the revision table is a different can of worms. But the revision table already has the page ID, so it's not something we need to discuss in this context.

MaxSem added a subscriber: MaxSem. Jul 10 2017, 10:49 PM
tstarling removed brion as the assignee of this task. Oct 17 2017, 1:07 AM
tstarling added a subscriber: brion.
daniel moved this task from Inbox to Epic on the Multi-Content-Revisions board.

What ever happened to this project? Is it on ice?

jcrespo added a comment (edited). Jan 10 2018, 11:51 AM

@kaldari I think most of the effort is currently going towards comment separation/normalization in T6715/T166733 (deployment in progress) and user normalization in T167246. This is such a generic, epic ticket that most work is done on those subtickets. It is not fast, because it requires a lot of code changes, technical debt cleanup (old data in a bad format) and long-running schema changes, but as far as I can see it is going well.

There is also work on MCR T174043, which should split the revision table into 2.

I think once those 3 refactorings are done, reevaluation will happen and we will see how compact revision is, improving at the same time how quickly newer schema changes can be done.

More specifically,

> More specifically,

(Added that last one as T184615.)

After those three components are complete, the idea is to declare this task "done", and if there is scope for further discussion at that point it should be its own task?

Change 350097 abandoned by Brion VIBBER:
WIP - provisional revision table restructure

Reason:
Abandoning this old experimental patch.

https://gerrit.wikimedia.org/r/350097

Nirmos added a subscriber: Nirmos. Apr 19 2018, 3:54 PM
daniel moved this task from Epic to Watching on the Multi-Content-Revisions board. May 7 2018, 10:44 AM

Changes that affect the revision table will usually affect the archive table as well. But there is also something that we could do with just the archive table. Perhaps we could create a page_archive table with columns named pa_id, pa_namespace, pa_title, pa_page_id, and pa_rev_count. The ar_namespace, ar_title, and ar_page_id fields would then be migrated to the new table, and we could add an ar_pa_id column to the archive table that points to a row in the page_archive table. When a page is deleted, all of its deleted revisions would share a single pa_id, and the total number of revisions the page had prior to the deletion would become the pa_rev_count. Finally, undeletion would delete row(s) from the page_archive table, or lower the pa_rev_count field(s) if necessary.
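
A sketch of that proposal in DDL form, using the column names from the comment above (types are guesses, not a worked-out design):

  -- Sketch only: proposed page_archive table for deduplicating page-level
  -- data out of the archive table.
  CREATE TABLE page_archive (
    pa_id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    pa_namespace INT NOT NULL,            -- namespace of the deleted page
    pa_title     VARBINARY(255) NOT NULL, -- title of the deleted page
    pa_page_id   INT UNSIGNED NULL,       -- page_id the page had before deletion
    pa_rev_count INT UNSIGNED NOT NULL    -- revision count at deletion time
  );

  -- archive would then gain an ar_pa_id column referencing page_archive,
  -- replacing ar_namespace, ar_title and ar_page_id on every archived row.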