
Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys)
Open, Medium, Public

Description

The most recent version of each image is stored in the image table, with historic versions stored in the oldimage table. This structure resembles cur and old, with all the nastiness that comes with it. In MW 1.5 we ditched cur/old in favor of page/revision, but we never did a similar thing for files.

We talked about this a bit at the 2011 Amsterdam Hackathon and decided that, while it's complex and hairy, we want to do it at some point in the future.


RFC

In a nutshell

Currently, the oldimage table contains information about non-current file revisions, while the image table contains information about the current file revision as well as unversioned information about the file in general. This means that to query all revisions of a file, MediaWiki has to query two tables. It also means that whenever a file gains a new revision, the "current" data has to be moved from one table to the other.

The approved changes will rename oldimage to filerevision and add the missing rows for the current file revisions. The image (now: file) table can then be reduced to containing just the generic file information. This matches the way the page and revision tables have been organised in the database since MediaWiki 1.5 (before the page/revision tables existed, MW used cur/old tables, which were similar to today's image/oldimage tables, and which are why we still have the notion of curid=pageid and oldid=revid).

Approved changes

  • Add fields (a rough schema sketch follows this list):
    • img_id: New primary key for image.
    • img_latest: Pointer to oldimage.oi_id for the current revision of a file (similar to how page.page_latest points to revision.rev_id)
    • oi_id: New primary key for oldimage.
    • oi_img: Pointer to image.img_id for the file this revision corresponds to (similar to how revision.rev_page points to page.page_id)
  • Rename tables:
    • Rename image to file.
    • Rename oldimage to filerevision.
  • Add missing rows in filerevision for current revisions, based on rows from image.
  • Reduce the fields in file to the minimum set of derived data we need indexed for the "current" revision. Currently indexed: img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime, img_size.
    • img_timestamp: Remove. Only used for file history, redundant now with the missing rows now present in filerevision.
    • img_user_text: Remove. Only used for file history (ApiQueryAllImages, SpecialMIMEsearch), these queries can instead query or join filerevision directly.
    • img_sha1: Keep. Needed for reverse look-up in duplication detection on current file revisions.
    • img_media_type, img_major_mime, img_minor_mime: Keep. Needed for SpecialMIMESearch. Note that (while out of scope of this task) there has been talk of creating a separate RFC for removing these fields in favour of a SearchEngine-based approach.
    • img_size: Remove. Not needed as indexed field. Only used by checkImages.php and Special:MediaStatistics (SUM query), these can directly query/join against filerevision.
    • img_width, img_height: Remove. Not needed as indexed field. Only existed because oldimage didn't contain current revisions, these fields are redundant in filerevision.
    • img_bits: Remove. Not needed. Redundant now with filerevision.
    • img_description: Remove. Not needed. Redundant now with filerevision.
    • img_user: Remove. Not needed. Redundant now with filerevision. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages), which could query/join filerevision instead, though they should probably use recentchanges.rc_bot instead (unrelated to this task).
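A rough sketch of what these approved changes could look like in SQL (MySQL/MariaDB syntax; the column types, index names and defaults are assumptions to be settled during implementation and code review, and image's existing PRIMARY KEY (img_name) is kept here as a unique index):

```sql
-- Illustrative only: add the new surrogate keys and pointers, then rename.
ALTER TABLE image
  DROP PRIMARY KEY,                                          -- currently PRIMARY KEY (img_name)
  ADD UNIQUE INDEX img_name (img_name),                      -- keep name lookups unique
  ADD COLUMN img_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (img_id),
  ADD COLUMN img_latest BIGINT UNSIGNED NOT NULL DEFAULT 0;  -- will point at oi_id

ALTER TABLE oldimage
  ADD COLUMN oi_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (oi_id),
  ADD COLUMN oi_img BIGINT UNSIGNED NOT NULL DEFAULT 0;      -- will point at img_id

RENAME TABLE image TO file, oldimage TO filerevision;
```

Backfilling the missing filerevision rows for current revisions and dropping the redundant img_* fields would happen in later steps, per the migration strategy below.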

Proposed migration strategy

The exact schema migration script will be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default DB updater, and an opt-in maintenance script for large farms such as the Wikimedia Foundation's). A few ideas so far (a rough SQL sketch follows the list):

  • Tim Starling, T589#2747454:
    • Rename image to filerevision. Create a view called image.
    • Add new fields to filerevision.
    • Create file table.
    • (Disable uploads.)
    • Populate file from filerevision.
    • Also move oldimage rows into filerevision. (These will be made invisible from the image view.)
    • (Deploy new MediaWiki version that uses file/filerevision.)
    • (Re-enable uploads.)
    • Drop image and oldimage.
  • Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.
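A very rough SQL sketch of a couple of the steps in Tim's list, purely to illustrate the shape of the migration (the compatibility view, the file_name/file_latest column names, and the use of MAX() to pick the latest revision are all assumptions; the real script would be written and reviewed as part of the implementation):

```sql
-- Rename image to filerevision and keep a pass-through compatibility view
-- named image so existing readers keep working while the migration runs
-- (a plain table rename keeps the existing img_* column names).
RENAME TABLE image TO filerevision;
CREATE VIEW image AS SELECT * FROM filerevision;

-- Populate the new file table from the current rows in filerevision,
-- assuming the numeric img_id from the approved changes has been added;
-- MAX(img_id) stands in for "the latest revision of each file" here.
INSERT INTO file (file_name, file_latest)
SELECT img_name, MAX(img_id)
FROM filerevision
GROUP BY img_name;

-- Later steps (moving oldimage rows into filerevision, narrowing or
-- dropping the image view, dropping image/oldimage) follow the list above
-- once the new MediaWiki version that uses file/filerevision is deployed.
```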

Details

Reference
bz26741

Related Objects

Event Timeline


Although this is approved, is anyone likely to work on it soon?

Trying to decide whether T146568: Add a primary key to oldimage is worth the effort of getting done, and worth the DBAs deploying it in the meantime (to potentially facilitate other schema changes). Chances are, if this task isn't going to be worked on soon, that one is more likely to be worth the effort.

For image I would ask for an estimate of when the image tables will be reformatted (if anyone plans to work on it soon); if it is long term, it may be reasonable to deploy a PK in the shorter term (it may even facilitate further schema changes). Edit: it already has img_name?

(read as oldimage, not image :))

Krinkle renamed this task from Migrate file tables to a modern layout (image/oldimage; file/file_revision) to Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys). Feb 8 2019, 6:15 PM

As an alternative, I proposed (to Timo, aloud) migrating straight to an MCR-based scheme (with a binaryPointer slot for the file reference) instead of jumping to this and then away again. I'll write it up into a task.

One comment re: "add primary keys": image already has a primary key; oldimage doesn't. I am not saying Krinkle thinks that, but I have seen other people make the same mistake about our recommendation: we ask for every table to have a PK, we don't ask for it to be a numeric/id one. There are reasons and advantages to doing that (in certain circumstances; in others there is no other way), but from a pure administration point of view any PK, string or numeric, is good enough.

Krinkle added a comment (edited). Feb 8 2019, 6:42 PM

@Jdforrester-WMF Yeah, just note, though, that I also mentioned (per the original RFC discussion) that we believed at the time it was infeasible to adopt MCR without first doing this intermediary stage, because oldimage has no primary key at all. We may, however, be able to do a slightly more minimal version, but it might make maintenance more complicated and error-prone. I think it will be very hard to do the kind of migration we did for the comment/actor tables on files without primary keys.


@jcrespo Thanks, I didn't realise that. I guess the unique index we used to have was easy to change to a PK. However, we don't want to use that as a foreign key from oldimage, given that files can be renamed. So while not for DBA purposes, we do still need a stable primary key img_id for image in addition to the img_name we have today.

I am also fearful of extreme normalization, as that can create scalability issues. I don't have any opinion, of course, until I see the proposal, but please keep me in the loop.

daniel added a comment. Feb 8 2019, 8:29 PM

If feasible, I'm in favor of porting the primary storage of image meta-data (including the filesystem path of the file) to MCR. The image table should probably be kept as a secondary storage of the relevant data of the current version of the image. oldimage and filearchive could both go away.

This means we have to inject rows into the revision table for oldimage (if no dummy revision exists for the upload). Similarly, we would have to inject rows into the archive table for rows in filearchive.

If we intend to drop these tables, it seems pointless to add primary keys to them now. On the other hand, migrating the data becomes a lot easier with primary keys in place. I'm not entirely sure the migration is possible without data loss if we don't have the primary keys in place.

Tgr added a subscriber: Tgr. Feb 8 2019, 9:23 PM

We need the file metadata for the current version of the file and basic info like size and mime type for the old/deleted versions, so we'd still need two tables, a page analog and a content analog.

It's also not clear how revision deletion would work (right now you can delete description page revisions without affecting the file, and file revisions without affecting the page). So definitely needs a more detailed spec.

daniel added a comment (edited). Feb 9 2019, 4:37 AM
In T28741#4939578, @Tgr wrote:

We need the file metadata for the current version of the file and basic info like size and mime type for the old/deleted versions, so we'd still need two tables, a page analog and a content analog.

These would not be analogs; they would be page and content (and revision and slots, to glue these together). The image table itself would remain, to provide access to some metadata without having to load the content object that contains that metadata.

It's also not clear how revision deletion would work (right now you can delete description page revisions without affecting the file, and file revisions without affecting the page). So definitely needs a more detailed spec.

The notion of "file revisions" would go away entirely. The upload history could still be displayed based on a filtered view of the page's revisions (by finding slots that have slot_origin = slot_revision_id and slot_role = FileMetaData)
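For illustration, a filtered upload-history query along those lines might look roughly like this (a sketch against the current MCR tables; the 'file-metadata' role name and the page ID are placeholders, since no such slot role exists yet):

```sql
-- List the revisions of a file page that introduced new content in the
-- (hypothetical) file metadata slot, i.e. the upload history.
SELECT rev_id, rev_timestamp
FROM revision
JOIN slots ON slot_revision_id = rev_id
JOIN slot_roles ON slot_role_id = role_id
WHERE rev_page = 12345                    -- placeholder: the file page's ID
  AND slot_origin = slot_revision_id      -- slot content originated in this revision
  AND role_name = 'file-metadata'         -- hypothetical slot role
ORDER BY rev_timestamp DESC;
```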

Tgr added a comment (edited). Feb 9 2019, 7:40 PM

These would not be analogs; they would be page and content (and revision and slots, to glue these together). The image table itself would remain, to provide access to some metadata without having to load the content object that contains that metadata.

So just turn oldimage into a key-value store, with keys stored in content_address? I guess that would work, as long as we don't care about searching in file history (which we currently don't). Operations that need information about all old file versions at the same time (such as image deletion, or rendering the history section of the image description page) would become a little more expensive (in theory a lot more expensive in anomalous cases, where someone creates a file page with few file revisions but a huge number of wikitext revisions).

The notion of "file revisions" would go away entirely. The upload history could still be displayed based on a filtered view of the page's revisions (by finding slots that have slot_origin = slot_revision_id and slot_role = FileMetaData)

There is no efficient way to do that query, although I guess having to do it on a huge number of revisions is an edge case that can be safely ignored. More problematically, it would tie file revision deletion to description revision deletion, as I said (e.g. you upload a new version of the file with a problematic non-free logo cropped out; now you'd have to delete the entire page history).

Tgr added a comment. Feb 9 2019, 7:42 PM

Also, file moves would take some thought - currently those change the name of the archive entries, I think, and content blobs are supposed to be immutable. Maybe we should do the switch to sha1 filenames first?

Tgr added a comment. Feb 9 2019, 7:44 PM

Also revision deletion would have to trigger file version deletion, but only if all revisions of the page that reference a certain version of the file have been deleted. That would have to be a whole new mechanism, and does not fit neatly into MCR which treats content objects as completely opaque.

In T28741#4940685, @Tgr wrote:

Also revision deletion would have to trigger file version deletion, but only if all revisions of the page that reference a certain version of the file have been deleted. That would have to be a whole new mechanism, and does not fit neatly into MCR which treats content objects as completely opaque.

Good catch, this indeed needs more thought.

jcrespo added a comment (edited). Feb 11 2019, 9:59 AM

As I said multiple times, hypernormalization would bring performance concerns; right now image and revision are handled in separate tables. Unless you also implement MCR sharding, we should try to keep separate concerns in separate tables. Let's not convert MediaWiki into Drupal. That doesn't mean image could not be a "slot", but leave metadata to its own separate table.

As I said multiple times, hypernormalization would bring performance concerns; right now image and revision are handled in separate tables. Unless you also implement MCR sharding, we should try to keep separate concerns in separate tables. Let's not convert MediaWiki into Drupal. That doesn't mean image could not be a "slot", but leave metadata to its own separate table.

Currently, every upload creates a row in the oldimage (or file) table as well as the revision table and the slots table (but not the content table). If we used MCR for storing the primary image meta-data, every upload would create a row in the revision, slots, and content tables (as well as the image table, if the file did not previously exist). filearchive and oldimage would go away, and with them the need to copy rows between these tables.

The idea is indeed to leave metadata of the current revision in its own separate table, but as a secondary copy, not primary information; and to remove the need for maintaining it for old or archived revisions.

I'm not sure I understand correctly what you mean by MCR sharding. What table would need to be sharded in your opinion? And which tables would grow significantly more than they do now?

Since we now have MCR, we could do T96384 and get rid of the need for filerevision.

Can we have a decision (probably by TechCom) on whether we should go with MCR or not? The original RFC was proposed exactly ten years ago and it might need a revisit.

Krinkle added a comment (edited). Thu, Jan 14, 1:41 AM

Can we have a decision (probably by TechCom) on whether we should go with MCR or not? The original RFC was proposed exactly ten years ago and it might need a revisit.

RFC T589 was approved in 2017 with knowledge of MCR. As I understand it, adding primary keys (the approved direction) would be a prerequisite regardless of whether and when we refactor it using MCR. The MCR refactor would also likely involve non-trivial product and user-facing decisions, and is perhaps better resourced as its own separate endeavour at a later date (and ideally by a team willing and able to commit to its long-term ownership).

The approved solution is fairly self-contained and seems more urgent given the benefits it will yield for database performance (the story for revision/archive in T20493 applies to image/oldimage as well).

Tgr added a comment. Thu, Jan 14, 5:44 AM

I imagine the first step for MCR would be to create a slot which just contains a filerevision ID. Completely merging filerevision into revision would be a lot more work, which should definitely not be attempted in a single step; and I'm not sure we have compelling reasons for doing it at all. So this task is a step towards MCR images.

T90300: Suppressed username shown on File pages probably has some overlap with this.

Thanks; it's good to see interested people here!

One thing that people may be aware of, but that I don't see mentioned, is that changes to image (file), in addition to oldimage (filerevision), may also impact filearchive. E.g. if we migrate to a numerical PK, we may want to keep that PK for filearchive. Further refactoring could be done there at the same time, but that would be out of scope of this particular RFC; it should, however, be taken into account so that the PKs/refactoring stay compatible. filearchive is in an even worse state than oldimage in terms of consistency, to the point of causing metadata loss.

I offer to start working on a data cleanup/consistency check as a preliminary step before the refactoring, to speed up later changes, if someone is available to help me.

In T28741#6746635, @Tgr wrote:

I imagine the first step for MCR would be to create a slot which just contains a filerevision ID. Completely merging filerevision into revision would be a lot more work, which should definitely not be attempted in a single step; and I'm not sure we have compelling reasons for doing it at all. So this task is a step towards MCR images.

I'm not convinced it would be a lot more work. Introducing filerevisions just to make it redundant again seems like massive overhead. Moving to MCR doesn't have to be done in a single step any more than introducing filerevision would be done in a single step.

Tgr added a comment. Thu, Jan 14, 11:27 PM

@daniel how would you imagine the end stage of the MCR migration? Put what's currently an image table row into a JSON object and store it as content on some non-main slot?

tstarling added a subscriber: tstarling (edited). Fri, Jan 15, 5:06 AM

The reason commonswiki.image is large is because of img_metadata -- predominantly text extracted from PDFs. I sampled 63269 images using img_sha1, and I found that the average size of an img_metadata field is 9712 bytes. It's enormous. Without compression, the storage space used by this field on commons would be about 616 GB. Breaking down metadata sizes by MIME types shows that 81% of img_metadata by size is PDFs, 10% is DjVu, 9% is JPEGs, and the rest is a rounding error. So the size of PDF/DjVu text layers on commons is about 561 GB.

So in terms of addressing @jcrespo's complaint at T222224#6738823 about backup sizes, the most important thing we can do is to rethink storage of PDF and DjVu text layers. CodeSearch indicates that this text is used for exactly two things: Wikisource (ProofreadPage) transcription defaults and search indexing. In both cases, high latency is tolerable, so moving this text to external storage would work.

Some of the metadata is much hotter. For example, the page count of a PDF is needed every time such an image is referenced during parse or any other kind of thumbnailing, but it's currently in the same multi-megabyte serialized blob as the text layer. That's why I'm talking about reconsidering text layers, rather than just moving the existing metadata blob to external store.

I used img_sha1 like '12%'. The sampling ratio is not as obvious as I thought since this is a base-36 hash, so the first two digits are not uniform. I believe the correct sampling ratio is 36^29 / 2^160 = 1/1076.5. I edited my comment above to reflect that. The size of the field summed over the whole table is assumed to be the size of the sample multiplied by 1076.5.
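For reference, the estimate could be reproduced with something along these lines (a reconstruction under the assumptions above, not necessarily the exact queries that were run):

```sql
-- Sample roughly 1/1076.5 of commonswiki.image by fixing a two-digit
-- base-36 prefix of img_sha1, then extrapolate the total img_metadata size.
SELECT COUNT(*)                           AS sampled_rows,
       AVG(LENGTH(img_metadata))          AS avg_metadata_bytes,
       SUM(LENGTH(img_metadata)) * 1076.5 AS estimated_total_bytes
FROM image
WHERE img_sha1 LIKE '12%';

-- The same sample broken down by MIME type (PDF vs DjVu vs JPEG etc.).
SELECT img_major_mime, img_minor_mime,
       SUM(LENGTH(img_metadata)) AS sampled_metadata_bytes
FROM image
WHERE img_sha1 LIKE '12%'
GROUP BY img_major_mime, img_minor_mime
ORDER BY sampled_metadata_bytes DESC;
```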

In T28741#6749820, @Tgr wrote:

@daniel how would you imagine the end stage of the MCR migration? Put what's currently an image table row into a JSON object and store it as content on some non-main slot?

No, the image table would stay much the same, but it would be a "secondary" table, basically specialized page_props. oldimage would probably also stay as a secondary denormalized view on revisions, to provide quick access to the upload history with thumbnails etc. File archive can probably go away. The metadata blob would either be in a slot, or written to the blob store separately.

The key change is that the source of truth for all changes to the page, be it edit or upload, will be the revision table.

Hey daniel, from a logical/data model point of view you have my support; anything that makes the image workflow more like the page workflow (more immutable/append-only; stable, unique identifiers) would be a win.

My worry would be for commonswiki (and possibly enwiki) going back to having problems of scaling due to inserting + scanning lots of rows on revision (or more technically, increasing the current ones). Thanks to your work, and others, we "stopped the fires" of the revision table, but we didn't win the war :-D. Please consider that deeper refactorings, especially those increasing the size and usage of a single table, may require sharding or some other physical optimization technique (e.g. revision_page and revision_image physically separated; or user_revision/page_revision/date_revision etc. clustered around a particular ordering, i.e. vertical partitioning) to avoid losing agile schema changes and fast backups and recoveries, and to avoid performance loss due to heavily used tables no longer fitting into memory.

I don't have specific suggestions or blockers, but I would like you to take into account the non-logical parts (operational cost). This is not only to make sysadmins' lives easier: fast schema changes also mean faster deployments 0:-O. Of course, this normally only applies to the top 3 traffic wikis (enwiki, commonswiki and wikidatawiki); most other wikis do not have these limitations.

My worry would be for commonswiki (and possibly enwiki) going back to having problems of scaling due to inserting + scanning lots of rows on revision (or more technically, increasing the current ones). Thanks to your work, and others, we "stopped the fires" of the revision table, but we didn't win the war :-D.

Since we already insert a "dummy" revision for every upload, this wouldn't increase the size or number of writes on the revision table. It would be exactly the same. Having more slots on all the file pages would increase the size of the slots table (if we do it retroactively for existing data, by 30% to 50%). That table is very narrow, so I'd hope that this would be acceptable.

Please consider that deeper refactorings, especially those increasing the size and usage of a single table, may require sharding or some other physical optimization technique (e.g. revision_page and revision_image physically separated; or user_revision/page_revision/date_revision etc. clustered around a particular ordering, i.e. vertical partitioning) to avoid losing agile schema changes and fast backups and recoveries, and to avoid performance loss due to heavily used tables no longer fitting into memory.

We recently reduced the width of the revision table. If the height of the revision table (and the slots and content tables) is still a concern, we should pick up the partitioning/sharding efforts again. This used to be on the roadmap last year, but got dropped as "no longer a concern", as far as I know. This is probably not the right place to discuss it, but it seems like it would be important to have a shared understanding of the urgency of such a project.

Don't worry, I see you are already very aware of the problem I mentioned, so that gives me confidence we will be able to sort it out when there is a concrete proposal. Sadly, scaling problems don't go away; other, larger ones just appear. :-)

IMHO, these are the main WMF DB issues as of now that are relevant to this ticket:

  • The size of the image table on Commons. I like @tstarling's comments, and I am guessing that would be easier to implement (I don't know)?
  • The lack of stable identifiers/numerical ids (general workflow) for image versions, which makes them difficult to back up/recover/identify and can cause data loss problems. Either Daniel's proposal or the one in the original RFC would work, but my guess is those are for the most part independent.

I don't have further insights on how to best solve both, unless you have specific questions for me.

CBogen added a subscriber: CBogen. Fri, Jan 15, 1:03 PM

The reason commonswiki.image is large is because of img_metadata -- predominantly text extracted from PDFs. I sampled 63269 images using img_sha1, and I found that the average size of an img_metadata field is 9712 bytes. It's enormous. Without compression, the storage space used by this field on commons would be about 616 GB. Breaking down metadata sizes by MIME types shows that 81% of img_metadata by size is PDFs, 10% is DjVu, 9% is JPEGs, and the rest is a rounding error. So the size of PDF/DjVu text layers on commons is about 561 GB.

There is a related discussion happening in T271493.

This issue is really interesting. I took a look at it a bit.

  • Yes, the text part of PDF/DjVu files in img_metadata is enormous. For example, the first PDF file in the list of files on Commons is File:!!Abajo los solteros!! - fantasía cómico-lírica gubernamental en siete cuadros y un real decreto, en prosa (IA abajolossolteros476riba).pdf, and if you look at its metadata section you won't see much data, but if you look at the metadata using the API you'll see a huge wall of data (presumably OCR output, and it seems actually next to useless). It's 64 KB; this will add up quickly.
  • Moving the text data makes a lot of sense, but to where? We need to decide that first. I have a proposal; I don't know very much about files and media handling in MW, so sorry if it's bad. Why not load the text layer by reading the metadata directly from the file in the file backend? It doesn't happen often at all. The only use case I found was ProofreadPage (but I'm sure Elastic also uses it), and both could just look up the metadata from the file backend (or through some API?). If that's not a good idea, we could have another table and move that to x1/etc., whatever the DBAs would prefer (a rough sketch of that option follows this list).
  • I highly recommend fixing the text layer of the metadata first, because it makes the rest of the work much easier. Once the table is ten times smaller, adding and dropping indexes will be easier too.
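A minimal sketch of the "another table" option mentioned above, purely as an illustration (the table and column names are made up, and hot metadata such as page counts would stay in the existing image/filerevision fields):

```sql
-- Hypothetical side table for bulky, rarely-read metadata such as PDF/DjVu
-- text layers, keyed by file revision; it could live on a separate cluster
-- (e.g. x1), or be skipped entirely in favour of reading the text layer
-- from the file backend on demand.
CREATE TABLE filerevision_textlayer (
  frt_file_revision BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- the file revision this text belongs to
  frt_text MEDIUMBLOB NOT NULL                             -- extracted text layer, possibly compressed
);
```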

(We should probably create another ticket; this is derailing a bit.)