
Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys)
Open, Normal, Public

Description

The most recent version of each image is stored in the image table, with historic versions stored in the oldimage table. This structure resembles cur and old, with all the nastiness that comes with it. In MW 1.5 we ditched cur/old in favor of page/revision, but we never did a similar thing for files.

We talked about this a bit at the 2011 Amsterdam Hackathon and decided that, while it's complex and hairy, we want to do it at some point in the future.


RFC

In a nutshell

Currently, the oldimage table contains information about non-current file revisions, while the image table contains information about the current revision of each file as well as the unversioned information about the file in general. This means that to list all revisions of a file, MediaWiki has to query two tables. It also means that whenever a file gains a new revision, the "current" data has to be moved across from one table to the other.

The approved changes will rename oldimage to filerevision and add the missing rows for the current file revisions. The image (now: file) table can then be reduced to containing just the generic file information. This matches the way the page and revision tables have been organised in the database since MediaWiki 1.5 (before the page/revision tables existed, MW used cur/old tables, organised much like today's image tables, which is why we still have the notion of curid=pageid and oldid=revid). A rough DDL sketch of the resulting layout follows the list of approved changes below.

Approved changes

  • Add fields:
    • img_id: New primary key for image.
    • img_latest: Pointer to oldimage.oi_id for the current revision of a file (similar to how page.page_latest points to revision.rev_id)
    • oi_id: New primary key for oldimage.
    • oi_img: Pointer to image.img_id for the file this revision corresponds to (similar to how revision.rev_page points to page.page_id)
  • Rename tables:
    • Rename image to file.
    • Rename oldimage to filerevision.
  • Add missing rows in filerevision for current revisions, based on rows from image.
  • Reduce the fields in file to the minimum set of derived data we need indexed for the "current" revision. Currently indexed: img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime, img_size.
    • img_timestamp: Remove. Only used for file history; redundant now that the previously missing rows are present in filerevision.
    • img_user_text: Remove. Only used for file history (ApiQueryAllImages, SpecialMIMESearch); these queries can instead query or join filerevision directly.
    • img_sha1: Keep. Needed for reverse look-up in duplication detection on current file revisions.
    • img_media_type, img_major_mime, img_minor_mime: Keep. Needed for SpecialMIMESearch. Note that, while out of scope for this task, there has been talk of a separate RFC for removing these fields in favour of a SearchEngine-based approach.
    • img_size: Remove. Not needed as an indexed field. Only used by checkImages.php and Special:MediaStatistics (SUM query); these can query/join filerevision directly.
    • img_width, img_height: Remove. Not needed as indexed fields. They only existed because oldimage didn't contain the current revisions; they are now redundant with filerevision.
    • img_bits: Remove. Not needed. Redundant now with filerevision.
    • img_description: Remove. Not needed. Redundant now with filerevision.
    • img_user: Remove. Not needed. Redundant now with filerevision. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages), which could query/join filerevision instead (though, unrelated to this task, they should probably use recentchanges.rc_bot instead).
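To make the target layout concrete, here is a rough DDL sketch of what file and filerevision could look like after the migration. The field names img_id, img_latest, oi_id and oi_img come from the approved changes above; all other column names, types, defaults and indexes are illustrative assumptions to be settled during implementation and code review:

```sql
-- Illustrative sketch only; not the final schema patch.
CREATE TABLE file (
  img_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  img_name VARBINARY(255) NOT NULL,
  img_latest BIGINT UNSIGNED NOT NULL,              -- current revision, cf. page.page_latest
  img_sha1 VARBINARY(32) NOT NULL DEFAULT '',       -- kept: duplicate detection
  img_media_type VARBINARY(32) DEFAULT NULL,        -- kept: SpecialMIMESearch
  img_major_mime VARBINARY(32) NOT NULL DEFAULT 'unknown',
  img_minor_mime VARBINARY(100) NOT NULL DEFAULT 'unknown',
  PRIMARY KEY (img_id),
  UNIQUE KEY img_name (img_name),
  KEY img_sha1 (img_sha1),
  KEY img_mime (img_media_type, img_major_mime, img_minor_mime)
);

CREATE TABLE filerevision (
  oi_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  oi_img BIGINT UNSIGNED NOT NULL,                  -- owning file, cf. revision.rev_page
  oi_archive_name VARBINARY(255) NOT NULL DEFAULT '',
  oi_timestamp BINARY(14) NOT NULL,
  oi_user INT UNSIGNED NOT NULL DEFAULT 0,
  oi_user_text VARBINARY(255) NOT NULL,
  oi_description VARBINARY(767) NOT NULL DEFAULT '',
  oi_size INT UNSIGNED NOT NULL DEFAULT 0,
  oi_width INT NOT NULL DEFAULT 0,
  oi_height INT NOT NULL DEFAULT 0,
  oi_bits INT NOT NULL DEFAULT 0,
  oi_media_type VARBINARY(32) DEFAULT NULL,
  oi_major_mime VARBINARY(32) NOT NULL DEFAULT 'unknown',
  oi_minor_mime VARBINARY(100) NOT NULL DEFAULT 'unknown',
  oi_sha1 VARBINARY(32) NOT NULL DEFAULT '',
  oi_metadata MEDIUMBLOB NOT NULL,
  PRIMARY KEY (oi_id),
  KEY oi_img_timestamp (oi_img, oi_timestamp)
);
```

With a layout along these lines, a file's history becomes a single query on filerevision filtered by oi_img, and a new upload only inserts one filerevision row and updates img_latest, instead of moving a row from image to oldimage.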

Proposed migration strategy

The exact schema migration script is to be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default DB updater, and an opt-in maintenance script for large farms such as Wikimedia's). A few ideas so far:

  • Tim Starling, T589#2747454 (a rough SQL sketch of these steps follows this list):
    • Rename image to filerevision. Create a view called image.
    • Add new fields to filerevision.
    • Create file table.
    • (Disable uploads.)
    • Populate file from filerevision.
    • Also move oldimage rows into filerevision. (These will be made invisible from the image view.)
    • (Deploy new MediaWiki version that uses file/filerevision.)
    • (Re-enable uploads)
    • Drop image and oldimage.
  • Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.
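For illustration, the first few of Tim's steps might look roughly like this in MariaDB syntax. The column names fr_id, fr_file, file_id, file_name and file_latest, and the seeding queries, are hypothetical placeholders rather than part of the approved plan; the real migration script would be written and reviewed separately:

```sql
-- Illustrative only; the real migration script would be written and reviewed separately.
RENAME TABLE image TO filerevision;

-- Pass-through compatibility view so existing readers keep working; at this point
-- filerevision still contains only current revisions.
CREATE VIEW image AS SELECT * FROM filerevision;

-- New fields (placeholder names).
ALTER TABLE filerevision
  ADD COLUMN fr_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ADD COLUMN fr_file BIGINT UNSIGNED NOT NULL DEFAULT 0;

CREATE TABLE file (
  file_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  file_name VARBINARY(255) NOT NULL,
  file_latest BIGINT UNSIGNED NOT NULL DEFAULT 0,
  UNIQUE KEY file_name (file_name)
);

-- With uploads disabled: seed file from the (still current-only) filerevision rows...
INSERT INTO file (file_name, file_latest)
SELECT img_name, fr_id FROM filerevision;

-- ...and point each revision back at its file row.
UPDATE filerevision fr
JOIN file f ON f.file_name = fr.img_name
SET fr.fr_file = f.file_id;
```

Once the oldimage rows are merged into filerevision, the image view would need a filter (for example, joining file on file_latest) so those rows stay invisible to it, per the step above; the view goes away entirely when the new code is deployed and the old tables are dropped.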

Details

Reference
bz26741


Event Timeline

bzimport raised the priority of this task to Normal. Nov 21 2014, 11:23 PM
bzimport set Reference to bz26741.
Catrope created this task. Jan 15 2011, 10:41 AM

Adding keywords in the title for easier searching. Spent 4 minutes trying to find this.

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:41 PM
Danny_B moved this task from Unsorted to Cleanup on the Schema-change board. May 4 2016, 3:08 AM

This task is for the schema migration itself.

RFC T589 is about seeking consensus on whether we'll do this, and how.

Krinkle renamed this task from Clean up file tables (image and oldimage) database schema ('curr' and 'old' strike back!) to Migrate file tables to a modern layout (image/oldimage; file/file_revision). Jul 1 2016, 5:28 PM
Krinkle updated the task description.
Krinkle set Security to None.
Krinkle updated the task description.
Krinkle updated the task description. Feb 8 2017, 5:36 PM

Although this is approved, is anyone likely to work on it soon?

Trying to decide if T146568: Add a primary key to oldimage is worth the effort of getting done, and worth the DBAs deploying in the meantime (to potentially facilitate other schema changes). Chances are, if this task isn't going to be worked on soon, that one is more likely to be worth the effort.

For image, I would ask for an estimate of when the image tables will be reformatted (if anyone plans to work on it soon); if it is long-term, it may be reasonable to deploy a PK in the shorter term (it may even facilitate further schema changes). Edit: it already has img_name?

(read as oldimage, not image :))

Krinkle renamed this task from Migrate file tables to a modern layout (image/oldimage; file/file_revision) to Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys). Feb 8 2019, 6:15 PM

As an alternative, I proposed (to Timo, aloud) migrating straight to an MCR-based scheme (with a binaryPointer slot for the file reference) instead of jumping to this and then away again. I'll write it up into a task.

One comment, re: "add primary keys": image already has a primary key; oldimage doesn't. I am not saying Krinkle thinks that, but I have seen other people make the same mistake about our recommendation: we ask for every table to have a PK, we don't ask for it to be a numeric/id one. There are some reasons and advantages to doing that (in certain circumstances; in others there is no other way), but from the pure administration point of view, any PK, string or numeric, is good enough.

Krinkle added a comment. Edited Feb 8 2019, 6:42 PM

@Jdforrester-WMF Yeah, just note, though, that I also mentioned (per the original RFC discussion) that we believed at the time that it was infeasible to adopt MCR without first doing this intermediary stage, due to there not being primary keys on anything in oldimage. We may, however, be able to do a slightly more minimal version, but it might make maintenance more complicated and error-prone. I think it'll be very hard to do the kind of migration we did for the comment/actor tables with files without primary keys.


@jcrespo Thanks, I didn't realise that. I guess the unique index we used to have was easy to change to a PK. However, we don't want to use that as a foreign key from oldimage, given that files can be renamed. So while not needed for DBA purposes, we do still need a stable primary key img_id for image in addition to the img_name we have today.

I am also fearful of extreme normalization, as that can create scalability issues. I don't have any opinion, of course, until I see the proposal, but please keep me in the loop.

daniel added a comment. Feb 8 2019, 8:29 PM

If feasible, I'm in favor of porting the primary storage of image meta-data (including the filesystem path of the file) to MCR. The image table should probably be kept as a secondary storage of the relevant data of the current version of the image. oldimage and filearchive could both go away.

This means we have to inject rows into the revision table for oldimage (if no dummy revision exists for the upload). Similarly, we would have to inject rows into the archive table for rows in filearchive.

If we intend to drop these tables, it seems pointless to add primary keys to them now. On the other hand, migrating the data becomes a lot easier with primary keys in place. I'm not entirely sure the migration is possible without data loss if we don't have the primary keys in place.
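As a rough illustration of the backfill this implies, a hypothetical query along these lines could list old file revisions that have no matching dummy revision and would therefore need revision rows injected (column names are from the current schema; matching by upload timestamp is only an approximation):

```sql
-- Hypothetical check: old file revisions whose upload has no revision row
-- at the same timestamp on the file description page.
SELECT oi.oi_name, oi.oi_timestamp
FROM oldimage oi
JOIN page p
  ON p.page_namespace = 6               -- NS_FILE
 AND p.page_title = oi.oi_name
LEFT JOIN revision r
  ON r.rev_page = p.page_id
 AND r.rev_timestamp = oi.oi_timestamp
WHERE r.rev_id IS NULL;
```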

Tgr added a subscriber: Tgr. Feb 8 2019, 9:23 PM

We need the file metadata for the current version of the file and basic info like size and mime type for the old/deleted versions, so we'd still need two tables, a page analog and a content analog.

It's also not clear how revision deletion would work (right now you can delete description page revisions without affecting the file, and file revisions without affecting the page). So definitely needs a more detailed spec.

daniel added a comment. Edited Feb 9 2019, 4:37 AM
In T28741#4939578, @Tgr wrote:

We need the file metadata for the current version of the file and basic info like size and mime type for the old/deleted versions, so we'd still need two tables, a page analog and a content analog.

These would not be analogs; they would be page and content (and revision and slots, to glue these together). The image table itself would remain to provide access to some metadata without having to load the content object that contains that metadata.

It's also not clear how revision deletion would work (right now you can delete description page revisions without affecting the file, and file revisions without affecting the page). So definitely needs a more detailed spec.

The notion of "file revisions" would go away entirely. The upload history could still be displayed based on a filtered view of the page's revisions (by finding slots that have slot_origin = slot_revision_id and slot_role = FileMetaData).
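Concretely, that filtered view might be a query along these lines (a sketch only; the filemetadata role name and the example page id are hypothetical):

```sql
-- Hypothetical sketch: revisions of a file page at which the file slot itself
-- changed, i.e. the upload history under the MCR proposal.
SELECT rev.rev_id, rev.rev_timestamp
FROM revision rev
JOIN slots ON slot_revision_id = rev.rev_id
JOIN slot_roles ON role_id = slot_role_id
WHERE rev.rev_page = 12345                  -- the file description page (example id)
  AND role_name = 'filemetadata'            -- hypothetical slot role
  AND slot_origin = slot_revision_id        -- slot content originated in this revision
ORDER BY rev.rev_timestamp DESC;
```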

Tgr added a comment. Edited Feb 9 2019, 7:40 PM

These would not be analogs; they would be page and content (and revision and slots, to glue these together). The image table itself would remain to provide access to some metadata without having to load the content object that contains that metadata.

So just turn oldimage into a key-value store, with keys stored in content_address? I guess that would work, as long as we don't care about searching in file history (which we currently don't). Operations that need information about all old file versions at the same time (such as image deletion, or rendering the history section of the image description page) would become a little more expensive (in theory a lot more expensive in anomalous cases, where someone creates a file page with few file revisions but a huge number of wikitext revisions).

The notion of "file revisions" would go away entirely. The upload history could still be displayed based on a filtered view of the page's revisions (by finding slots that have slot_origin = slot_revision_id and slot_role = FileMetaData)

There is no efficient way to do that query, although I guess having to do it on a huge number of revisions is an edge case that can be safely ignored. More problematically, it would tie file revision deletion to description revision deletion, as I said (e.g. if you upload a new version of the file with a problematic non-free logo cropped out, you'd now have to delete the entire page history).

Tgr added a comment. Feb 9 2019, 7:42 PM

Also, file moves would take some thought: currently those change the name of the archive entries, I think, and content blobs are supposed to be immutable. Maybe we should do the switch to sha1 filenames first?

Tgr added a comment. Feb 9 2019, 7:44 PM

Also revision deletion would have to trigger file version deletion, but only if all revisions of the page that reference a certain version of the file have been deleted. That would have to be a whole new mechanism, and does not fit neatly into MCR which treats content objects as completely opaque.

In T28741#4940685, @Tgr wrote:

Also revision deletion would have to trigger file version deletion, but only if all revisions of the page that reference a certain version of the file have been deleted. That would have to be a whole new mechanism, and does not fit neatly into MCR which treats content objects as completely opaque.

Good catch, this indeed needs more thought.

jcrespo added a comment. Edited Feb 11 2019, 9:59 AM

As I said multiple times, hypernormalization would bring performance concerns: right now image and revision are handled in separate tables. Unless you also implement MCR sharding, we should try to keep separate concerns in separate tables. Let's not convert MediaWiki into Drupal. That doesn't mean image could not be a "slot", but leave metadata in its own separate table.

@jcrespo wrote:

As I said multiple times, hypernormalization would bring performance concerns: right now image and revision are handled in separate tables. Unless you also implement MCR sharding, we should try to keep separate concerns in separate tables. Let's not convert MediaWiki into Drupal. That doesn't mean image could not be a "slot", but leave metadata in its own separate table.

Currently, every upload creates a row in the oldimage (or file) table as well as in the revision table and the slots table (but not the content table). If we used MCR for storing the primary image metadata, every upload would create a row in the revision, slots, and content tables (as well as the image table, if the file did not previously exist). filearchive and oldimage would go away, and with them the need to copy rows between these tables.

The idea is indeed to leave metadata of the current revision in its own separate table, but as a secondary copy, not primary information; and to remove the need for maintaining it for old or archived revisions.

I'm not sure I understand correctly what you mean by MCR sharding. What table would need to be sharded in your opinion? And which tables would grow significantly more than they do now?