
Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys)
Open, Medium, Public

Description

The most recent version of each image is stored in the image table, with historic versions stored in the oldimage table. This structure resembles cur and old, with all the nastiness that comes with it. In MW 1.5 we ditched cur/old in favor of page/revision, but we never did a similar thing for files.

We talked about this a bit at the 2011 Amsterdam Hackathon and decided that, while it's complex and hairy, we want to do it at some point in the future.


RFC

In a nutshell

Currently, the oldimage table contains information about non-current file revisions, and the image table contains information about the current revision of each file as well as the unversioned information about the file in general. This means that to query all revisions, MediaWiki has to query two tables. It also means that whenever a file gains a new revision, the "current" data has to be moved across from one table to the other.

The approved changes will rename oldimage to filerevision and add the missing rows for the current file revisions. Then, the image (now: file) table can be reduced to just containing the generic file information. This matches the way the page and revision tables have been organised in the database since MediaWiki 1.5 (before the page/revision tables existed, MW used cur/old tables organised much like image/oldimage are today, which is why we still have a notion of curid=pageid and oldid=revid).
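
For illustration, a rough sketch of what this means for queries, assuming the new img_id/oi_img keys listed under "Approved changes" below (column names are the current ones and may well be renamed during implementation):

  -- Today: a file's full history requires combining two tables.
  SELECT img_name, img_timestamp, img_size FROM image WHERE img_name = 'Example.png'
  UNION ALL
  SELECT oi_name, oi_timestamp, oi_size FROM oldimage WHERE oi_name = 'Example.png'
  ORDER BY img_timestamp DESC;

  -- After the migration: one query against filerevision covers every revision,
  -- including the current one.
  SELECT oi_timestamp, oi_size
  FROM file
  JOIN filerevision ON oi_img = img_id
  WHERE img_name = 'Example.png'
  ORDER BY oi_timestamp DESC;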

Approved changes

  • Add fields:
    • img_id: New primary key for image.
    • img_latest: Pointer to oldimage.oi_id for the current revision of a file (similar to how page.page_latest points to revision.rev_id)
    • oi_id: New primary key for oldimage.
    • oi_img: Pointer to image.img_id for the file this revision corresponds to (similar to how revision.rev_page points to page.page_id)
  • Rename tables:
    • Rename image to file.
    • Rename oldimage to filerevision.
  • Add missing rows in filerevision for current revisions, based on rows from image.
  • Reduce the fields in file to the minimum set of derived data we need indexed for the "current" revision. Currently indexed: img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime, img_size.
    • img_timestamp: Remove. Only used for file history; redundant now that the previously missing rows are present in filerevision.
    • img_user_text: Remove. Only used for file history (ApiQueryAllImages, SpecialMIMESearch); these queries can query or join filerevision directly instead.
    • img_sha1: Keep. Needed for reverse look-up in duplicate detection on current file revisions.
    • img_media_type, img_major_mime, img_minor_mime: Keep. Needed for SpecialMIMESearch. Note that (while out of scope for this task) there has been talk about creating a separate RFC for removing these fields in favour of a SearchEngine-based approach.
    • img_size: Remove. Not needed as an indexed field. Only used by checkImages.php and Special:MediaStatistics (SUM query); these can query/join filerevision directly.
    • img_width, img_height: Remove. Not needed as indexed fields. They only existed because oldimage didn't contain the current revisions; they are now redundant with filerevision.
    • img_bits: Remove. Not needed. Redundant now with filerevision.
    • img_description: Remove. Not needed. Redundant now with filerevision.
    • img_user: Remove. Not needed. Redundant now with filerevision. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages), which could query/join filerevision instead (though these should probably use recentchanges.rc_bot instead; unrelated to this task).
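
Taken together, a rough sketch of where the approved changes would leave the two tables. This is a sketch only: types, defaults and index names are illustrative assumptions, non-indexed filerevision columns are elided, and the real DDL would come out of MediaWiki's schema tooling during implementation.

  -- Sketch of the target layout, not the actual DDL.
  CREATE TABLE file (
    img_id         BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,  -- new primary key
    img_name       VARBINARY(255) NOT NULL DEFAULT '',
    img_latest     BIGINT UNSIGNED NOT NULL DEFAULT 0,       -- current revision: filerevision.oi_id
    img_sha1       VARBINARY(32) NOT NULL DEFAULT '',        -- kept: duplicate detection
    img_media_type VARBINARY(16) DEFAULT NULL,               -- kept: SpecialMIMESearch
    img_major_mime VARBINARY(16) NOT NULL DEFAULT 'unknown',
    img_minor_mime VARBINARY(100) NOT NULL DEFAULT 'unknown',
    PRIMARY KEY (img_id),
    UNIQUE KEY img_name (img_name),
    KEY img_sha1 (img_sha1),
    KEY img_media_mime (img_media_type, img_major_mime, img_minor_mime)
  );

  CREATE TABLE filerevision (
    oi_id        BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,    -- new primary key
    oi_img       BIGINT UNSIGNED NOT NULL,                   -- owning file: file.img_id
    oi_name      VARBINARY(255) NOT NULL DEFAULT '',
    oi_timestamp BINARY(14) NOT NULL,
    -- ...the remaining oldimage fields (size, width, height, bits, description/actor,
    -- sha1, media type, MIME) carry over unchanged...
    PRIMARY KEY (oi_id),
    KEY oi_img_timestamp (oi_img, oi_timestamp)
  );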

Proposed migration strategy

Exact schema migration script to be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default db updater, and an opt-in maintenance script for large farms such as Wikimedia Foundation). A few ideas so far:

  • Tim Starling, T589#2747454:
    • Rename image to filerevision. Create a view called image.
    • Add new fields to filerevision.
    • Create file table.
    • (Disable uploads.)
    • Populate file from filerevision.
    • Also move oldimage rows into filerevision. (These will be made invisible from the image view.)
    • (Deploy new MediaWiki version that uses file/filerevision.)
    • (Re-enable uploads)
    • Drop image and oldimage.
  • Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.
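
For concreteness, a hedged sketch of how Tim's sequence could look in SQL. This is illustrative only: statements are simplified, column handling is glossed over, and the renamed tables are shown keeping their current img_*/oi_* column names. The actual script is to be written and reviewed as part of the implementation.

  -- 1. image becomes filerevision; a view under the old name keeps existing readers working.
  RENAME TABLE image TO filerevision;
  CREATE VIEW image AS SELECT * FROM filerevision;

  -- 2. Add the new fields (per the approved changes above).
  ALTER TABLE filerevision
    ADD COLUMN oi_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ADD COLUMN oi_img BIGINT UNSIGNED NOT NULL DEFAULT 0;

  -- 3. Create file (see the schema sketch above) and populate it from filerevision,
  --    which at this point holds exactly the current revisions.
  INSERT INTO file (img_name, img_latest, img_sha1, img_media_type, img_major_mime, img_minor_mime)
    SELECT img_name, oi_id, img_sha1, img_media_type, img_major_mime, img_minor_mime
    FROM filerevision;
  UPDATE filerevision fr JOIN file f ON f.img_name = fr.img_name SET fr.oi_img = f.img_id;

  -- 4. (Uploads disabled.) Copy the historic oldimage rows across, then restrict the
  --    compatibility view so those rows stay invisible to old readers.
  INSERT INTO filerevision (oi_img, img_name, img_timestamp /* , ...remaining columns... */)
    SELECT f.img_id, oi.oi_name, oi.oi_timestamp /* , ... */
    FROM oldimage oi JOIN file f ON f.img_name = oi.oi_name;
  CREATE OR REPLACE VIEW image AS
    SELECT fr.* FROM filerevision fr JOIN file f ON f.img_latest = fr.oi_id;

  -- 5. Deploy the MediaWiki version that reads file/filerevision, re-enable uploads, then:
  DROP VIEW image;
  DROP TABLE oldimage;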

Event Timeline

Replying here to @Marostegui's comment on the new ticket (T275268#6847723):

I understand why size-on-disk is a problem, and I can think of several ways to improve that.

What I am not clear on is to what extent number-of-rows would still be a problem if bytes-per-row were low (perhaps under 100 bytes).

In other words, is it sufficient to focus on normalization, or do we need to look into sharding as well? If we do one, which is preferable? How can we decide if we need both?

I don't think that pulling the original every time is desirable; it would cause a lot of unnecessary internal network traffic. Some of those documents are in the hundreds of MB. … I also don't know how that metadata can be distributed inside PDFs, a format famous for having a lot of ways things can be done.

Generating thumbnails from PDFs can in some cases take ~20 seconds for a single page (it's executing ghostscript and "printing" the thumbnail). Even if the text layer in the PDF is stored in the page, instead of at the end of the file or whatever, there is a definite risk that extracting it will have "non-interactive" latency.

For DjVu this would probably be reasonable from that perspective: the text is stored with the page, and DjVuLibre tools can access it in both deterministic and relatively performant amounts of time (not sure about CPU/RAM efficiency, but I don't think the volume is large enough to make that concern relevant).

Dropping a note here after a chat with @Cparle that we'd ideally want to add the proposed new column for perceptual hashes (T121797 / T251026) as part of this work, rather than migrating the tables twice.

Ladsgroup added a project: Data-Persistence.

In the words of esteemed software architect, P!nk: "[Let's] get the party started"

Another thing to potentially work on together (similar to perceptual hashes) is T158989: Add img_sha256, related columns with SHA-256 hash of file.

Hi @Bugreporter, thank you for your comments. I am aware of the tickets you brought up, but most of them won't be implemented as part of this migration.

Note: the proposed new schema ties media type to the media file, not the revision. It will be awkward if a user uploads a new revision of a file with the same file name (and thus file extension) but a different media type, since in that case the media type is not updated. For example, videos can be uploaded using the .ogg extension, though there are bots on Commons that move them.

Krinkle renamed this task from "Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys)" to "Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys)". Jan 9 2025, 6:36 PM

Change #1111240 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] file: Move filerevision insert within endAtomic() block

https://gerrit.wikimedia.org/r/1111240

Change #660422 abandoned by Ladsgroup:

[mediawiki/core@master] Introduce img_id column

Reason:

Not needed

https://gerrit.wikimedia.org/r/660422

Note: the proposed new schema ties media type to the media file, not the revision. It will be awkward if a user uploads a new revision of a file with the same file name (and thus file extension) but a different media type, since in that case the media type is not updated. For example, videos can be uploaded using the .ogg extension, though there are bots on Commons that move them.

This has always been the case though, so I don't see it as a problem, and it is why we have T6421 as well.

I do agree that the new layout may cause a bit of an issue when we need to render thumbnails of an old revision. I think that in that case we will have to reparse the old file revision to retrieve its correct actual media type, instead of relying on the already-known type. And the same might come into play when you revert between two revisions and the file type has changed.

Has anyone looked at how to support thumbnailing for old file revisions with the new structure? If I understand the tickets correctly, the earlier concept had the type at the file revision AND the file table, but I'm not entirely sure.

Change #1111240 merged by jenkins-bot:

[mediawiki/core@master] file: Move filerevision insert within endAtomic() block

https://gerrit.wikimedia.org/r/1111240

Shouldn't we just disallow re-uploading a new file with a different media type than the previous one? E.g. we don't allow re-upload of the exact same image (sha1 check); a similar check could be added there. Of course the problem remains for existing old revisions, but it shouldn't be too hard to sort out.

Ideally yes, but sometimes types change. Especially with videos, where we rely on the information being present within the first X bytes; when it isn't, we fall back to a generic or audio type, for instance. Then if you modify the file to ensure the information is at the front, it becomes a different type.

So yes, we can do this; I just don't have a good overview of what kind of impact that would create. And of course we do still have the historic entries.

I've looked at some numbers; it seems only 0.05% of files have more than one distinct triplet of oi_media_type, oi_major_mime, oi_minor_mime in their history, and most of those shouldn't break thumbnailing at all. So I'm inclined not to address the problem immediately.
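
A figure like that can be obtained with a query roughly along these lines against the current layout (illustrative only; not necessarily the exact query that was run):

  -- Files whose history contains more than one distinct (media type, major MIME, minor MIME) triplet.
  SELECT img_name
  FROM (
    SELECT img_name, img_media_type, img_major_mime, img_minor_mime FROM image
    UNION
    SELECT oi_name, oi_media_type, oi_major_mime, oi_minor_mime FROM oldimage
  ) AS all_revisions
  GROUP BY img_name
  HAVING COUNT(*) > 1;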

Personal checklist of stuff to do:

  • Get the migration script patch merged
  • Run it in beta cluster and make sure it works as expected
  • Write tests for write both in LocalFile and get it merged
  • Deploy the file tables into production
  • start write both in testwikis and a couple small ones
  • Communicate to the communities that these changes are coming
  • Run the migration script on the small and test wikis
  • Enable write both everywhere
  • Start the migration script everywhere
  • Start building support for read new
  • switch beta cluster to read new
  • switch small wikis to read new
  • switch it everywhere
  • add support for stopping writes to the old tables
  • flip the switch
  • drop the old tables
  • drop the old code and migration config

(between each step there will be bugs discovered and addressed)

Note: after this we may want to introduce a FileRevision class to replace the OldLocalFile class.

@Ladsgroup do you have a timeline for this migration?

@mforns any conflicts in this migration for Commons-Impact-Metrics?

@Ladsgroup do you have a timeline for this migration?

It depends on how fast the migration script goes. Since we are waiting for the text table migration to finish before starting the image table migration on Commons, I think the data will be migrated in two months.

I suggest writing the code behind a feature flag, ready to flip the switch once the data migration is done, since we can't wait long with both schemas in place.

For most other wikis, the data migration is done already.

Does this mean that the image table dump will automatically change after the migration to the new schema?