Page MenuHomePhabricator

Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys)
Open, MediumPublic

Description

The most recent version of each image is stored in the image table, with historic versions stored in the oldimage table. This structure resembles cur and old, with all the nastiness that comes with it. In MW 1.5 we ditched cur/old in favor of page/revision, but we never did a similar thing for files.

We talked about this a bit at the 2011 Amsterdam Hackathon and decided that, while it's complex and hairy, we want to do it at some point in the future.


RFC

In a nut shell

Currently, the oldimage table contains information about non-current file revisions, and the image table contains information about the current file revisions as well as the unversioned information about a file in general. This means that to query all revisions, MediaWiki has to query two tables. It also means that whenever a file gains a new revision, the "current" data has to be moved across from one table to another.

The approved changes will rename oldimage to filerevision and add the missing rows for the current file revisions. Then, the image (now: file) table can be reduced to just containing the generic file information. This matches the way the page and revision tables have been organised in the database since MediaWiki 1.5 (before the page/revision tables existed, MW used curpage/oldpage tables which is similar to these old image tables, and is why today we still have a notion of curid=pageid and oldid=revid).

Approved changes

  • Add fields:
    • img_id: New primary key for image.
    • img_latest: Pointer to oldimage.oi_id for the current revision of a file (similar to how page.page_latest points to revision.rev_id)
    • oi_id: New primary key for oldimage.
    • oi_img: Pointer to image.img_id for the file this revision corresponds to (similar to how revision.rev_page points to page.page_id)
  • Rename tables:
    • Rename image to file.
    • Rename oldimage to filerevision.
  • Add missing rows in filerevision for current revisions, based on rows from image.
  • Reduce fields in file to only be the minimum amount of derived data we need indexed for the "current" revision. Currently indexed:  img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime, img_size.
    • img_timestamp: Remove. Only used for file history, redundant now with the missing rows now present in filerevision.
    • img_user_text: Remove. Only used for file history (ApiQueryAllImages, SpecialMIMEsearch), these queries can instead query or join filerevision directly.
    • img_sha1: Keep. Needed for reverse look-up in duplication detection on current file revisions.
    • img_media_type img_major_mime, img_minor_mime: Keep. Needed for SpecialMIMESearch. Note that (while out of scope of this task), there has been talk about creating a separate RFC for removing these fields in favour of a SearchEngine-based approach.
    • img_size: Remove. Not needed as indexed field. Only used by checkImages.php and Special:MediaStatistics (SUM query), these can directly query/join against filerevision.
    • img_width, img_height: Remove. Not needed as indexed field. Only existed because oldimage didn't contain current revisions, these fields are redundant in filerevision.
    • img_bits: Remove. Not needed. Redundant now with filerevision.
    • img_description: Remove. Not needed. Redundant now with filerevision.
    • img_user: Remove. Not needed. Redundant now with filerevision. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages) which could query/join filerevision instead. Though, should probably use recentchanges.rc_bot instead. (unrelated)

Proposed migration strategy

Exact schema migration script to be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default db updater, and an opt-in maintenance script for large farms such as Wikimedia Foundation). A few ideas so far:

  • Tim Starling, T589#2747454:
    • Rename image to filerevision. Create a view called image.
    • Add new fields to filerevision.
    • Create file table.
    • (Disable uploads.)
    • Populate file from filerevision.
    • Also move oldimage rows into filerevision. (These will be made invisible from the image view.)
    • (Deploy new MediaWiki version that uses file/filerevision.)
    • (Re-enable uploads)
    • Drop image and oldimage.
  • Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Dropping a note here after a chat with @Cparle that we'd ideally want to add the proposed new column for perceptual hashes (T121797 / T251026) as part of this work, rather than migrating the tables twice.

Ladsgroup added a project: Data-Persistence.

In the words of esteemed software architect, P!nk: "[Let's] get the party started"

Another things to potentially work on together (similar to perceptual hashes) is T158989: Add img_sha256, related columns with SHA-256 hash of file.

Hi @Bugreporter thank you for your comments. I am aware of the tickets you brought up but for most of them, they won't be implemented in this migration.

Note: the proposed new schema tie media type to media file, not revision. It will be awkward if user uploads a new revision of file with same file name (and thus file extension) but different media type, since at that time media type is not updated. For example, videos can be uploaded using .ogg extension, though there are bots to move them in Commons

Krinkle renamed this task from Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys) to Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys).Jan 9 2025, 6:36 PM

Change #1111240 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] file: Move filerevision insert within endAtomic() block

https://gerrit.wikimedia.org/r/1111240

Change #660422 abandoned by Ladsgroup:

[mediawiki/core@master] Introduce img_id column

Reason:

Not needed

https://gerrit.wikimedia.org/r/660422

Note: the proposed new schema tie media type to media file, not revision. It will be awkward if user uploads a new revision of file with same file name (and thus file extension) but different media type, since at that time media type is not updated. For example, videos can be uploaded using .ogg extension, though there are bots to move them in Commons

This has always been the case though, so I don't see it as a problem, and it is why we have T6421 as well.

I do agree that the new layout may cause a bit of an issue when we need to render thumbnails of an old revision. I think that in that case, we will have to reparse the old file revision, to retrieve its correct actual media type, instead of relying what is the already known type.. And the same might come into play when you revert between two revisions and the file type has changed.

Has anyone looked at how to support thumb nailing for old file revisions, with the new structure ? If I understand the tickets correctly, the earlier concept had the type at the file revision AND the file table, but I'm not entirely sure.

Change #1111240 merged by jenkins-bot:

[mediawiki/core@master] file: Move filerevision insert within endAtomic() block

https://gerrit.wikimedia.org/r/1111240

Note: the proposed new schema tie media type to media file, not revision. It will be awkward if user uploads a new revision of file with same file name (and thus file extension) but different media type, since at that time media type is not updated. For example, videos can be uploaded using .ogg extension, though there are bots to move them in Commons

This has always been the case though, so I don't see it as a problem, and it is why we have T6421 as well.

I do agree that the new layout may cause a bit of an issue when we need to render thumbnails of an old revision. I think that in that case, we will have to reparse the old file revision, to retrieve its correct actual media type, instead of relying what is the already known type.. And the same might come into play when you revert between two revisions and the file type has changed.

Has anyone looked at how to support thumb nailing for old file revisions, with the new structure ? If I understand the tickets correctly, the earlier concept had the type at the file revision AND the file table, but I'm not entirely sure.

Shouldn't we just simply disallow re-uploading a new file with a different media type than the previous one? e.g. we don't allow for reupload of the exact same image (sha1 check). That can be added there. Of course the problem of old images with this problem remains, but it shouldn't be too hard to sort out.

Shouldn't we just simply disallow re-uploading a new file with a different media type than the previous one? e.g. we don't allow for reupload of the exact same image (sha1 check). That can be added there. Of course the problem of old images with this problem remains, but it shouldn't be too hard to sort out.

Ideally yes, but sometimes types change. Especially with the videos, where we rely on the information to be present with the first X bytes. When it isn't we fallback to a generic or audio type for instance. Then if you modify the file, to ensure the information is at the front of the file, it becomes a different type.

So yes we can do this, I just don't have a good overview of what kind of impact that would create. And of course we indeed have the historic entries.

Shouldn't we just simply disallow re-uploading a new file with a different media type than the previous one? e.g. we don't allow for reupload of the exact same image (sha1 check). That can be added there. Of course the problem of old images with this problem remains, but it shouldn't be too hard to sort out.

Ideally yes, but sometimes types change. Especially with the videos, where we rely on the information to be present with the first X bytes. When it isn't we fallback to a generic or audio type for instance. Then if you modify the file, to ensure the information is at the front of the file, it becomes a different type.

So yes we can do this, I just don't have a good overview of what kind of impact that would create. And of course we indeed have the historic entries.

I've looked at some numbers, it seems only 0.05% of the files have more than one triplet of oi_media_type, oi_major_mime, oi_minor_mime in their history. And most of them shouldn't break the thumbnailing at all. So I'm inclined not to attend to the problem immediately.

Personal checklist of stuff to do:

  • Get the migration script patch merged
  • Run it in beta cluster and make sure it works as expected
  • Write tests for write both in LocalFile and get it merged
  • Deploy the file tables into production
  • start write both in testwikis and a couple small ones
  • Communicate to the communities that these changes are coming
  • Run the migration script on the small and test wikis
  • Enable write both more everywhere
  • Start the migration script everywhere
  • Start building support for read new
  • switch beta cluster to read new
  • switch small wikis to read new
  • switch it everywhere
  • add support for stopping to the old tables
  • flip the switch
  • drop the old tables
  • drop the old code and migration config

(between each step there will be bugs discovered and addressed)

Note after this we may want to introduce a FileRevision class to replace the OldLocalFile class.

@Ladsgroup do you have a timeline for this migration?

@mforns any conflicts in this migration for Commons-Impact-Metrics?

@Ladsgroup do you have a timeline for this migration?

It depends on how fast the migration script will go. Since we are waiting for text table migration to finish before starting image table migration on commons. I think the data will be migrated in two months.

I suggest writing the code behind a feature flag and ready to flip the switch once the migration of data is done since we can't wait long with both data schema in place.

For most other wikis, the data migration is done already.

Does this mean that img table dump will change after migration to new schema automatically?

@Ladsgroup do you have a timeline for this migration?

It depends on how fast the migration script will go. Since we are waiting for text table migration to finish before starting image table migration on commons. I think the data will be migrated in two months.

I suggest writing the code behind a feature flag and ready to flip the switch once the migration of data is done since we can't wait long with both data schema in place.

For most other wikis, the data migration is done already.

See T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) and T382778: Optimize text table.

I am trying to fix my tools to use the new tables, once available. I am looking at the DB replicas, which have them enables but not complete. I have some questions which I did not find answered above:

  • How do I get the img_user_text via the filerevison table? Which field is the ID to that, and which table would it reference?
  • The fr_timestamp field appears to be an integer. How do I convert that to the previous text format, preferably within the query itself?
  • What table does the fr_description_id reference?

Thanks!

I am trying to fix my tools to use the new tables, once available. I am looking at the DB replicas, which have them enables but not complete. I have some questions which I did not find answered above:

  • How do I get the img_user_text via the filerevison table? Which field is the ID to that, and which table would it reference?

This is part of the decade-long move of these names into the actor concept (img_actor in image). You should be able to get the text via JOIN actor ON fr_actor = actor_id.

  • The fr_timestamp field appears to be an integer. How do I convert that to the previous text format, preferably within the query itself?

The fr_timestamp field is an mwtimestamp, same as used pretty much everywhere, including img_timestamp. Were you using a view that converted that to a different format?

  • What table does the fr_description_id reference?

It's the replacement for img_description_id, and refers to the same table, comment.

For more details, you can see the just-created pages https://www.mediawiki.org/wiki/Manual:File_table and https://www.mediawiki.org/wiki/Manual:Filerevision_table (and https://www.mediawiki.org/wiki/Manual:Filetypes_table) – sorry about the lack of documentation yet.

@Jdforrester-WMF Thanks!

It appears fr_actor is not exposed on the replica; I would have made the connection otherwise. As it is, I can not test any related new code because of this.

It appears fr_actor is not exposed on the replica; I would have made the connection otherwise. As it is, I can not test any related new code because of this.

Will be fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128041.

Please wait until all subtasks are closed before closing this task.

@Ladsgroup: Reported at T389586: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry moving a file; Function: LocalFileMoveBatch::doDBUpdates

Currently moving a file to a name that is used by a deleted file is broken, since it will change file_name to a name that already exists, violating unique key constraint. Until we also move deleted file revisions to file/filerevision table (i.e. kill filearchive table, which is not part of current work), we can just drop the old file row related to the target name. (But once filearchive is killed, we also need to migrate fr_file to unify two records, plus we need to handle history spliting, etc.)

PEPE1234.13 renamed this task from Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys) to Migrate fmage; file/filerevision; ad.Sep 6 2025, 1:28 PM
PEPE1234.13 closed this task as Invalid.
PEPE1234.13 removed Ladsgroup as the assignee of this task.
PEPE1234.13 raised the priority of this task from Medium to Unbreak Now!.
PEPE1234.13 removed subscribers: Wellverywell, Magnus, zdev and 51 others.
Aklapper assigned this task to Ladsgroup.
Aklapper lowered the priority of this task from Unbreak Now! to Medium.
Aklapper edited subscribers, added: Wellverywell, Magnus, zdev and 46 others; removed: VirginiaPoundstone.
TheDJ renamed this task from Migrate fmage; file/filerevision; ad to Migrate file tables to a modern layout (image/oldimage; file/filerevision; add primary keys).Sep 6 2025, 2:36 PM