Design and merge the new tables of file tables
Closed, Resolved; Public

Description

There will be three new tables:

  • file
  • filerevision
  • deleted_files
  • (more?)

The details of the schema need to be hashed out, added, and merged, preferably with a proof of concept (POC) so that reads and writes can be tried locally to see what it looks like.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ladsgroup triaged this task as Medium priority. Jun 20 2024, 10:26 PM
Ladsgroup moved this task from Triage to In progress on the DBA board.

deleted_files

Note that we currently do not use a table to store deleted pages. One of the solutions in T20493: RFC: Unify the various deletion systems represents deleted pages using a bit field, so there is no need for a deleted pages (or archive/deleted revisions) table. Similarly, we can use a bit field to indicate whether a file is deleted. This also has the benefit of keeping the (upcoming) file ID across deletion and undeletion.
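To illustrate the bit-field idea, here is a minimal sketch. The `DELETED_BIT` constant, the `deleted_bits` field, and the helper names are hypothetical and not taken from the task or from MediaWiki; the point is only that deletion toggles a bit and leaves the row, and therefore the file ID, in place.

```python
from dataclasses import dataclass

# Hypothetical bit layout, for illustration only.
DELETED_BIT = 1 << 0


@dataclass
class FileRow:
    file_id: int
    file_name: str
    deleted_bits: int = 0  # 0 means the file is visible


def delete(row: FileRow) -> None:
    """Mark the file as deleted without removing the row."""
    row.deleted_bits |= DELETED_BIT


def undelete(row: FileRow) -> None:
    """Clear the deletion bit; any other bits are left untouched."""
    row.deleted_bits &= ~DELETED_BIT


def is_deleted(row: FileRow) -> bool:
    return bool(row.deleted_bits & DELETED_BIT)
```

Because no row is moved to a separate table, the same `file_id` is seen before deletion, while deleted, and after undeletion.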

Per T28741#9912401, we may want a new table to store the normalized img_media_type, img_major_mime and img_minor_mime.
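A rough sketch of what such a normalization table could look like, using SQLite from Python so it can be tried locally. The table name `filetypes`, the `ft_` column names, and the `acquire_type_id` helper are assumptions made for this example, not something decided in this task.

```python
import sqlite3


def create_type_table(conn: sqlite3.Connection) -> None:
    # Assumed layout: one row per distinct (media type, major MIME, minor MIME).
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS filetypes (
            ft_id INTEGER PRIMARY KEY,
            ft_media_type TEXT NOT NULL,
            ft_major_mime TEXT NOT NULL,
            ft_minor_mime TEXT NOT NULL,
            UNIQUE (ft_media_type, ft_major_mime, ft_minor_mime)
        )
        """
    )


def acquire_type_id(conn: sqlite3.Connection, media_type: str,
                    major_mime: str, minor_mime: str) -> int:
    """Return the ID for a type triple, inserting it on first use."""
    key = (media_type, major_mime, minor_mime)
    conn.execute(
        "INSERT OR IGNORE INTO filetypes "
        "(ft_media_type, ft_major_mime, ft_minor_mime) VALUES (?, ?, ?)",
        key,
    )
    row = conn.execute(
        "SELECT ft_id FROM filetypes WHERE ft_media_type = ? "
        "AND ft_major_mime = ? AND ft_minor_mime = ?",
        key,
    ).fetchone()
    return row[0]
```

Other tables would then store only the small integer ID instead of repeating the three strings on every row.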

Per the parent task, the needed columns are:
file table:

  • file_id
  • file_latest
  • file_name
  • file_type (normalized type)
  • file_delete

filerevision:

  • fr_id
  • fr_file
  • fr_archive_name (if it cannot be generated automatically)
  • fr_size
  • fr_width
  • fr_height
  • fr_bits
  • fr_description_id
  • fr_actor
  • fr_timestamp
  • fr_metadata
  • fr_type (normalized type)
  • fr_deleted (for revdel)
  • fr_sha1 (if we need to keep backwards compatibility)
  • fr_delete (for normal deletion, unless unified with revdel - see T20493)
  • fr_sha256
  • fr_perceptual_hash
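For anyone who wants to try reads and writes locally, the column lists above can be turned into a throwaway SQLite schema. This is only a sketch: the column names come from the lists, but every type, default, and constraint is an assumption, and the `record_upload` helper is hypothetical.

```python
import sqlite3

# Column names follow the lists above; types and constraints are guesses.
SCHEMA = """
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY,
    file_latest INTEGER NOT NULL DEFAULT 0,
    file_name TEXT NOT NULL UNIQUE,
    file_type INTEGER,
    file_delete INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE filerevision (
    fr_id INTEGER PRIMARY KEY,
    fr_file INTEGER NOT NULL REFERENCES file (file_id),
    fr_archive_name TEXT,
    fr_size INTEGER NOT NULL,
    fr_width INTEGER,
    fr_height INTEGER,
    fr_bits INTEGER,
    fr_description_id INTEGER,
    fr_actor INTEGER NOT NULL,
    fr_timestamp TEXT NOT NULL,
    fr_metadata BLOB,
    fr_type INTEGER,
    fr_deleted INTEGER NOT NULL DEFAULT 0,
    fr_sha1 TEXT,
    fr_delete INTEGER NOT NULL DEFAULT 0,
    fr_sha256 TEXT,
    fr_perceptual_hash TEXT
);
"""


def record_upload(conn, name, size, actor, timestamp):
    """Add a revision for `name` and point file_latest at it."""
    with conn:  # one transaction: commit on success, roll back on error
        # Create the file row on first upload; later uploads reuse it.
        conn.execute("INSERT OR IGNORE INTO file (file_name) VALUES (?)", (name,))
        file_id = conn.execute(
            "SELECT file_id FROM file WHERE file_name = ?", (name,)
        ).fetchone()[0]
        cur = conn.execute(
            "INSERT INTO filerevision (fr_file, fr_size, fr_actor, fr_timestamp) "
            "VALUES (?, ?, ?, ?)",
            (file_id, size, actor, timestamp),
        )
        conn.execute(
            "UPDATE file SET file_latest = ? WHERE file_id = ?",
            (cur.lastrowid, file_id),
        )
    return file_id, cur.lastrowid
```

A quick local check is to run `conn = sqlite3.connect(":memory:")` and `conn.executescript(SCHEMA)`, then call `record_upload` twice with the same name: the file row is reused and `file_latest` moves to the newer revision.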

This might affect some data we sqoop into HDFS and some of how we compute commons impact metrics or similar future metrics. We have to wait until a schema change is proposed to know for sure.

I had to deprioritize this for a bit to deal with the aftermath of the outages. I will get back to it next week.

Change #1091477 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [WIP] New schema of file tables

https://gerrit.wikimedia.org/r/1091477

Change #1100125 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [WIP] file: Basic support for writing to the new file tables

https://gerrit.wikimedia.org/r/1100125

Note: the proposed migration path does not seem to functionally separate deletion and revdel. See T20493#10389320 for why this is a bad idea for page.

I am aware. That's why the patch is WIP.

Change #1091477 merged by jenkins-bot:

[mediawiki/core@master] schema: Introduce file table

https://gerrit.wikimedia.org/r/1091477

Change #1100125 merged by jenkins-bot:

[mediawiki/core@master] file: Basic support for writing to the new file tables

https://gerrit.wikimedia.org/r/1100125

Ladsgroup moved this task from In progress to Done on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2025-01-22T12:33:59Z] <Amir1> creating new schema of file tables everywhere (T368113)