
RFC: image and oldimage tables
Closed, Resolved (Public)

Description

The image and oldimage tables RfC proposes to change the database layout from having two tables for "current" and "old" image (file) records to having one table for files and one for file revisions.

Quoting T28741

The most recent version of each image is stored in the image table, with historic versions stored in the oldimage table. This structure resembles cur and old, with all the nastiness that comes with it. In MW 1.5 we ditched cur/old in favor of page/revision, but we never did a similar thing for files.

We talked about this a bit at the 2011 Amsterdam Hackathon and decided that, while it's complex and hairy, we want to do it at some point in the future.

RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables


Event Timeline


These are 2 ancient rules (that I have just made up) regarding database design:

  • Do not delete rows
  • Do not move rows between tables

Specially, INSERT ... SELECT is bad (either it requires a lot of locking, or it is unsafe; we are in the second case, for performance reasons), and MySQL already gives a warning every time revisions are archived and unarchived.
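For illustration, the row move that happens today when a file is re-uploaded looks roughly like this (a simplified sketch with trimmed column lists, not the exact queries MediaWiki runs):

  -- Copy the current row into oldimage with INSERT ... SELECT,
  -- archiving it under a timestamp-prefixed name.
  INSERT INTO oldimage (oi_name, oi_archive_name, oi_size, oi_timestamp)
  SELECT img_name, CONCAT('20160713000000!', img_name), img_size, img_timestamp
  FROM image
  WHERE img_name = 'Example.jpg';

  -- ...followed by overwriting the image row with the new upload's data.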

Given that the parent task says "This structure resembles cur and old, with all the nastiness that comes with it", I probably do not need to convince anybody of how bad that is. But if I had to, just have a look at bugs such as T139346 and T135851#2312924.

@Krinkle, I don't remember if you explicitly volunteered to shepherd this one, but since you seem to be the sponsor for this change, I'm assigning this to you.

My two cents on integrating file history with page history (T96384), using Multi-Content-Revisions (T107595):

  1. When a new version of a file is uploaded, a new revision of the file description page is created, in which the "file" slot is updated (optionally, the "main" slot with the wikitext would also be updated).
  2. The "file" slot has a content model called "uploaded-media" (or just "file"), which consists of structured meta-data (json?) about the uploaded file.
  3. The most important information is the internal identifier of the file, which can be used to find this version of the file both on disk and via HTTP. Additional information about the file should probably be included, like size, mime type, hash (enabling migration to content-addressable storage!), and maybe even extracted meta-data (exif).
  4. This requires internal and external identifiers that remain stable. 'Vulpes_vulpes_standing_in_snow.jpg' is not sufficient; it would have to be something like '20040912170154!Vulpes_vulpes_standing_in_snow.jpg' from the start.
  5. We could achieve this by allowing such "archive IDs" to also work on the current version. So '20040912170158!Vulpes_vulpes_standing_in_snow.jpg' should currently be an alias for the plain 'Vulpes_vulpes_standing_in_snow.jpg' (or the other way around).
  6. For display, ImagePage will have to know about the "file" slot. Some API modules will probably also have to be aware of this.

I think all the information that is currently in the image table can be managed as "slot content", but we should keep the image table as a secondary table for quick queries by size or mime type. The image table would become a "derived" table like the link tables, that gets updated when content (of the file slot) gets saved.
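As a rough illustration of the "derived table" idea (hypothetical values; in practice the update would be driven by the parsed content of the file slot, and the column list is trimmed):

  -- Refresh the derived image row whenever the "file" slot is saved.
  REPLACE INTO image (img_name, img_size, img_media_type,
                      img_major_mime, img_minor_mime, img_sha1)
  VALUES ('Example.jpg', 1048576, 'BITMAP', 'image', 'jpeg',
          'phoiac9h4m842xq45sp7s6u21eteeq1');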

The oldimage table however can be dropped, after creating "fake" revisions for each upload, with the necessary information in the file slot. The same is possible for the filearchive table, with the difference that the "fake" revisions would be created in the archive table.

The result would be:

  • oldimage and filearchive table can be removed
  • full integration of file uploads with the page history, including patrolling, reverts, undeletion, etc
  • image table is still available for queries
  • file metadata would become available in XML dumps.

My two cents on integrating file history with page history (T96384), using Multi-Content-Revisions (T107595):

  1. When a new version of a file is uploaded, a new revision of the file description page is created, in which the "file" slot is updated (optionally, the "main" slot with the wikitext would also be updated).
  2. The "file" slot has a content model called "uploaded-media" (or just "file"), which consists of structured meta-data (json?) about the uploaded file.

+1

  3. The most important information is the internal identifier of the file, which can be used to find this version of the file both on disk and via HTTP. Additional information about the file should probably be included, like size, mime type, hash (enabling migration to content-addressable storage!), and maybe even extracted meta-data (exif).

Potentially some overlap with derived data (derived from the file), but consistent storage is of course nice.

  4. This requires internal and external identifiers that remain stable. 'Vulpes_vulpes_standing_in_snow.jpg' is not sufficient; it would have to be something like '20040912170154!Vulpes_vulpes_standing_in_snow.jpg' from the start.
  5. We could achieve this by allowing such "archive IDs" to also work on the current version. So '20040912170158!Vulpes_vulpes_standing_in_snow.jpg' should currently be an alias for the plain 'Vulpes_vulpes_standing_in_snow.jpg' (or the other way around).

Ah, this comes back to needing permanent storage ids; I'm not sure I like the aliasing idea though it may work more simply than forcing an immediate migration...

  6. For display, ImagePage will have to know about the "file" slot. Some API modules will probably also have to be aware of this.

I think all the information that is currently in the image table can be managed as "slot content", but we should keep the image table as a secondary table for quick queries by size or mime type. The image table would become a "derived" table like the link tables, that gets updated when content (of the file slot) gets saved.

Sounds sensible... the image table is mostly used for metadata lookups, but does get used for listings. I think mime type searches could be done more effectively by tagging the search index on the pages than by anything we do with the image table (unless you're just pulling a list of all files of a certain type, which is probably rare).

The oldimage table however can be dropped, after creating "fake" revisions for each upload, with the necessary information in the file slot. The same is possible for the filearchive table, with the difference that the "fake" revisions would be created in the archive table.

These'll need to either stay for old items, or stay long enough for a migration.

The result would be:

  • oldimage and filearchive table can be removed
  • full integration of file uploads with the page history, including patrolling, reverts, undeletion, etc
  • image table is still available for queries
  • file metadata would become available in XML dumps.

Whee! :)

The RFC could use a more detailed problem statement, which would help to establish a scope. For example, to me it is not entirely clear whether the issues and solutions described in T66214: Define an official thumb API are in scope or not, or how those two proposals would interact.

Was handling of image revisions via multi-content revisions discussed during yesterday's IRC session?

The RFC could use a more detailed problem statement, which would help to establish a scope. For example, to me it is not entirely clear whether the issues and solutions described in T66214: Define an official thumb API are in scope or not.

The design of the query API for imageinfo and thumbnails, and the URL pattern for thumbnails itself, is orthogonal to this RFC. It shouldn't matter to this RFC whether those interfaces use file names (that resolve to the current version of a mutable/versioned file), or refer to a specific version directly (using the filename/timestamp or content hash). Either public interface can be made on top of the current or here-proposed schema. In fact, even the implementation of T66214 shouldn't be affected by the schema changes. The implementation details that change in this RFC are mostly hidden behind abstract methods.

The only difference is that this RFC would introduce a new "file revision ID" concept that we may or may not want to support as a parameter in the imageinfo query API.

File hashes are already stored and queryable. T66214 may need to add an index or link table for the content hashes, but that's an unrelated schema change.

I also assume that T66214 has no intention of changing the fact that we use canonical references inside wikitext/html (e.g., not embedding an older file directly). The expansion to refer to content hashes and/or versioned thumbnail URLs would apply to the parser (and would need to purge pages when files change). If we want to support embedding old versions directly in wikitext, that would be a separate change also.

Summary of yesterday's IRC discussion of this RFC follows. (Full chat log at P3425.)

  • This RFC would provide stable file IDs. While they are not intended to become public identifiers, T66214 might want to use them. It would allow thumbnails (addressed by content hash) to be associated with a file ID rather than a file name. This would allow thumbnail storage to have a more stable and hierarchical structure, making purges easier, especially when users rename a file.
  • Having a primary key will make future schema changes easier.
  • Merely adding a primary key to the current schema would not solve Problem 2. As such, Proposal 1 was rejected.

21:02:21 <robla> #startmeeting RfC: image and oldimage tables
21:02:21 <wm-labs-meetbot`> Meeting started Wed Jul 13 21:02:21 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:02:21 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:02:21 <wm-labs-meetbot`> The meeting name has been set to 'rfc__image_and_oldimage_tables'
21:02:41 <robla> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:03:03 <robla> hi everyone!
21:03:27 <robla> https://phabricator.wikimedia.org/T589 is the topic for today
21:03:54 <robla> Krinkle: can you give a quick summary of the topic?
21:04:03 <Scott_WUaS> (Hi Everyone :)
21:04:12 <Krinkle> Yep
21:04:25 <Krinkle> Reference: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables - outlining here now
21:05:14 <TimStarling> hmm
21:05:19 <Krinkle> There is a fair number of problems with how we store multimedia in MediaWiki. This RFC will focus specifically on internal storage model for files and revisions.
21:05:27 <robla> (we're also trying to have a field narrowing conversation as defined here: https://www.mediawiki.org/wiki/Good_meetings#Taxonomy )
21:06:13 <robla> (if we're succesful, that means: "tentative consensus about the limited set of options, with someone assigned to clearly document the tentative consensus in the consensus-building forum of record")
21:06:19 <Krinkle> I think we have a lot of flexibility on where we want to go with this. Historically, the direction has been to replace it with a page/revision-like set up where files and file revisions have numerical IDs.
21:06:24 <TimStarling> if the idea is to get rid of INSERT SELECT then moving it to multi-content revisions won't do that
21:06:27 <Krinkle> But recently, several other ideas have come up.
21:07:07 <TimStarling> when jcrespo complained about how inefficient INSERT SELECT is, he linked to an incident involving filearchive, i.e. file deletion
21:07:09 <Krinkle> For the moment, I've summarised two problems on top of the page: "File revisions should have better unique identifiers" and "Files should not be identified by their caption, but be given a language-neutral identifier."
21:07:39 <Krinkle> TimStarling: Yeah, I think whatever we come up with, the idea of rows being moved should definitely go away.
21:07:48 <Krinkle> Which to me seems like proposal #1 is unfit.
21:07:50 <gwicke> those two problem statements have some overlap with the idea to use content hashes
21:08:03 <gwicke> https://phabricator.wikimedia.org/T66214
21:08:54 <Krinkle> gwicke: That RFC primarily deals with the thumbnail API, and HTTP api for files. It doesn't neccecarily dictate the storage model and/or how people reference them in wikitext/html.
21:08:57 <gwicke> as I said on the task, I am not entirely sure whether this is in scope, or how it would interact
21:09:06 <robla> #info topic of discussion - what is the scope of this RFC
21:09:53 <Krinkle> One use case you could think about is, if we have file IDs, those will likely be per-wiki, which makes them unsuitable for use from a foreign file repo (e.g. if we have {{#file:1234}} how should that work cross-wiki, do we still allow shadowing of file names?
21:09:56 <gwicke> much of the motivation for content-based addressing is to solve the cache invalidation / consistency issue
21:10:24 <gwicke> which is also a part of the motivation for moving towards more stable ids that don't change when images are renamed
21:10:54 <Krinkle> Another use case is to disallow mutation of files, and require a 1:1 mapping of a file upload and a description page - in which case re-upload would require updating all usage - which isn't practical cross-wiki and to third-parties. Though redirects could be use as canonical alias for frequently updated files.
21:11:22 <gwicke> you could have a 1:n mapping from descriptions to files
21:11:39 <gwicke> m:n if you factor in languages
21:11:55 <Krinkle> Yeah, though I think that would move internal details to the user land too much. We can duplicate internally on a per-fileblob level.
21:12:05 <Krinkle> By using hashes indeed.
21:12:16 <Krinkle> Having a 1:n mapping for mutable files and descriptions doesn't seem useful.
21:12:34 <gwicke> we already have that, and it's shown as a "file history"
21:12:47 <legoktm> #1 is a good idea regardless right? All tables should have primary keys...
21:13:00 <Krinkle> there is one description page per file history, not multiple for the same "file"
21:13:05 <legoktm> And it doesn't prevent us from implementing any of the other solutions
21:13:29 <Krinkle> legoktm: Yeah, it would be a very minimal change, but wouldn't solve any of the underlying problems such as killing the concept of rows being moved across tables, which is the main blocker (/me adds to problem statement)
21:13:32 <gwicke> Krinkle: I think we agree, it's already 1:n
21:13:39 <gwicke> one description, n file revisions
21:14:05 <legoktm> And having the primary key will make future schema changes (whatever they are) much easier
21:14:08 <robla> #info 14:12:48 <legoktm> #1 is a good idea regardless right? All tables should have primary keys...
21:14:18 <Krinkle> gwicke: Ah, yes.
21:14:29 <Krinkle> gwicke: I was thinking 1 versioned file, multiple versioned description pages
21:14:32 <Krinkle> which we don't.
21:14:53 <gwicke> yeah
21:14:53 <gwicke> not currently
21:15:07 <gwicke> although with languages, that might be in scope (?)
21:15:20 <robla> #info <Krinkle> [primary keys] wouldn't solve any of the underlying problems such as killing the concept of rows being moved across tables, which is the main blocker (/me adds to problem statement)
21:16:06 <gwicke> at a high level, is the goal of this RFC to clean up how the entities in the media / image space are structured?
21:16:20 <Krinkle> I'd like to disqualify option 1 from the RFC. But I'm curious of people think that would be worth the effort to do first. Personally I don't think it would benefit much given virtually any future direction would end up replacing that intermediary state completely.
21:16:38 <gwicke> also, are concerns around multi-project & multi-language media use in scope?
21:16:55 <Krinkle> gwicke: What do you mean by 'how they are structured'?
21:17:14 <gwicke> Krinkle: how the data model is structured
21:17:24 <Scott_WUaS> (I hope all this Wikicommons' media including video can feed in to, conceptually, a Google Street View with OpenSimulator some years ahead, where we can all "wiki-add" to Google Street View and OpenSim (conceptually again) since WMF may have a film-realistic, interactive, 3D, group build-able, with avatars planned for all 8k languages with Time Slider virtual earth for wiki STEM research, on the horizon :)
21:17:28 <legoktm> Krinkle: I'm not sure if option 1 solves the goals of what you want to do, but I think all tables do need primary keys regardless, so unless your plan is to remove the tables entirely, I don't see how it can be disqualified.
21:17:29 <gwicke> for example, how many descriptions can we have per media blob
21:18:07 * aude waves
21:18:14 <Krinkle> legoktm: option 2 involves a schema change that also includes primary keys. option 3 involves dropping the tables effectively, and re-creating them as secondary data (like categories) - which would have primary keys, but for completely new rows.
21:18:40 <robla> #info Krinkle wants to remove option #1, legoktm suggests we need primary keys regardless
21:18:41 <Krinkle> gwicke: I think whether descriptions are smashes into 1 wikitext blob or multiple should be discussed separately.
21:18:42 <legoktm> so really option 1 is already part of option 2 and 3?
21:19:20 <gwicke> Krinkle: it affects the data model, in the sense that each language would probably have its own page id & history
21:19:21 <TimStarling> fine by me to drop option 1
21:19:37 <TimStarling> option 1 is: add img_id and oi_id to the existing tables
21:19:40 <gwicke> while the license is probably tied to the blob itself
21:19:50 <Krinkle> legoktm: Kind of, option 1 has one aspect the other one doeesn't: It makes moving rows more complicated by lazy-assinging primary keys as they move.
21:19:53 <Krinkle> Which is worse in some ways.
21:20:31 <Krinkle> It means you still don't have a revision ID until after it becomes the N-1 revision.
21:20:43 <legoktm> okay
21:20:54 <legoktm> As long as the tables have primary keys, I don't have an opinion :)
21:21:02 <Krinkle> Cool
21:21:27 <robla> jynus: welcome! we've already decided everything; hope you don't mind ;-)
21:21:36 * aude reads the rfc
21:21:52 <Krinkle> gwicke: If we want to split translations of descriptions onto separate wiki pages (e.g. sub pages as opposed to slots), I'd prefer to keep that separate. It woudl also involve creating a linking table of file to wiki page.
21:21:53 <TimStarling> the particular strategy suggested in #2 may need some refinement
21:21:59 * robla now resumes trying to be a responsible chair
21:22:19 <TimStarling> of course with cur/old we renamed old and proxied cur, introducing a completely new page table
21:22:24 <TimStarling> which I think was the right way to do it
21:22:35 <jynus> INSERT SELECT is not inneficient, it is dangerous
21:22:35 <Krinkle> I think option 2 would be good to do even if we consider option 3 later. It would put us in a better shape. But the migration is very large, so we shoudl keep in mind the migration cost.
21:22:59 <TimStarling> revision was also a completely new table
21:23:16 <TimStarling> basically a new abstraction layer which allowed non-intrusive migration
21:23:19 <gwicke> I'm not entirely sure that we need an image id in addition to description ids & blob hashes
21:23:27 <aude> Krinkle: we will have some linking like that when we add structured data support for commons
21:23:34 <aude> but then sounds like the rfc is more generally about mediawiki
21:23:41 <Krinkle> Yeah.
21:24:17 <Krinkle> TimStarling: Right. If we create them as new tables, we could populate them one by one and do it again until it is caught up with the master, and then switch masters.
21:24:28 <aude> with structured data, descriptions might be part of the data item which is multilingual
21:24:49 <Krinkle> I don't know how oldimage and image are in terms of scale.
21:24:56 <Krinkle> Probably bigger than enwiki was at time of live-1.5
21:25:08 <Krinkle> Can we afford to have a complete copy of it for a short time?
21:25:11 <Krinkle> jynus: ^
21:25:16 <Krinkle> (in the same db)
21:26:31 <jynus> forget about implementation
21:26:47 <jynus> on WMF, that is my job to figure it out
21:26:55 <TimStarling> yeah, but we are figuring it out
21:27:02 <jynus> so do not worry about it
21:27:17 <jynus> not the implementation
21:27:19 <Krinkle> jynus: I was just wondering whether it is feasible space-wise to have a complete duplicate of 'image' and 'oldimage' in the commons prod as part of the migration script.
21:27:21 <jynus> the migration, I mean
21:27:34 <jynus> why do you need that?
21:27:43 <Krinkle> :D
21:28:11 <Krinkle> jynus: Tim mentioned the best way to migrate with minimal read-only time is to create new tables and populate those instead of changing the existing one.
21:28:24 <robla> jynus: I think a lot of the thought around migration is to make sure we have something that works outside of the Wikimedia context (as well)
21:28:34 <Krinkle> I wouldn't be surprised if full migration would take days if not weeks.
21:28:48 <jynus> cannot we just convert image into image_revision?
21:28:55 <gwicke> so, in a future world where descriptions are in wikidata & this RFC introduces an image id, we'd have a mapping from name to image id, Q-item to image ID(?), and image id to list of blob hashes?
21:29:13 <Krinkle> gwicke: That sounds pretty good.
21:29:24 <Krinkle> Note the concern about the practical use of image IDs, though.
21:29:33 <TimStarling> this RFC is about image revision backend, we're not going to have this discussion on terms of "don't think about that, that's the DBA's job"
21:29:46 <jynus> I do not mean that
21:29:51 <Krinkle> Where file names (terrible as they are) are kind-of practical to use globally (across wikis), file IDs would obviously conflict if kept numerical.
21:30:03 <jynus> I mean that sometimes you block tourselves thinking "that cannot be done"
21:30:17 <jynus> and 99% of the times things can be done
21:30:35 <jynus> discuss *if* you want a final state or not
21:30:39 <aude> Krinkle: maybe globally, then need to be qualified with interwiki prefix or something
21:30:42 <TimStarling> ok, but I think you're jumping in without being fully up to date on the discussion
21:30:49 <Krinkle> Okay, as long as it doens't require a larger hard drive or newer hardware just to migrate.
21:30:49 <jynus> there is always a way
21:31:01 <aude> to make globally unique
21:31:03 <gwicke> I do wonder if it could be name to blob hashes, name to Q-item, and Q-item to blob hashes instead
21:31:17 <jynus> I read the backlog
21:31:55 <TimStarling> ok, can we talk about what you're proposing then?
21:32:21 <TimStarling> convert image to image_revision -- not sure what this means
21:32:53 <jynus> I am not proposing anything, I am asking why you want to duplicate the 'image' and 'oldimage' tables
21:33:55 <jynus> here image_revision (in my mind) is the image table
21:34:31 <Krinkle> jynus: The current tables have each row describe the object in full (with no primary keys). With 'image' holding the current revs and oldimage holding the non-current ones. one the new schemas proposed would involve the 'image' table no longer containing full descriptions (just a pointer to the current revision in image_revision) and both current and
21:34:31 <Krinkle> non-current rows being in image_revision.
21:34:57 <jynus> I know that
21:35:15 <jynus> I just do not see why you want to duplicate them
21:35:22 <TimStarling> it was just an idea
21:35:30 <TimStarling> I'm happy to back off from it if you have a better idea
21:35:37 <jynus> tell me about that idea
21:36:30 <Krinkle> Firstly, the software would need to know which table to query, and whether to expect the old or new system in it during the migration.
21:36:34 <jynus> I do not see how it fits proposal 1 or 2
21:36:57 <jynus> ok, so you are not duplicating things
21:37:13 <jynus> you just have 4 tables during the migration
21:37:28 <Krinkle> jynus: one idea to migrate to the new schema was to create the newly schemad and named tables, import all the data, and drop/flip once finised.
21:37:29 <Krinkle> Yes
21:37:32 <jynus> which is ok
21:37:41 <jynus> that doesn't need *duplicating data*
21:37:49 <Scott_WUaS> legoktm: Re [14:20] <legoktm> As long as the tables have primary keys, I don't have an opinion :) ... what are the steps for anticipating orders of magnitude more primary keys ... say for modeling a) brain neurons and b) at the much smaller nano-level (in a hypothetical Google Street View with OpenSimulator, conceptually and some years' ahead, and for brain research? Thanks.
21:37:54 <jynus> nothing against it
21:38:01 <jynus> and it does not require extra resources
21:38:34 <TimStarling> my idea was to do it in a similar way to how page/revision migration was done
21:38:48 <Krinkle> jynus: Well, commonswiki would temporarily hold a significantly larger data set. Since we'd have essentially a copy of those two tables?
21:39:01 <Krinkle> Or does mariadb have a way to import string and binrary values by reference?
21:39:19 <jynus> but images would be on one format or another, not both, right?
21:39:19 <TimStarling> that is, add a new table (say imagerevision) which has one row per image revision, but with very small rows
21:39:52 <jynus> If what Tim says is #2, I like that better
21:40:11 <jynus> "just adding PKs" would be very inneficient
21:40:19 <Krinkle> jynus: No, the software would keep using the old tables until the new tables are completely ready for use (with a small read-only time to catch up)
21:40:19 <jynus> and denoramized
21:40:39 <jynus> no, I do not like that, Krinkle
21:40:45 <Krinkle> Exactly.
21:41:04 <jynus> we can do better
21:41:07 <Krinkle> All of this is #2. We're just talking about how the migration would go.
21:41:11 <TimStarling> the image table currently has things like img_height, img_metadata
21:41:22 <jynus> we can migrate progresively
21:41:25 <TimStarling> img_metadata in particular can be enormous
21:41:37 <robla> #info <TimStarling> that is, add a new table (say imagerevision) which has one row per image revision, but with very small rows <jynus> [if this is proposal #2] I like that better
21:41:41 <gwicke> ftr, I'm still not convinced that an image id would help us solve the changing-image-dimension or more generally image metadata caching problem
21:41:55 <TimStarling> so you could introduce imagerevision which would just give a pointer to the location of the full image metadata
21:42:12 <jynus> yes, and that would be relativelly small
21:42:22 <Krinkle> TimStarling: What location is that?
21:42:39 <TimStarling> initially it could point to the existing image and oldimage tables
21:42:47 <jynus> I do not have 100% clear the fields
21:43:00 <jynus> but I think those are details
21:43:13 * gwicke heads out for lunch
21:43:58 <jynus> I would push for a non-freezing migration- being temporarily compatible with both or filling it up in parallel
21:44:11 <robla> so...we have three proposals in the current RFC, and option #2 is the one that this group seems to be the most interested in fleshing out. is that right?
21:44:11 <Krinkle> TimStarling: Hmm I guess we could even not have the pointers if this file-metadata table becomes keyed by image_revision
21:44:42 <TimStarling> potentially
21:44:55 <TimStarling> it may be that I'm overcomplicating things
21:45:00 <Krinkle> jynus: Yeah, but I don't think that is feasible due to the lack of the needed indexes and primary keys. I don't see a way to make the software maintain both in parallel.
21:45:29 <Krinkle> Anyhow, let's do actually worry about migration later.
21:45:39 <Krinkle> Let's continue about what direction we want to go in.
21:45:39 <TimStarling> since the problem we were dealing with with cur/old was that the old table at the time had the actual text in it, something like 90% of our entire database and we didn't even have the disk space to duplicate it
21:46:03 <Krinkle> Yeah, good point.
21:46:06 <Krinkle> It predates ES
21:46:16 <jynus> Krinle, remember that I am going to add an id to the watchlist table (which has not PK) in a hot way
21:46:39 <jynus> so there are many tricks to do
21:46:51 <jynus> what is better
21:46:54 <Krinkle> Adding keys is simple imho. The software can be feature flagged even as to whether to create/query those.
21:46:56 <robla> #info old cur/old->page migration happened before External Storage existed
21:47:00 <TimStarling> I added an autoincrement PK to the logging table, it was a nuisance IIRC, but possible
21:47:19 <jynus> I plan to reduce our database size magically: https://phabricator.wikimedia.org/T139055
21:47:35 <jynus> that was my point about "do not worry too much about that"
21:47:44 <jynus> I have you covered
21:48:21 <Krinkle> If we go with option 2. What about file IDs? Would they be appropiate for use in APIs and user generated content? How would this go cross-wiki?
21:48:28 <robla> #info jynus plans to employ InnoDB compression, discussed in T139055
21:48:28 <stashbot> T139055: Test InnoDB compression - https://phabricator.wikimedia.org/T139055
21:48:41 <TimStarling> Krinkle: you mean like https://www.mediawiki.org/wiki/User:NeilK/Multimedia2011/Titles ?
21:48:57 <TimStarling> with file IDs used as the main user-visible identifiers?
21:49:28 <TimStarling> [[File:38798231]] etc.
21:50:01 <Krinkle> TimStarling: Yeah, I mean, initially I guess that wouldn't be relevant, since we'd still have file description pages, which have a unique wiki page name, and transclusion of files is indirectly behind resolving the page title first.
21:50:23 <jynus> here file means, actual files that can be overrided by another revision or revision?
21:50:24 <Krinkle> But it is something we may wish we had done differently if we don't think about it now.
21:50:31 * robla looks in the backlog for gwicke's task number regarding file ids
21:50:57 <TimStarling> I am a long way from convinced on this
21:51:24 * gwicke returns with burrito
21:51:28 <Krinkle> I'm okay with saying that file IDs will stay internal like we do with page IDs.
21:51:29 <robla> T66214 is the task gwicke mentioned earlier
21:51:29 <stashbot> T66214: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214
21:52:08 <jynus> the more I think, the more this is non-trivial
21:52:13 <Krinkle> it would however move the group of file name problems for a later RFC (need to have unique names at upload time, sometimes akward to use in wikitext, not stable, subject to redirects being deleted etc.)
21:52:25 <jynus> purging caches? will it affect them?
21:52:57 <Krinkle> jynus: The current option proposed here doesn't affect that in any way. However if we adopt stable file IDs, all that will becomem significantly easier, not harder.
21:53:13 <jynus> (I need to to check the loging and how it is stored on db)
21:53:20 <gwicke> jynus: those problems are what content hashes are designed to solve
21:54:21 <TimStarling> Krinkle: this is mostly independent of the backend issue isn't it?
21:54:27 <gwicke> from what I can tell, for most use cases listed in the present RFC, content hashes would actually have better caching properties
21:54:29 <Krinkle> TimStarling: re "I am a long way from convinced on this" - can you elaborate? I'd agree that there doesn't seem to be an easy solution to having a unique file ID that is stable and yet usable across wikis.
21:54:37 <TimStarling> we can add a UI to access file IDs later if we want to
21:54:49 <Krinkle> Yeah, totally, we don't need to tackle that at all.
21:55:04 <robla> we're nearing the end of our time. we started the meeting with 3 proposals in the RFC, and option 2 seems the most viable at the moment. is that right?
21:55:07 <Krinkle> The only that would become exposed is file revision ID, not file ID.
21:55:16 <Krinkle> Since we'd use that instead of the current (file name, timestamp) tuple.
21:55:21 <Krinkle> in APIs
21:55:42 <Krinkle> TimStarling: Right?
21:55:49 <TimStarling> yes
21:55:53 <Krinkle> (and that tuple is already fragmented by wiki, so no concerns there)
21:56:38 <jynus> revision id would be unique, as with page revisions
21:56:38 <Scott_WUaS> sounds good, robla:
21:56:41 <Krinkle> TimStarling: I suppose link tables will also stay the same, for the same reason we changed them to be based on titles instead of page ids.
21:57:06 <gwicke> wait, would this be for the file, or the page description?
21:57:08 <Krinkle> jynus: Yeah, but unlike file IDs (which would need to make sense from another wiki), file revisions are never referenced without a wiki context.
21:57:20 <jynus> yes
21:57:23 <TimStarling> URLs make sense in a global context
21:57:27 <jynus> as in, you are right
21:57:31 <TimStarling> autoincrement IDs, not so much
21:57:38 <TimStarling> UUIDs could work
21:57:46 <TimStarling> you could assign a 128-bit random ID or something
21:58:09 <TimStarling> but it's a discussion for another day
21:58:26 <Krinkle> gwicke: If we adopt something better than user-generated page names for public file identifiers in the future, the file page would presumably use that same ID in its title.
21:58:46 <jynus> I think there is generaly suppor for that, but maybe this need to mature a bit more with other outstanding issues related?
21:58:52 <gwicke> would the ID identify a blob, or a description of a blob?
21:59:05 <Krinkle> But I agree with Tim that we should leave that as-is for the purpose of this RFC (we keep using file "titles" as the file page and transclusion name for instant commons etc.)
21:59:33 <TimStarling> we need to support titles anyway, at least for b/c
21:59:36 <robla> Krinkle: any decisions you hope to make sure we document here?
21:59:44 <TimStarling> so I don't think we're boxing ourselves in by considering it later
22:00:34 <Krinkle> gwicke: The file ID I theorise here would identify the file as mutable entity. a revisioned wiktext description page would use that ID in its page name. (e.g. File:12345)
22:00:44 <Krinkle> But anyway, issues. Let's keep that outside the scope for now.
22:00:51 <robla> I'm going to summarize this as "Krinkle updated the RFC to give 3 options, we discussed them, and #2 seems to be the current favorite"
22:01:01 <Krinkle> LGTM.
22:01:25 * robla is going to hit #endmeeting in a couple of minutes; furhter discussion welcome on #wikimedia-tech
22:01:27 <TimStarling> option #1 is now excluded, but option #3 was not really discussed
22:01:58 <Krinkle> I'll update the task and call for fleshing out the details with regards to the exact schema (separate image meta data? any other columns to be dropped?) and how to migrate there (rename and mutate, or re-create, or both for a while)
22:02:17 <Scott_WUaS> Krinkle: "the file as mutable entity" - great ... and the primary key too?
22:02:18 <jynus> I have some optinion about #3- I am not against it, but I have the fear of it being too wide to potentially block all
22:02:27 <TimStarling> Krinkle: sounds good
22:02:30 <Krinkle> I'd also like to work with gwicke on fleshing out the MCR option and see if that is feasible in the short-medium term.
22:02:36 <jynus> "lets no do this because #3 will fix it in the end"
22:02:39 * robla will end meeting in 30 seconds
22:02:55 <jynus> I do not think it is exclusive to advance in some areas an other in parallel
22:03:02 <TimStarling> I may possibly be able to talk with jynus a bit longer in another channel (since it is school holidays)
22:03:07 <gwicke> I would like to be a bit more precise about which problems are in scope & which aren't
22:03:34 <Krinkle> Scott_WUaS: The primary key could be anything. UUID, integer, hash of sorts, not a concern imho. And could be migrated from later. Certainly not unique to this particular table.
22:03:37 <gwicke> right now it seems tricky to evaluate options without having that tied down
22:03:37 <Scott_WUaS> gwicke: new RFCs?
22:03:42 <robla> ok...further discussion welcome on #wikimedia-tech thanks all
22:03:54 <robla> #endmeeting

Update and call for discussion on wikitech-l:

https://lists.wikimedia.org/pipermail/wikitech-l/2016-August/086283.html
https://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg85607.html

Open questions

1. Which fields do we keep in the 'image' table (img_id, img_name, img_latest, anything else?).

All fields currently being queried from both tables will probably only stay in the image revision table. But there are a few fields that we intentionally only want to query about current versions, for example 'img_sha1'. For duplicate detection, we need to only consider the latest revisions. Doing this by keeping img_sha1 means uploading a new revision will involve updating two fields instead of one (img_latest and img_sha1). This isn't unprecedented, as we do this for page as well (WikiPage::updateRevisionOn; page_latest, page_touched, page_is_redirect, page_len).

Are there other fields we need to keep besides img_sha1? Or should we solve the img_sha1 use case in a different manner?
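For concreteness, here is a sketch of what the two tables could look like under this proposal (names and types are illustrative, not a final schema):

  CREATE TABLE file (
    file_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    file_name   VARBINARY(255) NOT NULL,
    file_latest BIGINT UNSIGNED NOT NULL, -- points to filerevision.fr_id
    file_sha1   VARBINARY(32) NOT NULL,   -- denormalised from the latest
                                          -- revision, for duplicate detection
    UNIQUE KEY file_name (file_name),
    KEY file_sha1 (file_sha1)
  );

  CREATE TABLE filerevision (
    fr_id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    fr_file      BIGINT UNSIGNED NOT NULL, -- points to file.file_id
    fr_size      INT UNSIGNED NOT NULL,
    fr_width     INT NOT NULL,
    fr_height    INT NOT NULL,
    fr_metadata  MEDIUMBLOB NOT NULL,      -- see question 2 below
    fr_sha1      VARBINARY(32) NOT NULL,
    fr_timestamp BINARY(14) NOT NULL,
    KEY fr_file_timestamp (fr_file, fr_timestamp)
  );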

2. img_metadata

This field is a blob of serialised PHP (typically representing the Exif data of an image).

Tim (correct me if I got it wrong) mentioned we could potentially make migration easier by changing img_metadata to be stored in a separate table, and changing the img_metadata field (in the image revision table) to instead be a pointer to a primary key.

This could potentially be done separately later, but if it helps migration, we should consider doing it now.

How will this interact with file deletion? Will it be difficult to garbage collect this? Do we need to? (We don't seem to do it for the 'text' table / external store; is it worth moving this to an external store?)
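One possible shape for such a split (hypothetical names; whether this lives in the main database or an external store is exactly the open question):

  CREATE TABLE filemetadata (
    fm_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    fm_blob MEDIUMBLOB NOT NULL -- serialised PHP, e.g. Exif data
  );

  -- filerevision would then carry a pointer instead of the blob:
  --   fr_metadata_id BIGINT UNSIGNED NOT NULL -- points to filemetadata.fm_id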

3. Migration

If we rename both tables (image/oldimage -> file/filerevision), we'd have the ability to run the migration in the background without interfering with the live site, and without requiring a long read-only period and/or additional code complexity to be developed.

Is there a way we can do the migration without creating two new tables? Using the oldimage table as import destination for current rows isn't straightforward, as existing scan queries would need to skip the current rows somehow while in the midst of this migration. Seems possible, but is it worth the complexity? (We'd need extra code that knows about that migration field, and how long do we keep that code? It also complicates migration for third parties using update.php.)

Is creating the new tables separately (and dropping the old ones once finished) viable for the scale of Wikimedia Commons? Is this a concern from a DBA perspective with regards to storage space? (We'd temporarily need about twice the space for these tables.) So far I understood that it wouldn't be a problem per se, but that there are also other options we can explore for Wikimedia. For example, we could use a separate set of slaves and alter those while depooled (essentially using an entirely separate set of db slaves instead of a separate table within each slave).

Do we create the new table(s) separately and switch over once it's caught up? This would require doing multiple passes as we depool slaves one by one (we've done that before at Wikimedia). Switch-over could be done by migrating before the software upgrade, with a very short read-only period after the last pass is finished. It wouldn't require maintaining multiple code paths, which is attractive.

Other ideas?

TechCom suggests this RFC for discussion at E266: ArchCom RFC Meeting W35: image and oldimage tables (2016-08-31, #wikimedia-office). Hopefully, we can discuss most of the thorny issues in the comments here, and then clarify at the IRC meeting. In particular, @jcrespo and @Krinkle, assuming timezones work for you, please have informal IRC conversations about this. It'll be especially appreciated if you drop the logs of those conversations in this task (or even just a pointer to the logs of the discussion venue you used).

Since @jcrespo commented on wikitech-l, I'm going to quote his August 11 email:

MediaWiki migration could be done in any form or way; the easiest path for a live migration is to keep the image table; add the logic to ignore older revisions on the same table; add the oldimage entries, which are very few in comparison; then create the non-revision version; migrate the bulk of the code functionality; then drop unnecessary columns. That was discussed at the time, although I am not sure it was all recorded.

Duplicating Commons infrastructure is not viable with the hardware we have now.

Modifying the image table is not "free", specially for Commons, but having the image/oldimage pattern is already causing issues for our editors.

Krinkle raised the priority of this task from Low to High. (Sep 7 2016, 8:19 PM)
Krinkle moved this task from Request IRC meeting to Under discussion on the TechCom-RFC board.

Summary of second IRC meeting about image/oldimage migration (E266).

(For the summary of the first RFC meeting, see T589#2463446.)

This second meeting deals with the remaining three open questions at T589#2541630.

  • Which solution to pursue?
    • Agreement to shelve the idea to manage upload history using MCR (Multi-content revisions). Only discussing the two-table file / file_revision option for now.
  • Question 1: What fields to keep unversioned in the image table?
    • Minimal set: img_id, img_name, img_latest.
    • Currently indexed: img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime.
    • img_user_text: Remove. For upload contributions, replaced by a filerevision index.
    • img_sha1: Keep. For duplicate detection, applies to latest only.
    • img_timestamp: Remove. Mostly used on oldimage, not image. image usage for pagination replaced by ids.
    • img_media_type/img_major_mime/img_minor_mime: TBD. Preference is to remove, but currently used by <Special:MIMESearch>. Could be implemented on top of SearchEngine instead. Depending on short-term feasibility, we may want to keep for now and remove later.
  • Question 2: What about img_metadata?
  • Question 3: Schema migration

@jcrespo and @tstarling support using a custom table view.

In my case, more like "let's get rid of old references, but in case of necessity, we can use views as a last resort aiming for backwards compatibility, despite introducing potential challenges (they are updatable as long as they have a 1:1 relationship with the underlying table)"

@jcrespo and @tstarling support using a custom table view.

In my case, more like "let's get rid of old references, but in case of necessity, we can use views as a last resort aiming for backwards compatibility, despite introducing potential challenges (they are updatable as long as they have a 1:1 relationship with the underlying table)"

I think we may be misunderstanding each other. As far as I'm concerned we won't have any state during which we interact with both the new and old tables (other than the seconds during which the scap operation runs).

We typically only support 1 database version in MediaWiki. We'll prep the breaking change in the master branch. Any extensions that interact with the table directly will have to update their code within the same 6-month release cycle. We'll update Wikimedia-deployed extensions as soon as possible (sometime before the next branch cut) so that we can continue to deploy from master once a week.

The migration we're talking about here is not the period during which developers update their code. From a cluster perspective, that period will not exist. This migration is about running the schema change in a way that we can keep the existing code as-is, and once the schema change is done (which we expect will take several days), we'll switch to the new code at once.

This means, during the migration, the new tables are unused as all read/write traffic goes to the old tables. Once the schema change is done, we deploy the next version of MediaWiki.

This is just one way of doing it, there are other ways and that's why we're discussing it here :)

@jcrespo Is using a view suitable for this particular way of migrating? Or should we do that in a way that doesn't involve a custom view? Or perhaps we can utilize the custom view if we migrate in a different way?

The migration we're talking about here is not the period during which developers update their code. From a cluster perspective, that period will not exist. This migration is about running the schema change in a way that we can keep the existing code as-is, and once the schema change is done (which we expect will take several days), we'll switch to the new code at once.

Yes, that is my understanding also.

This means, during the migration, the new tables are unused as all read/write traffic goes to the old tables. Once the schema change is done, we deploy the next version of MediaWiki.

I was imagining a scheme like the following:

  • Rename image to filerevision
  • Create a view called image which reads from the underlying filerevision
  • Add new fields to filerevision
  • Create the file table
  • Disable image uploads
  • Run a script which populates file from filerevision and updates filerevision with suitable links to file
  • Copy all rows from oldimage to filerevision (but the image view somehow filters out these copied rows)
  • Deploy the new MW version; start using file/filerevision
  • Re-enable image uploads
  • Drop table oldimage
  • Drop view image
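A minimal sketch of the first few steps of this scheme (MariaDB syntax; the fr_is_current flag is one hypothetical way for the view to filter out the rows later copied in from oldimage):

  RENAME TABLE image TO filerevision;

  ALTER TABLE filerevision
    ADD COLUMN fr_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ADD COLUMN fr_is_current TINYINT(1) NOT NULL DEFAULT 1;

  -- Compatibility view: existing code keeps querying "image" and
  -- sees only current revisions.
  CREATE VIEW image AS
    SELECT * FROM filerevision WHERE fr_is_current = 1;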

The view would only be useful to maintain compatibility for extensions; I am almost certain it would not work for the migration process itself. I do not see how a view will be useful in this case (views have trouble with indexes, because they don't have them; only the underlying table has them):

  • Rename image to filerevision
  • Create a view called image which reads from the underlying filerevision
  • Add new fields to filerevision
  • Create the file table
  • Disable image uploads
  • Run a script which populates file from filerevision and updates filerevision with suitable links to file
  • Copy all rows from oldimage to filerevision (but the image view somehow filters out these copied rows)
  • Deploy the new MW version; start using file/filerevision
  • Re-enable image uploads
  • Drop table oldimage
  • Drop view image

Something like:

  • Write some sanity check scripts for current and old structure "is this consistent?"
  • Make sure MediaWiki ignores extra columns on image
  • Add new fields to image
  • Create empty file table
  • Deploy new version of mediawiki that starts maintaining file and image table (new cols, rows) for new events
  • Copy all rows from oldimage to image
  • Run a script which populates file from image
  • Run sanity checks
  • Deploy the new MW version using the new structure at the same time that we rename image to filerevision
  • Drop table oldimage
  • Drop old unused fields
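A sketch of the two copy steps in this scheme, done in small batches to limit lag (the name ranges, trimmed column lists, and img_is_current flag are illustrative):

  -- Repeated over small name ranges rather than in one statement:
  INSERT INTO image (img_name, img_size, img_timestamp, img_is_current)
  SELECT oi_name, oi_size, oi_timestamp, 0
  FROM oldimage
  WHERE oi_name >= 'A' AND oi_name < 'B';

  -- The population script would batch this as well, picking the
  -- row flagged as current for each name:
  INSERT INTO file (file_name, file_latest)
  SELECT img_name, img_id
  FROM image
  WHERE img_is_current = 1;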

I know this requires an extra deployment, but I am honestly not sure a view will work in the above case (still producing the right query plans); if you do the work of checking it, and prefer to avoid deploying an intermediate software stage (compatible with the previous structure, but using the new one), I can help test it. Remember, I thought you wanted views for extension compatibility - that could work; I am not sure about core functionality, so I prefer the extra deployment.

I also do not like disabling uploads for an extended time, instead of creating backwards compatibility and then backfilling, which is more or less my strategy. Some devs here have done things like this in the past, at a smaller scale I think; maybe we can ask for tips.

The other issue is that there is too much copying in the original proposal - that is challenging for many reasons - copying rows in such a huge table will either be slow, or cause lag, or both (the performance team would not be happy).

It is not that I do not want to do much DBA work, it is that it creates IO stress, which normally the users notice. I do copies, but of very small tables (oldimage), and try to disrupt users as little as possible, plus renames should be almost instant. In exchange, there is an extra release, but that is half of the final code, it is not like a complete new version has to be created. 0:-)

I am probably not right here, but I would like you to at least consider such an alternative model, and take into account the potential implications of yours.

  • img_media_type/img_major_mime/img_minor_mime: TBD. Preference is to remove, but currently used by <Special:MIMESearch>. Could be implemented on top of SearchEngine instead. Depending on short-term feasibility, we may want to keep for now and remove later.

Perhaps we should kill img_media_type (that's pretty useless), but having the mime in the db has proven useful in the past where we had to debug things (e.g. only affects this type of image with this type of metadata)

It's also used by Special:MediaTypeStatistics.
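For context, Special:MIMESearch issues a query of roughly this shape, which is what the MIME columns and their index serve (illustrative, not the exact query):

  SELECT img_name, img_size, img_timestamp
  FROM image
  WHERE img_major_mime = 'video'
    AND img_minor_mime = 'webm'
  LIMIT 50;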

I've updated https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables to reflect the latest state of this proposal, based on the second IRC conversation (October 2016) and this week's meeting of the Architecture Committee.

The only remaining unresolved question was whether to keep or remove the media type and MIME type columns. Our current recommendation is to keep them (due to existing usage) and treat it as an orthogonal issue for later. We recognise the overhead of continued maintenance for this index, but that is not in scope of this RFC's problem statement and isn't a requirement for fulfilling the presented use cases.

Perhaps we should kill img_media_type (that's pretty useless)

Since it's now exposed via Cirrus, it allows me to search for "video". That's pretty nice. We don't need it in the table really, but the search engine has to get it from *somewhere*. It can be computed from the file data itself, but not in all cases from meta-data (ogg files need to be inspected to find out if they are video, for instance). Since we currently have no other place to store this info, it should probably stay in the table for now.

Perhaps we should kill img_media_type (that's pretty useless)

Since it's now exposed via Cirrus, it allows me to search for "video". That's pretty nice. We don't need it in the table really, but the search engine has to get it from *somewhere*. It can be computed from the file data itself, but not in all cases from meta-data (ogg files need to be inspected to find out if they are video, for instance). Since we currently have no other place to store this info, it should probably stay in the table for now.

It's not calculated correctly as it stands (WebM files with only audio tracks are marked as videos; Ogg files with metadata mentioning the word theora are marked as videos even if they are not; etc.).

Since the last-call email went out: do we have an estimate of how much time images are going to be read-only during the migration on Commons? If we are talking several hours, I'm sure people won't mind. If we are talking several days, the Commons community will be seriously unhappy.

The RFC could use a more detailed problem statement, which would help to establish a scope.

https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables#Problems is helpful, but I'll note that it's been many months since I last read an RfC presenting such strong use cases:

The result would be:

  • oldimage and filearchive table can be removed
  • full integration of file uploads with the page history, including patrolling, reverts, undeletion, etc
  • image table is still available for queries
  • file metadata would become available in XML dumps.

Thanks for making it clear what the benefits would be for users.

These are 2 ancient rules (that I have just made up) regarding database design:

:-)

Krinkle renamed this task from RfC: image and oldimage tables to Schema change: image and oldimage tables. (Feb 8 2017, 5:32 PM)
Krinkle closed this task as Resolved.
Krinkle reopened this task as Open.
Krinkle removed Krinkle as the assignee of this task.
Krinkle lowered the priority of this task from High to Medium.
Krinkle removed a subscriber: wikibugs-l-list.
Krinkle renamed this task from Schema change: image and oldimage tables to RFC: image and oldimage tables. (Feb 8 2017, 5:34 PM)
Krinkle closed this task as Resolved.
Krinkle claimed this task.
Krinkle raised the priority of this task from Medium to High.

With this RFC resolved, the implementation may proceed (filed as T28741).