Paste P3958

ArchCom-RFC-2016W35-irc-E266.txt

Authored by RobLa-WMF on Sep 1 2016, 12:00 AM.
21:04:09 <robla> #startmeeting oldimage/image discussion T589
21:04:09 <wm-labs-meetbot`> Meeting started Wed Aug 31 21:04:09 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:04:09 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:04:09 <wm-labs-meetbot`> The meeting name has been set to 'oldimage_image_discussion_t589'
21:04:09 <stashbot> T589: RfC: image and oldimage tables - https://phabricator.wikimedia.org/T589
21:04:18 <robla> #topic Wikimedia meeting channel | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:04:51 <robla> hi everyone!
21:05:09 <jynus> hi
21:05:29 <Scott_WUaS> hi All!
21:06:52 <brion> hi
21:06:58 <jynus> I am a bit confused by this topic, I think most people agreed in general about the idea; implementation details could be filled in and discussed offline?
21:07:13 <ostriches> Oh herp derp I'm here :)
21:07:23 <brion> it may be a short discussion then :D
21:07:25 * ostriches puts on his releng hat
21:07:48 <greg-g> wait, it was off the rest of the day?
21:08:12 <TimStarling> have you seen the comment Krinkle wrote after the last meeting?
21:08:17 <jynus> especially because I didn't see there were any concerns, just a question
21:08:20 <robla> jynus: we found that you were giving a lot of great feedback. Krinkle was hoping for more input on release strategy
21:08:25 <ostriches> greg-g: I was wearing https://www.amazon.com/Adult-Propeller-Beanie-Hat-Made/dp/B001QK4RZC instead
21:08:29 <Krinkle> Hi
21:08:44 <Krinkle> So the main questions for this meeting are written here - https://phabricator.wikimedia.org/T589#2541630
21:08:46 <jynus> no, that's a mistake; stop listening to what I say! :-)
21:08:51 <Krinkle> These were also on wikitech-l
21:08:57 <Krinkle> 3 open questions.
21:09:26 <Krinkle> robla: Shall I start with the first one?
21:09:31 <robla> sure
21:10:22 <Krinkle> So before I do, I want to help set the stage and clarify any misunderstanding about the scope of this RFC. This was raised in the previous IRC meeting. We cleared it up at the time but not everyone may be caught up with that. https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables#Problems
21:10:41 <Krinkle> We've reduced the scope to two core problems we want to address.
21:10:48 <Krinkle> 1. File revisions should have better unique identifiers than "current file title + timestamp".
21:10:54 <Krinkle> 2. Uploading file revisions must not involve rows moving across tables, or rows being replaced.
21:11:26 <Krinkle> The first one is primarily from an architecture perspective to simplify a lot of our internal classes and API consuming, as well as to remove conflicts in race conditions.
21:11:55 <Krinkle> The second one is because of an anti-pattern in our database schema that jynus wants us to get rid of (and I wholeheartedly agree)
21:12:08 <jynus> 2 is to stop commons from exploding like it happens every 6 months
21:12:19 <Krinkle> :)
21:12:29 <TimStarling> jynus: exploding as in replication failure?
21:12:29 <jynus> and "make Labs great again"
21:12:44 <jynus> yes, TimStarling
21:13:11 <jynus> it has to be a special case for it to happen in production, but it definitely happens on labs
21:13:22 <tgr> how is this related to multi-content revisions? it seems like a plan to refactor something that will need to be refactored again not much later
21:13:23 <jynus> like a master failover
21:14:19 <Krinkle> tgr: I think Multi-content revisions is for the future right now. It's too big a step to move into right away. However once MCR is stable and used elsewhere, by doing this RFC, we'll be in a position where migrating to MCR is fairly straight forward.
21:14:32 <jynus> Krinkle, +1
21:15:06 <TimStarling> question 1 was what fields to keep in image?
21:15:09 <DanielK_WMDE> yea, what Krinkle said... is discussing the MCR option in scope of this meeting? I like the idea, but it would probably mean shelving this until we have MCR, and have some experience with it (let's say, in a year).
21:15:29 <Krinkle> Yeah, we more or less shelved MCR for file tables in the last meeting.
21:15:36 <Krinkle> But it's a good question to ask.
21:15:54 <jynus> I think it is worth it, once I propose a solution for 3; independently of the status of MCR
21:15:54 <DanielK_WMDE> migrating twice is extra work, but this sounds urgent, going for the more immediate options seems wise for now.
21:16:21 <Krinkle> TimStarling: Yeah. So question 1: Given that we'll put all image revisions in 1 table (instead of one for current and one for older ones), which fields do we want to keep in our generic 'image' table (the equivalent of 'page' basically).
21:16:48 <TimStarling> I notice that the following fields have indexes: img_name, img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime
21:16:56 <Krinkle> The minimal set of fields would be: img_id, img_name, img_latest
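The one-table-per-file-history split Krinkle describes can be sketched as follows. This is a minimal illustration only, using SQLite in place of MySQL; the `file`/`filerevision` table and column names are assumptions based on the discussion, not the final schema:

```python
import sqlite3

# Hypothetical sketch: a slim 'file' table (analogous to 'page') plus a
# 'filerevision' table holding one row per upload, so new uploads never
# move rows between tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (
    file_id      INTEGER PRIMARY KEY,
    file_name    TEXT UNIQUE NOT NULL,
    file_latest  INTEGER NOT NULL DEFAULT 0,  -- denormalised, like page_latest
    file_sha1    TEXT                         -- kept for duplicate-upload warnings
);
CREATE TABLE filerevision (
    fr_id        INTEGER PRIMARY KEY,
    fr_file      INTEGER NOT NULL REFERENCES file(file_id),
    fr_timestamp TEXT NOT NULL,
    fr_sha1      TEXT NOT NULL,
    fr_deleted   INTEGER NOT NULL DEFAULT 0   -- rev-deletion bitfield, like rev_deleted
);
""")

# Uploading a revision is an insert plus a file_latest bump in one
# transaction, the same pattern used for page_latest.
with conn:
    conn.execute("INSERT INTO file (file_name) VALUES ('Example.jpg')")
    cur = conn.execute(
        "INSERT INTO filerevision (fr_file, fr_timestamp, fr_sha1) "
        "SELECT file_id, '20160831210000', 'abc123' FROM file "
        "WHERE file_name = 'Example.jpg'"
    )
    conn.execute(
        "UPDATE file SET file_latest = ?, file_sha1 = 'abc123' "
        "WHERE file_name = 'Example.jpg'", (cur.lastrowid,)
    )

# The current revision is found by primary key, not by title + timestamp.
latest = conn.execute(
    "SELECT fr_id FROM filerevision JOIN file ON fr_id = file_latest "
    "WHERE file_name = 'Example.jpg'"
).fetchone()
print(latest[0])
```

This gives file revisions the stable unique identifier (`fr_id`) that problem 1 asks for, while problem 2 is addressed because a new upload only appends a row.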
21:17:08 <AaronSchulz> DanielK_WMDE: urgent?
21:17:38 <TimStarling> img_sha1 is used to provide duplicate upload warnings
21:17:52 <DanielK_WMDE> AaronSchulz: if I understood correctly, jynus says commons blows up because of data being moved between tables
21:18:29 <AaronSchulz> "blows up" could be clarified
21:18:30 <DanielK_WMDE> AaronSchulz: the point being: the problem is urgent. while managing file revisions in mcr is a nice idea, it's blocked for a while. so let's do something else for now
21:18:33 <TimStarling> it's probably still needed for that purpose
21:18:40 <Krinkle> Yeah, so we'll need to keep that in sync. This is not unprecedented as we do this for page as well (page_touched, page_is_redirect, page_len) whenever we update page_latest. But ideally fewer is better I guess.
21:18:42 <jynus> AaronSchulz, https://wikitech.wikimedia.org/wiki/Incident_documentation/20160705-commons-replication
21:19:06 <Krinkle> We also have a query page (special page and API) for finding files by mime type.
21:19:08 <AaronSchulz> isn't that mostly just INSERT SELECT
21:19:11 <AaronSchulz> ?
21:19:21 <TimStarling> the img_user_text image is for contributions, that will be replaced by a filerevision index
21:19:22 <jynus> AaronSchulz, yes
21:19:25 <gwicke> I would still like to see content-based image ids (hashes) rather than other random ids if that's feasible to do while we are at it
21:19:29 <TimStarling> s/image/index/
21:19:46 <Krinkle> gwicke: For the revisions?
21:19:53 <TimStarling> the img_media_mime is a joke, do we seriously use that?
21:19:59 <gwicke> yes, for each image original
21:20:09 <Krinkle> gwicke: We do currently have duplicates, however.
21:20:22 <jynus> gwicke, do not we have already a hash column?
21:20:22 <Krinkle> And given we are not separating content from revision right now, that wouldn't be semantically possible.
21:20:34 <AaronSchulz> jynus: we only do that for moving the current to the oldimage table and the current to the archive table (e.g. 1 row). The others now do the SELECT in the app.
21:20:35 <DanielK_WMDE> #info agreement to shelve the idea to manage upload history using MCR. Discussing the two-table file / file_revision option for now.
21:20:42 <Krinkle> since after a rollback it'd be the same content, but a different timestamp+user.
21:21:10 <DanielK_WMDE> Krinkle: how does the current proposal support revision deletion? (do we support rev-deletion for images at all right now?)
21:21:25 <gwicke> Krinkle: documenting those considerations on the RFC might be useful
21:21:32 <AaronSchulz> jynus: https://gerrit.wikimedia.org/r/#/c/307078/ will banish even those
21:21:42 <gwicke> for later reference, as well
21:22:26 <Krinkle> gwicke: Even if we separate them later, we still need a revision ID. Which is the first step (this step).
21:22:56 <DanielK_WMDE> TimStarling: the original idea was to allow people to search for "videos" or "bitmaps". never happened, should be done with elastic now. img_media_type can die in the db. the concept may still be useful in code, even if we just feed it to the search engine.
21:23:05 <jynus> AaronSchulz, I say thanks, but why moving things in the first place?
21:23:05 <gwicke> unless we decide that having two different copies of the same thing & under possibly different licenses isn't really useful
21:23:17 <TimStarling> Special:MIMEsearch.php mentions that the index is wrong
21:23:21 <DanielK_WMDE> SMalyshev: does cirrus index media type (not mime) for files?
21:23:22 <gwicke> but, agreed, that brings up a whole can of worms
21:23:22 <tgr> we do support rev-deletion for images, but that just updates the value of a bitfield, does not seem problematic
21:23:26 <Krinkle> DanielK_WMDE: Elastic specific? Or possible in mysql default search too?
21:23:26 <TimStarling> it doesn't contain a sorting field
21:23:27 <AaronSchulz> jynus: oh, the moving still sucks, I'm just saying the urgent part isn't apparent to me
21:23:38 <tgr> DanielK_WMDE: do you mean normal deletion of single revisions?
21:23:46 <DanielK_WMDE> tgr: single revision.
21:23:48 <AaronSchulz> I'm all on board for refactoring to have filerevision
21:23:50 <jynus> AaronSchulz, I want things fixed, I do not care how
21:23:56 <SMalyshev> DanielK_WMDE: hmm not sure
21:24:02 <Krinkle> I think he means 'revision deletion' (hiding of revisions within the same table, like rev_deleted)
21:24:06 <ostriches> DanielK_WMDE, SMalyshev: I don't believe we do
21:24:10 <SMalyshev> I don't think right now a lot is indexed for files
21:24:17 <SMalyshev> let me check
21:24:18 <Krinkle> And yes, oi_deleted exists.
21:24:31 <DanielK_WMDE> Krinkle: we have a field-based system for feeding the search engine now. core could expose the field, but the default (sql) search engine would just ignore it for now
21:24:32 <TimStarling> that would ideally need to be integrated with the SearchEngine interface in core
21:24:43 <jynus> AaronSchulz, I wasn't the one saying it was urgent; however, I think MCR is a bit ambitious to block lots of small changes on it
21:25:20 <DanielK_WMDE> TimStarling: SMalyshev has recently implemented an interface that could be used to do that I think
21:25:41 <ostriches> Sidenote: SearchEngine & friends are terrible and need a lot of cleanup.
21:25:44 <ostriches> (never got to that)
21:25:45 <Krinkle> Current fields we are discussing: https://www.mediawiki.org/wiki/Manual:Image_table
21:25:47 <AaronSchulz> jynus, DanielK_WMDE: I just want us not to feel a need to rush it :)
21:25:55 <tgr> Krinkle: that's being done via img_deleted / oi_deleted / fa_deleted currently, does not seem hard to migrate
21:26:13 <TimStarling> so let's suppose that Special:MIMEsearch can be deleted from core
21:26:28 <TimStarling> that would let us remove that index
21:26:29 * gwicke would be very interested in a summary outlining how this will integrate with the longer term image naming roadmap, including things like content based addressing
21:26:29 <Krinkle> tgr: Yeah, oi_deleted is just a revision property. That'll migrate just fine and isn't part of the generic image object.
21:26:32 <tgr> but normal deletion does move rows to filearchive (and page revisions work identically), do we intend to change that?
21:26:35 <DanielK_WMDE> AaronSchulz: well, if it's ok to push this back for a year, I'm all for the MCR option :)
21:26:35 <SMalyshev> DanielK_WMDE: I don't think for files we index more than text
21:26:38 <Krinkle> (the same way there is no rev_del in the page table)
21:26:39 <jynus> the only thing related to fields I do not have clear is the whole old-revision-handling
21:26:54 <Krinkle> tgr: No, archiving is not in the scope of this RFC.
21:26:57 <DanielK_WMDE> SMalyshev: there is some special case code for file pages, let me find it...
21:26:57 <jynus> does that require some tuning on indexes, extra fields, etc?
21:27:25 <SMalyshev> DanielK_WMDE: we could of course add more fields, shouldn't be hard
21:27:52 <jynus> do we have currently issues with lots of old revisions of the same file?
21:28:00 <Krinkle> gwicke: I'd say the outline is that it'll be a whole lot easier once we have primary keys for images and their revisions. It's mostly just technical debt that makes everything easier after it.
21:28:30 <ostriches> jynus: For some files, yes. Most files have very few revisions.
21:28:34 <ostriches> So, long tail.
21:28:45 <Krinkle> We should think about those things a little, but not much imho as there isn't much we could do wrong or differently in that respect. We're just moving the table around basically.
21:28:47 <jynus> from what I saw, there were relatively very few old revisions
21:28:55 <Krinkle> Yeah, that surprised me too.
21:29:06 <Krinkle> I expected there to be more old revisions than current revisions.
21:29:12 <Krinkle> But we didn't include filearchive here
21:29:13 <AaronSchulz> meh, I've known that for a while
21:29:16 <ostriches> That doesn't surprise me.
21:29:20 <SMalyshev> DanielK_WMDE: though I wonder if there's intersection between this and structured commons work...
21:29:26 <ostriches> The vast majority of files get uploaded and never edited again
21:29:28 <Krinkle> yeah, it makes sense that most images don't have multiple revisions.
21:29:32 <AaronSchulz> I remember when I added paging to the file history due to big history files
21:29:44 <ostriches> On the other hand, there's some images that get edited all the time (think SVGs with routinely updated data)
21:29:45 <DanielK_WMDE> SMalyshev: ah, right, file_text is just one field. there's a few more useful things we could index for use on commons, i think. we could supper type:video or resolution:>800
21:29:51 <AaronSchulz> I don't recall any scaling issues we are facing right *now* aside from moving all those files around
21:29:53 <DanielK_WMDE> *support
21:30:00 <AaronSchulz> but that's not affected by this work
21:30:14 <tgr> jynus: there are some edge cases, like the image that the smoke tests update every day, but they are extremely rare
21:30:15 <ostriches> DanielK_WMDE, SMalyshev: Yeah, we never stuffed a ton of data about files into Elastic originally
21:30:24 <ostriches> Mainly file_text so we could search inside of pdf/djvu files
21:30:24 <Krinkle> So back to the original question :) Aside from img_latest and img_sha1 (for duplication detection). Any other fields we need to keep?
21:30:26 <AaronSchulz> I suppose a pdf with many revisions could hurt
21:30:29 <DanielK_WMDE> SMalyshev: i see no overlap with structured commons work tbh. it's orthogonal
21:30:32 <SMalyshev> DanielK_WMDE: it should be pretty easy to do now that we have infrastructure to add fields easily
21:30:35 <AaronSchulz> or djvu...all those big blobs of text
21:30:42 <Krinkle> If we don't keep img_media_type we need to migrate Special:MIMESearch first. Do we want to block on that? How trivial would it be?
21:30:43 <AaronSchulz> in oi_metadata
21:30:50 <DanielK_WMDE> SMalyshev: yay! make a ticket?
21:31:03 <Krinkle> Will we be able to make that special page work with default mysql search?
21:31:20 <SMalyshev> DanielK_WMDE: and we also have much easier keyword adding in Cirrus, btw - it's now proper structure instead of one humongous switch
21:31:24 <Krinkle> Or do we want to remove it from core or somehow make disable for simple backends?
21:31:30 <jynus> can I be pedantic and suggest, as usual, not to store in text things that only have a few values?
21:31:42 <jynus> like mime types
21:31:49 <ostriches> Krinkle: Not easily with the way things are right now.
21:31:54 <SMalyshev> DanielK_WMDE: so yeah I think making a ticket and figuring out what we want to put there won't hurt
21:31:55 <TimStarling> I doubt there is anyone who really relies on that special page
21:31:56 <ostriches> I would be +10000 to removing it from core.
21:31:57 <Krinkle> jynus: It's currently ENUM
21:32:03 <jynus> Krinkle, good!
21:32:09 <TimStarling> /**
21:32:09 <TimStarling> * The index is on (img_media_type, img_major_mime, img_minor_mime)
21:32:09 <TimStarling> * which unfortunately doesn't have img_name at the end for sorting.
21:32:09 <TimStarling> * So tell db to sort it however it wishes (Its not super important
21:32:09 <TimStarling> * that this report gives results in a logical order).
21:32:23 <TimStarling> this is a special page which gives you images of a given type, in random order
21:32:29 <gwicke> Krinkle: one question that could have an impact on how we want to evolve things is which metadata should be associated with the image itself, and which should be associated with a name + description
21:32:39 <SMalyshev> DanielK_WMDE: btw there's a big reindex coming IIRC sometime soon... so maybe worth discussing with the whole search team if we want to add stuff before it
21:32:54 <Krinkle> We also have https://commons.wikimedia.org/wiki/Special:MediaStatistics
21:32:56 <SMalyshev> DanielK_WMDE: because adding stuff after would take long time to become useful
21:33:01 <Krinkle> Which is arguably less random than MIMESearch
21:33:14 <TimStarling> can we go to question 2?
21:33:42 <Krinkle> Yeah, time is moving. OK. So mime remains an open question for later. No other fields?
21:34:12 <Krinkle> question 2: What about img_metadata field. It's fairly large. Do we just move it to the new imagerevision? (like it is now). Or differently?
21:34:26 <ostriches> Krinkle: MIMESearch is basically only useful for the less-used mimetypes :)
21:34:31 <Krinkle> This field is a blob of serialised PHP (typically representing the Exif data of an image).
21:34:42 <ostriches> Like, who on earth is doing [[Special:MIMESearch/image/jpeg]]
21:34:44 <ostriches> ;-)
21:34:44 <gwicke> it's pretty horrible for djvu files especially
21:34:48 <gwicke> megabytes of xml
21:34:53 <AaronSchulz> Krinkle: maybe external store support?
21:35:00 <gwicke> caused some outages
21:35:06 <TimStarling> EXIF data is usally pretty small, the problem is that on wikisource it stores OCR text of enormous scanned books
21:35:10 <AaronSchulz> I mean if we still have filearchive, we don't want to be moving those around even after a big refactor
21:35:13 <robla> #info Krinkle: question 2: What about img_metadata field. It's fairly large. Do we just move it to the new imagerevision? (like it is now). Or differently?
21:35:17 <AaronSchulz> that would be a missed opportunity
21:35:49 <ostriches> It should've been normalized when Brian redid our Exif/etc support a few years back, we just didn't get to that point.
21:35:56 <AaronSchulz> TimStarling: yeah, tiny ones should be able to stay, and a flag field could say if it use ES or something perhaps
21:36:12 <Krinkle> I'd like to avoid another 'text' table.
21:36:25 <TimStarling> well, we have metadata upgrade which means we need to load it all the time
21:36:54 <gwicke> this comes back to my earlier question about which metadata should be associated with the image itself, vs. the name -- exif data and OCR XML look more like a property of the image itself
21:36:54 <TimStarling> we probably need to load it for other things...
21:37:10 <Krinkle> gwicke: image revision itself?
21:37:17 <gwicke> no, image
21:37:19 <AaronSchulz> we load it for thumbnails I think
21:37:31 <gwicke> the same image under different names would still have the same OCR XML
21:37:41 <tgr> just use blob storage for the extracted text?
21:37:42 <AaronSchulz> might be nice to have metadata that can point to more metadata
21:37:50 <Krinkle> 'image' = equivalent to page, not image content. There is no image content right now.
21:37:51 <AaronSchulz> so thumbnails could render from the minimal metadata
21:37:53 <gwicke> or jpg metadata, for that matter
21:37:54 <tgr> that's going to be its eventual fate anyway
21:37:58 <AaronSchulz> and pages of djvu from the rest
21:38:04 <Krinkle> gwicke: One image with two revisions needs two meta data blobs
21:38:21 <gwicke> to record user / timestamp / comment, yeah
21:38:24 <Krinkle> It can't be in 'image'.
21:38:26 <gwicke> but the XML etc can be unchanged
21:38:34 <jynus> do we currently store "metadata" for old revisions, it is not clear to me?
21:38:37 <AaronSchulz> and the minimal metadata would fit in cache with the rest of the filerevision fields
21:38:45 <AaronSchulz> jynus: we do
21:38:49 <gwicke> so, if that was keyed on content hash, then we'd get the right association
21:38:54 <Krinkle> A new revision means a new upload, which would always bring new Exif. It could be unchanged, but it's part of the image file.
21:38:55 <jynus> ok, then keep doing it
21:39:01 <gwicke> & avoid the need to touch it when reverting
21:39:20 <Krinkle> Can we deterministically hash Exif and other meta data?
21:39:27 <jynus> we can improve it in the future, as we will probably will, but I think that is out of the scope?
21:39:29 <Krinkle> (Do we normalise first?)
21:39:31 <gwicke> we can hash the original
21:39:41 <gwicke> and just reuse the existing extracted metadata
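gwicke's suggestion, hashing the original upload and keying the extracted metadata on that hash, can be sketched as below. This is a hedged illustration of the idea only (the function names and in-memory store are hypothetical, not MediaWiki APIs):

```python
import hashlib

def content_key(file_bytes: bytes) -> str:
    # Key derived from the original file bytes, not from the extracted
    # metadata, so a revert that re-uploads identical bytes maps to the
    # same key deterministically (no normalisation question arises).
    return hashlib.sha1(file_bytes).hexdigest()

metadata_store = {}  # hypothetical: content hash -> extracted metadata blob

def store_metadata(file_bytes: bytes, extracted: str) -> str:
    key = content_key(file_bytes)
    # A revert reuses the existing entry instead of re-storing megabytes
    # of DjVu OCR XML, which also avoids cache invalidation on revert.
    metadata_store.setdefault(key, extracted)
    return key

k1 = store_metadata(b"original bytes", "<xml>ocr text</xml>")
k2 = store_metadata(b"original bytes", "<xml>re-extracted</xml>")
print(k1 == k2)
```

Two revisions with identical file content share one metadata entry, which is the dedup property gwicke is after; a plain numeric id would need an extra lookup to achieve the same.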
21:39:54 <TimStarling> what's wrong with having another text table?
21:40:07 <Krinkle> Either way, yeah, we could have a separate image_metadata table. Would that table associate primary keys with blobs and/or references to external store? Whcih we reference in imagerevision? Or do we reference it directly?
21:40:28 <jynus> TimStarling, seems like a good idea whenever, even if there is a 1:1 relationship, you will not always read or write the entire row
21:40:51 <Platonides> we said years ago that we would change the image / oldimage structure
21:40:52 <gwicke> could the image revision metadata reference a hash, and the blob metadata then key on hash as well?
21:40:55 <TimStarling> maybe metadata for most things could be in filerevision, and the DjVu handler could have its own special pointer to ES
21:41:17 <TimStarling> that way you can efficiently load things like video duration when it's needed for list display
21:41:21 <AaronSchulz> gwicke: we currently cache djvu tree metadata by base file hash
21:41:21 <Krinkle> TimStarling: I meant that if we need a generic place to store blobs in a way that won't be moved, altered or deleted. We may want to re-use text instead of creating another one. I don't know if that's beneficial or not though.
21:41:32 <Krinkle> Given we already have sharding etc. and a scalable solution for that.
21:41:33 <jynus> gwicke, you keep saying hash, and if it is a hash or another kind of id, it doesn't matter
21:42:00 <TimStarling> when djvu OCR was introduced, we put it in img_metadata because there wasn't really a simple alternative
21:42:02 <jynus> we can store the hash separately or as an id
21:42:02 <gwicke> jynus: it matters a lot for dedup and cache invalidation
21:42:09 <jynus> ^
21:42:33 <Platonides> it would be nice to have ids for images
21:42:34 <Krinkle> TimStarling: I like that. So we have reason to avoid a separate table because we typically need to query both anyway?
21:42:56 <jynus> Krinkle, that is the right question
21:43:25 <jynus> separte if logically or physically is separated
21:53:28 <Krinkle> And that balances okay with the cost of having to read/write both since we would mostly only be writing it once? (aside from archiving). In the current schema in production, we'd avoid this because it'd need to move the metadata when a new revision is uploaded. But we won't have that problem anymore.
21:43:30 <TimStarling> I'm not sure of the exact numbers, but file description pages for example obviously need metadata loaded, for all revisions of the file
21:43:48 <AaronSchulz> gwicke: you could always change the ES id if you change the metadata. We probably shouldn't focus on hash vs ID too much for that :)
21:43:53 <jynus> however, maybe that could be a decision that we do not need to take now?
21:44:01 <TimStarling> things that use ImageListPager probably need metadata too, since they generate thumbs
21:44:05 <tgr> yeah OCR text should be split out of generic metadata
21:44:11 <Krinkle> Okay, so that means for the current migration it is orthogonal. (Supporting external store for img_metadata)
21:44:14 <tgr> that's T32906 / T99263 fwiw
21:44:15 <stashbot> T32906: Store DjVu extracted text in a structured table instead of img_metadata - https://phabricator.wikimedia.org/T32906
21:44:15 <stashbot> T99263: Store Pdf extracted text in a structured table instead of img_metadata - https://phabricator.wikimedia.org/T99263
21:44:19 <TimStarling> hmm, and using images from within the parser will generate thumbs and so need metadata
21:44:21 <Krinkle> TimStarling: Should we do it before or after the schema change?
21:44:45 <TimStarling> can be separate, the only question we need to answer right now is whether filerevision.fr_metadata will exist
21:44:51 <TimStarling> and I guess the answer to that is yes
21:45:01 <robla> #info discussion of q2 has been about historic reason for having it (e.g. djvu)
21:45:08 <TimStarling> we don't have nearly enough time for question 3 now you know
21:45:22 <TimStarling> but we can give it a go
21:45:28 <Krinkle> Yeah :)
21:45:34 <DanielK_WMDE> SMalyshev: https://phabricator.wikimedia.org/T144447
21:45:38 <Krinkle> So last question 3: Migration.
21:45:45 <AaronSchulz> haha
21:45:47 <robla> #info question 3 discussion starts. Migration
21:46:02 <DanielK_WMDE> tgr: perhaps OCR text can wait for MCR? seems like an obviour candidate
21:46:04 <Krinkle> Background in 3. Migration at https://phabricator.wikimedia.org/T589#2541630
21:46:25 <jynus> I chatted with Tim about a potential path at the data side, I think I summarized on the ticket response
21:47:06 <Krinkle> I'd like to avoid adding temporary logic to MediaWiki for ignoring 'in-mid-air' rows since it's complicated technical debt to have to maintain. Especially since we support upgrades from older versions.
21:47:15 <jynus> keep X table doesn't mean we have to keep the names
21:47:19 <Krinkle> I fear we'd have to keep it forever.
21:47:39 <TimStarling> last time jynus mentioned views
21:48:04 <Krinkle> Good point.
21:48:07 <TimStarling> I don't have a fully-worked plan for using them but it seems like that could avoid the need for MW migration code
21:48:07 <gwicke> AaronSchulz: I agree that any blob id will help with cache invalidation, but there are differences in dedup, multi-repo use (commons vs. local project) etc; numeric ids are basically less efficient for a lot of those problems, but have the advantage of being slightly shorter
21:48:52 <Krinkle> So that would allow for migrating rows *to* the new table while keeping the old MediaWiki logic live.
21:49:01 <robla> how does non-duplication interact with 1) het deploy of MW versions and 2) ability to back it out if it fails?
21:49:03 <TimStarling> can you insert into a view?
21:49:19 <Platonides> TimStarling: no
21:49:21 <jynus> TimStarling, yes
21:49:26 <Platonides> what?
21:49:30 <Krinkle> :D
21:49:32 <jynus> if you have an "insertable view"
21:49:48 <DanielK_WMDE> https://dev.mysql.com/doc/refman/5.7/en/view-updatability.html
21:49:48 <jynus> which basically means a 1:1 relationship with the original table
21:49:53 * DanielK_WMDE just googled
21:49:54 <gwicke> robla: it's not like we are running two versions of commons at the same time right now
21:50:22 <jynus> you cannot insert into a "GROUP BY" view, means mostly
21:50:39 <robla> #info <TimStarling> can you insert into a view? <jynus> if you have an "insertable view"
21:50:40 <Krinkle> We always require backward compatibility indeed because of FileRepo on wmf wikis querying commons.
21:50:48 <Krinkle> Which can be one version ahead or behind Commons
21:50:53 <Krinkle> robla: thanks
21:50:59 <jynus> but you can insert into a "WHERE deleted=0" view
21:51:22 <Platonides> I had created https://phabricator.wikimedia.org/T134827 precisely because I thought (after testing) that you could not insert into views
21:51:33 <jynus> ha ha
21:51:51 <jynus> so that is another reason to keep the original tables around (even if they are renamed)
21:51:51 <Krinkle> So if we temporarily add a 'fixme' column, and migrate rows from image to oldimage (to be renamed).
21:52:00 <jynus> actually
21:52:05 <jynus> the plan was the other way
21:52:10 <TimStarling> yeah, the other way
21:52:13 <Krinkle> I suppose we'd do that in the background and then keep doing that until it's small enough and then we go into read-only mode while we update the software?
21:52:14 <jynus> as image was larger
21:52:19 <Krinkle> Sorry, yeah, that's what I meant.
21:52:22 <TimStarling> rename image to filerevision, create an image view that filters filerevision
21:52:44 <Krinkle> I hope we don't do any "*" select anywhere
21:52:45 <Platonides> how efficient are table renames, jynus ?
21:52:51 <TimStarling> table renames are O(1)
21:52:55 <jynus> renaming tables is cost "0" in mysql * (* conditions may apply)
21:53:02 <robla> #info <TimStarling> rename image to filerevision, create an image view that filters filerevision
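The rename-then-view plan in the #info above can be sketched like this, using SQLite purely for illustration (column set simplified to a flag; the real proposal would filter on proper ids). Note the MySQL-specific part is hedged in a comment: MySQL allows writes through simple single-table views, SQLite would need INSTEAD OF triggers for that:

```python
import sqlite3

# Sketch: rename the wide 'image' table to 'filerevision', then recreate
# 'image' as a view so old read queries keep working during migration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (img_name TEXT, img_timestamp TEXT, img_latest INTEGER);
INSERT INTO image VALUES
    ('A.jpg', '20160801000000', 0),
    ('A.jpg', '20160831000000', 1);
""")

# Step 1: the rename itself is O(1) -- no rows are copied.
conn.execute("ALTER TABLE image RENAME TO filerevision")

# Step 2: back-compat view restoring the old one-row-per-file contract.
# (In MySQL this simple single-table view is also insertable; in SQLite
# writes through it would need INSTEAD OF triggers.)
conn.execute("""
CREATE VIEW image AS
SELECT img_name, img_timestamp FROM filerevision WHERE img_latest = 1
""")

# An unmodified old-style query still works against the view.
rows = conn.execute("SELECT img_name, img_timestamp FROM image").fetchall()
print(rows)
```

As jynus notes below, views have no indexes of their own, only the underlying table's, so this works for read compatibility but heavy writes through the view would be a concern.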
21:53:03 <Platonides> xD
21:53:14 <Platonides> TimStarling: depends on the storage
21:53:15 <jynus> aka metadata lock, but that is an infra issue
21:53:29 <Platonides> I think there was something to take into account
21:53:31 <Krinkle> Given we need fewer fields in the new table, should we do the view the other way around?
21:53:42 <Platonides> just like renaming a db should be O(1)
21:53:46 <Platonides> but is actually not supported
21:54:00 <jynus> Krinkle, for performance reasons, you do not want to write heavily into a view
21:54:09 <jynus> poor indexing support
21:54:28 <Krinkle> Sure
21:54:33 <jynus> views do not have indexes, only the underlying table does
21:54:36 <TimStarling> if it's really necessary we can disable uploads for a few hours while migration takes place
21:54:42 <jynus> but if it works, it works
21:54:52 <jynus> (but it is a pain in labs)
21:54:58 <Platonides> how does that matter for writes?
21:54:59 <Krinkle> I think the whole migration will take longer than a few hours though, no?
21:55:08 <Krinkle> We need to do some of it in the background ideally
21:55:40 <TimStarling> what is the slowest part? copying all rows of commons.oldimage to commons.filerevision
21:55:42 <TimStarling> ?
21:55:46 <Krinkle> image->filerevision, create 'image' view for back-compat, add all oldimage rows to image.
21:55:54 <Krinkle> Right
21:56:06 <Krinkle> And adding primary keys
21:56:07 <TimStarling> can that be done with INSERT SELECT?
21:56:21 <TimStarling> on a depooled slave?
21:56:37 <TimStarling> that's why I say hours
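One way to avoid a single tens-of-millions-row INSERT SELECT, as discussed here, is to copy in small keyset-paginated batches. A minimal sketch under stated assumptions (SQLite standing in for MySQL, illustrative table and column names, and a real run would additionally wait for slave lag between batches):

```python
import sqlite3

# Batched-migration sketch: copy oldimage rows into filerevision in small
# chunks instead of one giant INSERT ... SELECT, so each transaction stays
# short and replication lag stays bounded.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE oldimage (oi_name TEXT, oi_timestamp TEXT);
CREATE TABLE filerevision (fr_name TEXT, fr_timestamp TEXT);
""")
conn.executemany("INSERT INTO oldimage VALUES (?, ?)",
                 [("F.jpg", "2016%010d" % i) for i in range(25)])

BATCH = 10
source_total = conn.execute("SELECT COUNT(*) FROM oldimage").fetchone()[0]
while True:
    with conn:  # one short transaction per batch
        cutoff = conn.execute(
            "SELECT COALESCE(MAX(fr_timestamp), '') FROM filerevision"
        ).fetchone()[0]
        # Keyset pagination on the timestamp: each batch resumes after the
        # highest value already copied, avoiding rescans of copied rows.
        conn.execute(
            "INSERT INTO filerevision "
            "SELECT oi_name, oi_timestamp FROM oldimage "
            "WHERE oi_timestamp > ? ORDER BY oi_timestamp LIMIT ?",
            (cutoff, BATCH),
        )
    total = conn.execute("SELECT COUNT(*) FROM filerevision").fetchone()[0]
    if total == source_total:
        break

print(total)
```

Because uploads keep appending rows with newer timestamps, the final (small) catch-up batch could be done during a brief read-only or uploads-disabled window, which matches Tim's "disable uploads for a few hours" fallback.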
21:56:40 <jynus> What?
21:57:00 <Krinkle> TimStarling: You want to do a musical chairs dance?
21:57:23 <jynus> I think the "new" table didn't have anything but an id and a few fields?
21:57:33 <jynus> a title, maybe
21:58:10 <DanielK_WMDE> a page_id, maybe?
21:58:11 <Scott_WUaS> thanks, All!
21:58:16 <jynus> sorry, I do not have the structure in front of me right now
21:58:18 <jynus> yes
21:58:18 <DanielK_WMDE> the link between file and page is a bit strange right now, no?
21:58:19 <Krinkle> Hm.. If we project old 'image' on top of new 'filerevision', we still need to create the new 'file' table.
21:58:21 <Krinkle> as well.
21:58:31 <TimStarling> you can't do INSERT SELECT on tens of millions of rows on the live master can you?
21:58:59 <Krinkle> Okay, I'll summarise input from question 1 and 2 on the task.
21:59:00 <jynus> TimStarling, I would say you cannot do it on a slave either- how do you do that and keep inserting?
21:59:08 <robla> we're just about out of scheduled time, but we'll go over 5 minutes (or longer if everyone prefers)
21:59:16 <Platonides> I think that would lag slaves
21:59:17 <TimStarling> we should wrap up
21:59:36 <robla> #info <Krinkle> Okay, I'll summarise input from question 1 and 2 on the task.
21:59:36 <Krinkle> We can continue talking in #wikimedia-tech jynus and TimStarling if you have time, and anyone else is welcome to join.
21:59:54 <TimStarling> I can't stay for long today
22:00:11 <robla> #info <Krinkle> We can continue talking in #wikimedia-tech jynus and TimStarling if you have time, and anyone else is welcome to join.
22:00:15 <Krinkle> TimStarling: I'd like to make sure I understand what you meant though.
22:00:52 <TimStarling> it really just needs design work
22:01:05 <TimStarling> whether it can be done in SQL should be more obvious once the design is further along
22:01:24 <Krinkle> Yeah, I think migration is clear enough for now that we know we have several options.
22:01:38 <jynus> My advice: avoid sql and insert selects
22:01:43 <Krinkle> We know that it can be done in a performant way. The details can be figured out in code review and elsewhere.
22:02:09 <Krinkle> For the actual migration that is. As long as we know we can do it.
22:02:09 <robla> ok, is that what we should wrap up on?
22:02:30 <robla> 60 second warning
22:02:42 <Krinkle> I guess it's a bit late for last-call now. I'll finalise over the coming week and propose for next week (no rfc meeting but just the last-call announcement)
22:02:50 <Krinkle> sounds good?
22:03:07 <robla> ah, right, yeah, I think that sounds good
22:03:27 <TimStarling> we don't have a full design to approve at the moment, there are a lot of (literal) question marks in the phab task
22:03:45 <robla> Krinkle: you're proposing that you do the design, put it up, give people one week in last call?
22:03:45 <TimStarling> we can discuss procedure in next week's committee meeting
22:04:30 <Krinkle> robla: one week from next week.
22:04:34 <robla> yup. ok. now 30 second warning :-)
22:04:47 <robla> thanks everyone!
22:04:58 <Krinkle> TimStarling: Yeah, but good stuff to think about :)
22:05:09 <Krinkle> Thanks!
22:05:12 <robla> #endmeeting
