21:01:41 #startmeeting ArchCom Meeting about Multi-Content Revisions (T107595)
21:01:41 Meeting started Wed Sep 21 21:01:41 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:01:41 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:01:41 The meeting name has been set to 'archcom_meeting_about_multi_content_revisions__t107595_'
21:01:41 T107595: [RFC] Multi-Content Revisions - https://phabricator.wikimedia.org/T107595
21:01:42 hm, I'm still wondering whether we should go for the detail questions first to get stuff done, or the broader questions first, for guidance...
21:02:11 #topic Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:02:53 hi everyone
21:03:08 robla: do you think it would be ok to talk about schema details for half an hour, and then cut off and move on to discussing the migration?
21:03:55 DanielK_WMDE: possibly. what are you hoping we accomplish in today's conversation?
21:04:19 1) sort out the remaining details of what the schema should look like
21:04:35 2) get feedback about whether the migration plan is sane
21:05:21 (Hello:)
21:06:02 DanielK_WMDE: I'm assuming we're not ready to actually resolve the schema in the course of this hour though, correct?
21:06:39 not as a final decision. i do hope to get opinions on my questions.
21:06:42 that plan sounds good to me
21:06:52 and perhaps even answers :)
21:07:06 so, the most important question regarding the schema is whether we should add one layer of indirection, or two. Adding only one layer of indirection means repeating the meta-data about the content of each slot for every revision.
21:07:44 Can you please post an example URL - re "The idea of this RFC is to allow multiple Content objects to be associated with a single revision (one per "slot"), resulting in multiple content "streams" for each page"? In what ways are Wikidata Q items involved here?
21:07:47 Doing it that way keeps the schema simpler, but means a lot of redundant data. The basic schema is then:
21:08:02 Scott_WUaS: they are not involved
21:08:22 Thanks
21:08:24 The "basic" version of the schema looks like this:
21:08:26 [page] --page_current--> [revision] <--cont_revision-- [content] --cont_address--> (text|external)
21:08:38 ok
21:09:08 As an alternative, we can add another table, the "slot" table, to tell us which content belongs to which revision, so the content meta-data can be re-used for multiple (typically consecutive) revisions
21:09:44 so if we store categories in a separate slot, and the categories are not touched by 10 edits, we would recycle the meta-data about the content of the category slot 10 times.
21:09:51 the schema would look like this:
21:09:57 [page] --page_current--> [revision] <--slot_revision-- [slots] --slot_content--> [content] --cont_address--> (text|external)
21:10:10 I guess we have no jynus this week
21:10:12 (DanielK_WMDE: Is there an existing example URL which you may develop further?)
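(A minimal SQL sketch of the two variants just described, using the link names from the diagrams above; the actual column set is spelled out on the Content_Meta-Data page linked below, and names like slot_role, cont_model and cont_sha1 here are illustrative assumptions, not settled names.)

    -- Variant with the extra "slots" table: an untouched slot re-uses its
    -- content row across consecutive revisions. The one-layer variant would
    -- drop the slots table and put a cont_revision column directly on
    -- content, repeating this meta-data for every revision.
    CREATE TABLE content (
      cont_id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      cont_address VARBINARY(255)  NOT NULL, -- where the blob lives (text table or external store)
      cont_model   VARBINARY(32)   NOT NULL, -- e.g. 'wikitext', 'json'
      cont_size    INT UNSIGNED    NOT NULL,
      cont_sha1    VARBINARY(32)   NOT NULL
    );

    CREATE TABLE slots (
      slot_revision BIGINT UNSIGNED NOT NULL, -- the revision this row belongs to
      slot_role     VARBINARY(32)   NOT NULL, -- the slot name, e.g. 'main', 'categories'
      slot_content  BIGINT UNSIGNED NOT NULL, -- reference to content.cont_id
      PRIMARY KEY (slot_revision, slot_role)
    );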
21:11:01 schema details: https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Database_Schema
21:11:03 #link https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Database_Schema
21:11:15 #link https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Re-using_Content_Rows
21:11:31 thanks
21:11:48 TimStarling: looks like it... who else would have an opinion on the schema?
21:12:01 DanielK_WMDE: is there an asynchronous conversation that is still moving forward?
21:12:20 no. not with me anyway
21:12:52 I can try to be surrogate jynus and raise a few of his points
21:13:19 great :)
21:13:20 TimStarling: that would be helpful.
21:13:24 my fear is that most of the asynchronous conversation has been in private email. that makes it hard to then hope for a good public IRC conversation
21:13:29 TimStarling: thanks!
21:13:35 surrogate jynus says: you want to store media info in a slot. Let's have a media_info table
21:13:47 yeah, need to distill it down, the email convos were pretty high-bandwidth :)
21:13:48 then that table will be small and easy to handle
21:14:04 DanielK_WMDE: I wonder if it's good to hold current and old content in the same place...
21:14:14 TimStarling: what would the media_info table contain? the actual json blob?
21:14:17 in history, present a union between revision and media_info if users really really want that
21:14:34 unclear
21:14:38 SMalyshev: that's actually a good point, leading to ideas about partitioning 'hot' and 'cold' data. for another time probably, but we need to be thinking about it at some point
21:14:54 if we're already refactoring DB structure...
21:14:59 SMalyshev: so far, the answer looks like yes: moving data between tables when the current version becomes an archived version is a major pain.
21:15:10 (nitpick: if the slot table is only used as a many-to-many binding between revision and content, can we just call it revision_content? it's hard to keep up with the terminology)
21:15:34 SMalyshev: we (tim, mostly) moved main storage away from that 10 years ago, we are now planning to move image meta-data away from it too. but it's a possible parameter for partitioning.
21:15:43 tgr: I think the idea is that some of the slots are revision_content_derivedcontent though.
21:15:52 from my perspective, what I'm really lacking about this MCR thing is any context on its higher-level purpose and utility. All of the details are deep, but no simple big picture about why we're doing this.
21:16:06 tgr: E.g. revision 3 -> wikitext -> JSON representation of the template or whatever.
21:16:12 tgr: it was called that, I changed it to be in line with the use of "slots" in the conceptual model. i don't care about the name
21:16:27 bblack: at a high level, we want to be able to break things out of wikitext into structured data that's still atomically versioned with the wikitext
21:16:35 DanielK_WMDE: what's the idea behind reusable content? I.e. is that useful for something?
21:16:46 brion: higher-level than that :)
21:16:56 bblack: there's a list of use cases
21:17:00 :)
21:17:02 I mean, wikitext does have some kind of structure. a single content can have internal structure in general
21:17:07 https://www.mediawiki.org/wiki/Multi-Content_Revisions#Use_Cases
21:17:21 bblack: "We want to move awat from MW's 1:1 relationship between "page" and "content"."
21:17:24 Err. Away.
21:17:24 TimStarling: that "unclear" bit is the problem i have with discussing the "store in dedicated table" option. how will the content be versioned?
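(A rough sketch of surrogate-jynus's dedicated-table idea and the history union mentioned above; every name here is made up for illustration, and how such rows get versioned is exactly the open question just raised.)

    -- A per-slot table holding structured media info directly, instead of
    -- going through a generic content/slots table.
    CREATE TABLE media_info (
      mi_id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      mi_page      INT UNSIGNED    NOT NULL, -- the file page this belongs to
      mi_rev_id    BIGINT UNSIGNED NOT NULL, -- if it shares revision's id space (discussed below)
      mi_timestamp BINARY(14)      NOT NULL,
      mi_data      MEDIUMBLOB      NOT NULL  -- the JSON blob ("unclear" what exactly goes in it)
    );

    -- Page history as a union of the two, "if users really really want that":
    SELECT rev_id, rev_timestamp, 'main' AS slot
      FROM revision WHERE rev_page = 12345
    UNION ALL
    SELECT mi_rev_id, mi_timestamp, 'mediainfo' AS slot
      FROM media_info WHERE mi_page = 12345
    ORDER BY rev_timestamp DESC;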
21:17:47 bblack: https://www.mediawiki.org/wiki/Multi-Content_Revisions#Use_Cases
21:17:48 DanielK_WMDE: it would be linked to page and have its own timestamp
21:17:59 like a clone of revision
21:18:08 TimStarling: and its own edit comment, reference to user, and so on?
21:18:09 TimStarling: So we'd JOIN on string-matched timestamps?
21:18:13 yes
21:18:15 Eww.
21:18:22 no
21:18:30 yes to DanielK_WMDE, no to James_F
21:18:34 Ah.
21:18:36 a related alternative would be to have each 'slot' live in a separate table, but all use the same revision key with metadata in revision. thus text edits would (or could) live in a separate table from revision too
21:18:37 TimStarling: so we would duplicate the revision table for each kind of content, and use unions everywhere we want to list revisions?
21:18:39 So it would have the revision_id in it?
21:18:46 but you'd have a consistent revision_id and place to search on
21:19:16 but there's some benefit in consistency and normalization, especially when we need to bulk-fetch data for dumps or otherwise handle them opaquely
21:19:22 at the SQL level you'd have several totally distinct revision concepts, like how oldimage and revision are separate now
21:19:34 TimStarling: i can't see that working, it sounds hideously complex to me. but maybe i'm just not seeing the elegance of it all.
21:19:35 #chair robla brion DanielK_WMDE TimStarling
21:19:35 Current chairs: DanielK_WMDE TimStarling brion robla
21:19:37 at the application layer these may optionally be merged by a UNION
21:19:42 (what are the implications for multiple languages and translation here in Multi-Content Revisions, if any?)
21:20:01 brion: so, have one revision table, but basically one "content" table per slot?
21:20:02 * robla steps afk for 2 minutes
21:20:18 Scott_WUaS: interesting question. one _could_ store multiple wikitext Content items as well, one per language
21:20:19 brion: that's more doable, but still needs big joins or unions.
21:20:23 Scott_WUaS: "Complicated". There are options to fundamentally re-work Translate and parallel translation based on MCR, but this is a bit out of scope.
21:20:29 though i'm not sure it's ideal for the way translations get versioned
21:20:33 brion: *cough*DOM-based translation*cough*
21:20:36 FWIW, I think most of those use-cases sound like metadata more than parallel alternative content, except for the ones that seem like they could just be separate objects (e.g. template+css), or embedded documentation
21:20:48 thanks
21:20:59 bblack: the big reason i want MCR for 'separate objects' is atomic versioning
21:21:11 having a high-level abstraction in MW around several similar tables is an idea that was mentioned in that book jynus was passing around
21:21:13 template + css, gadget js+css, etc
21:21:20 you know, feature table and bug table
21:21:26 bblack: File description (wikitext), meta-data (JSON), and file (pointer to the BLOB) versioned together is the ambition.
21:21:30 TimStarling, brion: can we assume that the revision or content tables that would exist per slot would all contain *exactly* the same fields?
21:21:42 no
21:22:05 i think if we had separate tables they'd explicitly want to be different, otherwise it's only a partitioning mechanism
21:22:15 but that changes the interfaces
21:22:17 brion: that's what i'm thinking
21:22:19 if they're exactly the same then you have sharding, and jynus doesn't really seem keen on sharding
21:22:22 i just don't see how they would be different
21:22:30 I'll switch back from being pseudo-jynus to TimStarling for a second
21:22:30 and for data where the structured data would go straight into a table that makes sense
21:22:37 let's do sharding, I like sharding
21:22:38 for where everything's a big blob, i don't see the benefit of splitting
21:22:39 :)
21:22:52 what's your preferred axis to shard on here tim?
21:22:58 TimStarling: Do we have a plan for stopping the current tables from getting "too long" other than sharding? (Ignoring this change, which might make the rate of growth faster.)
21:23:18 TimStarling: yes, +1 for sharding/partitioning. let's have an RFC about that
21:23:33 yups
21:24:23 well, the existing recentchanges partitioning hack splits on user ID
21:24:27 brion: to what level do you expect it to be atomic? you'd still be fetching js+css as 2x http fetches, right? it seems like there are simpler ways to solve the problem of always fetching synced revs of such things...
21:24:31 (i like the idea of a 'hot'/'cold' separation with a union-like interface, with a consistent revision id lineage so most things won't notice the difference other than potentially issuing two queries and combining them)
21:24:35 #info discussion of sharding for much of the first part of the meeting
21:24:40 which optimises for contributions queries
21:24:43 brion: re "everything is a big blob": if we want to move away from that, we need a document-oriented db. the content models we have would be a pain to model on an rdbms. not to mention that they would create absolutely humongous tables.
21:24:45 I've been lazily assuming that at some point we'd shard revision based on something (modulo the page_id?) but I don't know what's ideal.
21:24:52 bblack: http? oh no, i mean inside, like the parser
21:25:02 or the html that specifies which js/css to load
21:25:36 anyway i think we should address sharding/partitioning later, more explicitly
21:25:37 i would prefer to shard by mod(page_id). or timestamp blocks.
21:25:56 Yeah, let's fork that to another RfC.
21:26:12 one possibility is to duplicate the revision table: once with user-based sharding (for contributions), and again with page/timestamp sharding (for history)
21:26:22 denormalize the revision table, in other words
21:26:22 so, if that's for another rfc, can we move forward with this one?
21:26:44 bblack: so the alternative to atomic updates of multiple content blobs in one revision is to build another versioning abstraction on top of multiple pages
21:26:59 bblack: which is certainly possible too
21:27:08 TimStarling: basically, duplicate it. yea.
21:27:23 so, key question: is it ok to maintain the meta-data for all slot content in a single table?
21:27:35 with sharding to be discussed?
21:27:44 I think the key question is project order: does sharding/partitioning block MCR?
21:27:45 brion: or question why we're trying to version-sync css+js inside wiki articles in the first place...
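(For concreteness, a minimal sketch of what the "shard by mod(page_id)" idea could look like using native MySQL hash partitioning; the table and column names are illustrative, and it assumes the primary key is widened to include rev_page, which MySQL partitioning requires. Not a worked proposal.)

    -- Page-history queries would then only ever touch one partition, while
    -- a user-sharded copy of the table could serve contributions queries,
    -- as suggested above.
    CREATE TABLE revision_by_page (
      rev_id        BIGINT UNSIGNED NOT NULL,
      rev_page      INT UNSIGNED    NOT NULL,
      rev_timestamp BINARY(14)      NOT NULL,
      PRIMARY KEY (rev_page, rev_id)
    )
    PARTITION BY HASH (rev_page)
    PARTITIONS 16;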
21:27:51 DanielK_WMDE: i say yes, as long as we keep it compact and have a future plan to shard that won't explode based on our changes :D
21:28:12 bblack: well, "scratch mediawiki, just use github" is a third option ;)
21:28:30 TimStarling: that's also an important question, yes, though i think we can decide on the schema without knowing whether implementation is blocked on sharding
21:28:34 I suspect jynus is on the verge of vetoing MCR until we have better scalability
21:28:43 it seems to be ok to have _lots of rows_ (tall tables) as long as those table rows are small (narrow)
21:29:09 data size is a relevant metric, yes
21:29:16 TimStarling: i'm fine with him vetoing implementation on these grounds. but i need to know whether and how i should change the design.
21:29:36 implementation on the cluster = deployment
21:29:38 for example, you have to copy all the data in a table during ALTER TABLE, and that is becoming a problem
21:29:54 remember it was a problem in the olden days too
21:30:09 brion: or any of the thousands of saner ways to develop->deploy css and js than "do it inside the wiki it's meant to operate on, shoe-horning it in as if it's like article content, and then remodel the wiki software to support that use case poorly"
21:31:01 bblack: if you want it to be user-maintained, i don't really see an alternative. but the css/js use case isn't really at the focus of this.
21:31:06 (not entirely fair, but as fair as your github retort)
21:31:22 bblack: oh sure, you're not wrong. :) there's tradeoffs in all these directions
21:31:41 and honestly, using a git-oriented backend for code? not an awful idea at all
21:31:53 i'm still trying to find out whether i can go ahead with implementing the revision<-slot->content schema
21:32:04 brion: It's on the backlog. Let's not get further distracted from the RfC. ;-)
21:32:07 but even if we broke out gadgets/userscripts we've got these on-wiki data objects :D
21:32:10 yep
21:32:12 or whether all work on this needs to rest until we have an rfc on optimizing revision storage & sharding
21:32:24 I don't see how you can implement it if you can't deploy it
21:32:25 or whether there is a concrete request to change the db schema i propose
21:32:35 I get the impression that jynus has to answer that :)
21:33:17 jynus is always reluctant to use the veto power we keep wanting to give him :)
21:33:18 be gentle
21:33:27 TimStarling: we can get the code ready for deployment while we are also working on, or deciding on, optimization strategies for revision storage.
21:33:37 I don't think we're going to get on board with jynus's idea of splitting the revision concept
21:33:54 but I think we should work by consensus
21:34:03 *nod*
21:34:24 is jynus's idea spelled out somewhere?
21:34:36 so if we want consensus but won't get on board with his idea, then we need to convince him?...
21:34:45 we've got some bits of discussions, no concrete alt proposal
21:34:59 robla: no, not really, he was reluctant to dive in and do a fully worked schema
21:35:12 DanielK_WMDE: right
21:35:35 i have tried and failed
21:36:38 DanielK_WMDE: I think one thing that may be slowing this conversation down is it getting too bogged down in details
21:36:59 there's a *lot* to sort through here: https://www.mediawiki.org/wiki/Multi-Content_Revisions
21:37:04 I don't want to get into detail about tactics in this discussion
21:37:40 how would it work to implement it but not deploy it? would you be able to have a feature flag in MW? or would it have to be a branch?
21:37:52 robla: yes, that's why I announced only the schema bit as today's topic: https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data
21:37:58 Branch or unmerged commit.
21:38:21 robla: that's already quite a bit, but I think it is manageable.
21:38:32 where are we meant to have the bigger discussion? I just don't get architecting the details before having some consensus that this is the right model for some real use-cases. The use-cases section mentions its own speculative nature, many of them are more metadata than parallel separate content, which is an entirely simpler case to handle. the rest are questionable, IMHO...
21:38:58 maybe that's due to my lack of information, but still
21:39:01 TimStarling: We will need feature flags for the migration/transition anyway. So, yes.
21:39:22 it would be nice to have, say, two initial use cases which would be implemented first
21:39:31 TimStarling: hopefully, if/then/else cruft can be kept to a minimum by swapping in alternative implementations of the relevant components.
21:40:13 * brion hmms
21:40:18 TimStarling: the first two in the list: MediaInfo and PageAssessments.
21:40:29 That could work.
21:40:39 MassMessage is also a hot candidate I think
21:40:44 And TemplateData. ;-)
21:40:55 (As it's so simple.)
21:41:00 ok, i think i'm going to try fleshing out an alt proposal along some, but not all, of jynus and surrogate-jynus's lines, and we can just compare that
21:41:06 presumably we will have an MCR-aware API, and all the if/else will be in the implementation of that API
21:41:08 it'll be good to have some key use cases to go along with that
21:41:29 RevisionLookup
21:41:33 bblack: if it's editable and versioned, it's not meta-data
21:41:50 cause if we do concentrate on cases where the secondary slots are special kinds of data, maybe extra tables aren't too awful. but maybe they are ;)
21:42:02 TimStarling: yes, exactly
21:42:34 maybe we should start moving towards rev_id being opaque rather than an auto-increment integer
21:42:55 but still an integer?
21:42:57 brion: i'm not thinking of secondary (derived) slots any more. just primary user-editable content.
21:43:06 a UUID might make more sense if it is sharded
21:43:07 right, sorry, wrong term :)
21:43:13 i mean non-main-wikitext slots
21:43:25 but yes, still an integer initially
21:43:30 but maybe type-hinted as a string
21:43:34 TimStarling: or a time-uuid. gabriel loves those.
21:43:38 TimStarling: for multi-master insert that can be important, yes
21:43:48 But they are big. We are trying to make that table smaller, right?
21:43:58 bigints are smaller :)
21:44:00 (but we are discussing the revision table again)
21:44:02 i will just warn about BigInts and the JavaScript/node 53-bit limit though
21:44:16 brion: extra tables would need some PHP-layer abstraction on top of our current DB abstraction, for all code that needs to search or iterate all content. That seems scary.
21:44:29 tgr: very.
21:44:30 reminds me of https://gerrit.wikimedia.org/r/#/c/16696/20/includes/rdbstore/RDBStore.php
21:44:45 * AaronSchulz almost forgot about that, haha
21:44:49 tgr: yeah, at least some would need to add to the tables joined on things. others would not actually need to touch those tables, though, and would only care about what's in revision i think
21:44:50 brion: you are worried that we will exceed 2^53 rows in a table? ;)
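(A tiny illustrative sketch of the "bigger but still integer rev ids" direction touched on above; not a proposed migration, just what the column change itself would look like.)

    -- Widen rev_id so sharded or non-sequential id schemes (e.g. a 64-bit
    -- "mini uuid") have room. Caveat raised above: JavaScript/Node clients
    -- only represent integers exactly up to 2^53 - 1, so ids beyond that
    -- would have to be exposed as strings ("type-hinted as a string").
    ALTER TABLE revision
      MODIFY rev_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;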
21:45:07 (of course half of that was wild experimentation that would never be used)
21:45:12 depends how fine-grained we make editing ;)
21:45:16 AaronSchulz: PTSD flashbacks to that code? ;-)
21:45:17 AaronSchulz: that's basically home-grown partitioning, right?
21:45:26 non-sequential revids may be problematic as it'd be impossible to know the order
21:45:37 mainly i was thinking of whether we do something clever like a 64-bit mini uuid
21:46:02 i'm getting worried that I'm stranded with this with no way to actively move forward.
21:46:11 yeah :(
21:47:00 can i at least get some feedback on "super tall content table" vs "not-so-tall content table + super tall slots table"?
21:47:17 i am strongly in favor of super tall slots table
21:47:22 I like the second one better
21:47:23 as in https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Re-using_Content_Rows
21:47:24 lets us keep the content table much smaller
21:47:26 bblack: maybe you can discuss your concerns on https://www.mediawiki.org/wiki/Talk:Multi-Content_Revisions ?
21:47:43 if we're going to have a huge table, it's better to have it as "narrow" as possible
21:47:46 basically this proposal is blocked on deciding how to handle very tall tables, which is something that needs to be decided soon anyway, right?
21:48:00 so maybe just give up for now and make that decision happen as soon as possible?
21:48:01 ok. how about I work on some strawman code that allows us to look at the schema with some data in it, maybe on labs?
21:48:13 would that help, or would it be a waste of time?
21:48:14 is there a wiki page / talk page / phab task that discusses ops concerns with the MCR proposal?
21:48:30 subbu: I'm not aware of any
21:48:43 tgr: i'm not sure there is a generic answer to that question. it may very much depend on the table.
21:49:41 subbu: there is one comment by jynus: https://www.mediawiki.org/wiki/Topic:Tb6fok3z43ar16fe
21:49:58 i frankly can't extract much guidance from it
21:50:56 "I will create an alternative one" -- maybe we just need to nag jynus to write that
21:51:03 thanks for the refresher about the link, DanielK_WMDE
21:51:16 #link https://www.mediawiki.org/wiki/Topic:Tb6fok3z43ar16fe
21:51:27 TimStarling: please do, i'm quite curious
21:52:01 well, on the tactics front, I'm hoping that ArchCom doesn't become NagCom ;-)
21:52:08 heh
21:52:21 or ArgCom...
21:52:47 I think it may be a useful conversation starter to *attempt* to come up with what jynus is shooting for
21:53:00 DanielK_WMDE: I'm up for discussing partitioning, since I still remember thinking about that a lot in the past. My inclination is tall-and-narrow metadata => sharded blobs though
21:53:21 robla: i honestly can't imagine how it would work. if i could, i would have proposed it.
21:53:50 AaronSchulz: yes, i'm with you there. And I also think we should discuss sharding.
21:53:53 I mentioned some ideas about making revision narrower, jynus was receptive to those
21:54:02 AaronSchulz: who's going to drive that conversation?
21:54:40 like splitting out rev_comment, you know we have a bug to make rev_comment be larger than 255 bytes
21:54:56 * AaronSchulz shrugs... probably would be good to know what parameters jynus wants
21:55:00 yeah, rev_comment and rev_user_text are easy wins
21:55:03 (Hoping all can keep this helpful conversation going - https://www.mediawiki.org/wiki/Topic:Tb6fok3z43ar16fe )
21:55:43 AaronSchulz: yes, that would be good
21:56:04 TimStarling, brion: so, who's going to write an rfc about optimizing row size in revision?
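(A quick sketch of the "split out rev_comment" easy win mentioned above, with illustrative names only; it would also lift the 255-byte limit on comments that the bug above refers to.)

    -- Comments move to their own table; revision keeps only a narrow
    -- integer reference, and identical comments could even be de-duplicated.
    CREATE TABLE comment (
      comment_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      comment_text BLOB            NOT NULL
    );
    -- revision would then carry rev_comment_id BIGINT UNSIGNED instead of
    -- rev_comment VARBINARY(255); rev_user_text could be handled similarly
    -- with a reference to a user/actor row.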
21:56:18 TimStarling: Would we make rev_comment just another slot?
21:56:41 i can do that if TimStarling isn't excited about it, we have some good ideas from last week's offline discussion
21:57:13 * brion compacts ALL the rows!
21:57:19 brion: yay :)
21:57:27 ok brion, compact away, I will comment on it
21:57:31 i'm happy to help and give input, but i don't see me driving this
21:57:33 too much on my plate
21:57:51 no worries
21:57:55 the problem with pausing MCR is: i have made room for this in my schedule *now*
21:57:56 #info 14:55:00 yeah rev_comment and rev_user_text are easy wins
21:58:06 if we drop this for 3 months, I have *no* idea when i can get back to working on it
21:58:11 great, i'll write those up in the next couple of days
21:58:15 it also pushes back the schedule for structured commons
21:58:31 do we need MCR for structured commons?
21:58:39 (Thanks, All!)
21:58:51 I mean need like "no way we can do structured commons without it"?
21:58:56 It would be super if we could not delay that again...
21:59:06 #info re "super tall content table" vs "not-so-tall content table + super tall slots table": i am strongly in favor of super tall slots table / I like the second one better
21:59:43 #info brion to write up additional RfC on compacting rows in revision table (should apply with or without MCR)
22:00:06 ok... should we end the official part of this meeting on that?
22:00:09 brion: will partitioning be part of that?
22:00:23 * robla plans to hit #endmeeting in 120 seconds
22:00:36 SMalyshev: pretty much, yes.
22:00:41 DanielK_WMDE: not explicitly, but i'll mention some related concerns
22:01:06 can expand on that if we decide we must super-prioritize it
22:01:12 SMalyshev: at least if we want to stick to the product requirements as set out by the WMF back in the day.
22:01:17 brion, thanks for taking that on!
22:01:36 DanielK_WMDE: well, you say you can implement it with a feature switch, which should be relatively uncontroversial
22:01:36 :D
22:01:36 so, reading that talk page topic, iiuc, jynus is objecting to using a single unified table for all slots and prefers different tables for different slots?
22:02:09 we can continue the conversation in #wikimedia-tech for those who want to
22:02:27 thanks all!
22:02:32 #endmeeting