Agenda
- Location: #wikimedia-office IRC channel
- Meeting type: TBD
- Time: Weekly, Wednesday 21:00 UTC (2pm PDT, 23:00 CEST)
- Topic:
- T107595 - Multi-Content Revisions
This meeting is mainly about the Content Meta-Data Storage for MCR.
Detail questions:
- Do we want a single "names" table, or separate tables for different kinds of names, e.g. content_model, content_format, etc.?
- Can we drop cont_hash (or cont_sha1) and cont_logical_size?
- Do we re-use or copy content rows?
- If we re-use, does the role live in the content or in the slot label?
Broader questions:
- Are the scaling and efficiency estimates correct?
- What options do we have for optimization?
- Will the proposed migration plan work?
Meeting summary
- Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ (robla, 21:02:11)
- LINK: https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Database_Schema (DanielK_WMDE, 21:11:03)
- LINK: https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Re-using_Content_Rows (DanielK_WMDE, 21:11:15)
- LINK: https://www.mediawiki.org/wiki/Multi-Content_Revisions#Use_Cases (TimStarling, 21:17:07)
- discussion of sharding for much of the first part of the meeting (robla, 21:24:35)
- LINK: https://www.mediawiki.org/wiki/Topic:Tb6fok3z43ar16fe (DanielK_WMDE, 21:51:16)
- 14:55:00 <brion> yeah rev_comment and rev_user_text are easy wins (robla, 21:57:56)
- re "super tall content table" vs "not-so-tall content table + super tall slots table": <brion> i am strongly in favor of super tall slots table <SMalyshev> I like the second one better (DanielK_WMDE, 21:59:06)
- brion to write up additional RfC on compacting rows in revision table (should apply with or without MCR) (brion, 21:59:43)
Meeting ended at 22:02:32 UTC.
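
The trade-off between a "super tall content table" and a "not-so-tall content table + super tall slots table" can be illustrated with a small sketch. The Python snippet below uses an in-memory SQLite database. The column names `slot_revision`, `slot_content`, and `cont_address` come from the discussion; `cont_id`, `slot_role`, `rev_page`, the address strings, and the `simulate` helper are hypothetical and only serve the illustration. This is not the proposed MediaWiki schema, just a toy model of re-using content rows.

```python
import sqlite3

# Toy version of the two-level layout: revision <- slots -> content.
SCHEMA = """
CREATE TABLE revision (
    rev_id   INTEGER PRIMARY KEY,
    rev_page INTEGER NOT NULL
);
CREATE TABLE content (
    cont_id      INTEGER PRIMARY KEY,
    cont_address TEXT NOT NULL        -- pointer to the blob (text|external)
);
CREATE TABLE slots (
    slot_revision INTEGER NOT NULL REFERENCES revision(rev_id),
    slot_role     TEXT NOT NULL,      -- hypothetical: role stored on the slot
    slot_content  INTEGER NOT NULL REFERENCES content(cont_id),
    PRIMARY KEY (slot_revision, slot_role)
);
"""


def simulate(edits):
    """Apply `edits` edits that change only the main slot.

    Returns (number of content rows, number of slot rows).
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)

    # The categories content is written once and then re-used.
    categories_id = conn.execute(
        "INSERT INTO content (cont_address) VALUES (?)",
        ("text:categories-1",),
    ).lastrowid

    for n in range(1, edits + 1):
        rev_id = conn.execute(
            "INSERT INTO revision (rev_page) VALUES (1)"
        ).lastrowid
        # Each edit produces new main-slot content ...
        main_id = conn.execute(
            "INSERT INTO content (cont_address) VALUES (?)",
            (f"text:main-{n}",),
        ).lastrowid
        # ... but the categories slot points at the existing content row.
        conn.executemany(
            "INSERT INTO slots (slot_revision, slot_role, slot_content) "
            "VALUES (?, ?, ?)",
            [(rev_id, "main", main_id), (rev_id, "categories", categories_id)],
        )

    content_rows = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
    slot_rows = conn.execute("SELECT COUNT(*) FROM slots").fetchone()[0]
    conn.close()
    return content_rows, slot_rows
```

With 10 edits that leave the categories slot untouched, `simulate(10)` returns `(11, 20)`: 11 content rows and 20 narrow slot rows. Without re-use, every revision would carry its own content row for each slot, giving 20 content rows instead.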
People present (lines said)
- DanielK_WMDE (92)
- brion (61)
- TimStarling (52)
- robla (25)
- James_F (21)
- bblack (10)
- Scott_WUaS (10)
- SMalyshev (9)
- AaronSchulz (6)
- tgr (4)
- wm-labs-meetbot (4)
- wm-labs-meetbot` (4)
- subbu (2)
- stashbot (1)
- marktraceur (1)
Log
| # | Log |
|---|---|
| 1 | 21:01:41 <robla> #startmeeting ArchCom Meeting about Multi-Content Revisions (T107595) |
| 2 | 21:01:41 <wm-labs-meetbot> Meeting started Wed Sep 21 21:01:41 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot. |
| 3 | 21:01:41 <wm-labs-meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. |
| 4 | 21:01:41 <wm-labs-meetbot> The meeting name has been set to 'archcom_meeting_about_multi_content_revisions__t107595_' |
| 5 | 21:01:41 <stashbot> T107595: [RFC] Multi-Content Revisions - https://phabricator.wikimedia.org/T107595 |
| 6 | 21:01:41 <wm-labs-meetbot`> Meeting started Wed Sep 21 21:01:41 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot. |
| 7 | 21:01:41 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. |
| 8 | 21:01:41 <wm-labs-meetbot`> The meeting name has been set to 'archcom_meeting_about_multi_content_revisions__t107595_' |
| 9 | 21:01:42 <DanielK_WMDE> hm, I'm still wondering whether we should go for the details questions first to get stuff done, or the broader questions first, for guidance... |
| 10 | 21:02:11 <robla> #topic Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) \| Logs: https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ |
| 11 | 21:02:53 <robla> hi everyone |
| 12 | 21:03:08 <DanielK_WMDE> robla: do you think it would be ok to talk about schema details for half an hour, and the cut off and move on to discussing the migration? |
| 13 | 21:03:55 <robla> DanielK_WMDE: possibly. what are you hoping we accomplish in today's conversation? |
| 14 | 21:04:19 <DanielK_WMDE> 1) sort out the remaining details of what the schema should look like |
| 15 | 21:04:35 <DanielK_WMDE> 2) get feedback about whether the migration plan is sane |
| 16 | 21:05:21 <Scott_WUaS> (Hello:) |
| 17 | 21:06:02 <robla> DanielK_WMDE: I'm assuming we're not ready to actually resolve the schema in the course of this hour though, correct? |
| 18 | 21:06:39 <DanielK_WMDE> not as a final decision. i do hope to get oppinions on my questions. |
| 19 | 21:06:42 <TimStarling> that plan sounds good to me |
| 20 | 21:06:52 <DanielK_WMDE> and perhaps even answers :) |
| 21 | 21:07:06 <DanielK_WMDE> so, the most important question regarding the schema is whether we should add one layer of indirection, or two. Adding only one layer of indirection means repeating the meta-data about the content of each slot for every revision. |
| 22 | 21:07:44 <Scott_WUaS> Can you please post an example URL - re "The idea of this RFC is to allow multiple Content objects to be associated with a single revision (one per "slot"), resulting in multiple content "streams" for each page"? In what ways are Wikidata Q items involved here? |
| 23 | 21:07:47 <DanielK_WMDE> Doing it that way keeps the schema simpler, but means a lot of redundand data. The basic schema is then: |
| 24 | 21:08:02 <DanielK_WMDE> Scott_WUaS: they are not involved |
| 25 | 21:08:22 <Scott_WUaS> Thanks |
| 26 | 21:08:24 <DanielK_WMDE> The "basic" version of the schema looks like this: |
| 27 | 21:08:26 <DanielK_WMDE> [page] --page_current--> [revision] <--cont_revision-- [content] --cont_address--> (text|external) |
| 28 | 21:08:38 <Scott_WUaS> ok |
| 29 | 21:09:08 <DanielK_WMDE> As an alternative, we can add another table, the "slot" table, to tell us which content belongs to which revision, so the content-meta-data can be re-used for multiple (typically consecutive) revisions |
| 30 | 21:09:44 <DanielK_WMDE> so if we store categories in a separate slot, and the categories are nto touched by 10 edits, we would recycle the meta-data about the content of the category slot 10 times. |
| 31 | 21:09:51 <DanielK_WMDE> the schema would look like this: |
| 32 | 21:09:57 <DanielK_WMDE> [page] --page_current--> [revision] <--slot_revision-- [slots] --slot_content--> [content] --cont_address--> (text|external) |
| 33 | 21:10:10 <TimStarling> I guess we have no jynus this week |
| 34 | 21:10:12 <Scott_WUaS> (DanielK_WMDE: Is there an existing example URL which you may develop further?) |
| 35 | 21:11:01 <DanielK_WMDE> schema details: https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Database_Schema |
| 36 | 21:11:03 <DanielK_WMDE> #link https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Database_Schema |
| 37 | 21:11:15 <DanielK_WMDE> #link https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Re-using_Content_Rows |
| 38 | 21:11:31 <Scott_WUaS> thanks |
| 39 | 21:11:48 <DanielK_WMDE> TimStarling: looks like it... who else would have an oppinion on the schema? |
| 40 | 21:12:01 <robla> DanielK_WMDE: is there an asynchronous conversation that is still moving forward? |
| 41 | 21:12:20 <DanielK_WMDE> no. not with me anywway |
| 42 | 21:12:52 <TimStarling> I can try to be surrogate jynus and raise a few of his points |
| 43 | 21:13:19 <brion> great :) |
| 44 | 21:13:20 <DanielK_WMDE> TimStarling: that would be helpful. |
| 45 | 21:13:24 <robla> my fear is that most of the asynchronous conversation has been in private email. that makes it hard to then hope for a good public IRC conversation |
| 46 | 21:13:29 <robla> TimStarling: thanks! |
| 47 | 21:13:35 <TimStarling> surrogate jynus says: you want to store media info in a slot. Let's have a media_info table |
| 48 | 21:13:47 <brion> yeah need to distill it down, the email convos were pretty high-bandwidth :) |
| 49 | 21:13:48 <TimStarling> then that table will be small and easy to handle |
| 50 | 21:14:04 <SMalyshev> DanielK_WMDE:I wonder if it's good to hold current and old content in the same place... |
| 51 | 21:14:14 <DanielK_WMDE> TimStarling: what would the media_info table contain? the actual json blob? |
| 52 | 21:14:17 <TimStarling> in history, present a union between revision and media_info if users really really want that |
| 53 | 21:14:34 <TimStarling> unclear |
| 54 | 21:14:38 <brion> SMalyshev: that's actually a good point leading -> to ideas about partitioning 'hot' and 'cold' data. for another time probably but we need to be thinking about it at some point |
| 55 | 21:14:54 <SMalyshev> if we're already refactoring DB structure... |
| 56 | 21:14:59 <DanielK_WMDE> SMalyshev: so far, the answer looks like yes: moving data between tables when the current version becomes an archived version is a major pain. |
| 57 | 21:15:10 <tgr> (nitpick: if the slot table is only used as a many-to-many binding between revision and content, can we just call it revision_content? it's hard to keep up with the terminology) |
| 58 | 21:15:34 <DanielK_WMDE> SMalyshev: we (tim, mostly) moved main storage away from that 10 years ago, we are now planning to mave image meta data away from it too. but it's a possible parameter for partitioning. |
| 59 | 21:15:43 <James_F> tgr: I think the idea is that some of the slots are revision_content_derivedcontent thought. |
| 60 | 21:15:52 <bblack> from my perspective, what I'm really lacking about this MCR thing is any context on its higher-level purpose and utility. All of the details are deep, but no simple big picture about why we're doing this. |
| 61 | 21:16:06 <James_F> tgr: E.g. revision 3 -> wikitext -> JSON representation of the template or whatever. |
| 62 | 21:16:12 <DanielK_WMDE> tgr: i was called that, I changed it to be in line with the use of "slots" in the conceptual model. i don't care about the name |
| 63 | 21:16:27 <brion> bblack: at a high level, we want to be able to break things out of wikitext into structured data that's still atomically versioned with the wikitext |
| 64 | 21:16:35 <SMalyshev> DanielK_WMDE: what's the idea behind reusable content? I.e. is that useful for something? |
| 65 | 21:16:46 <bblack> brion: higher-level than that :) |
| 66 | 21:16:56 <TimStarling> bblack: there's a list of use cases |
| 67 | 21:17:00 <brion> :) |
| 68 | 21:17:02 <bblack> I mean, wikitext does have some kind of structure. a single content can hav einternal structure in general |
| 69 | 21:17:07 <TimStarling> https://www.mediawiki.org/wiki/Multi-Content_Revisions#Use_Cases |
| 70 | 21:17:21 <James_F> bblack: "We want to move awat from MW's 1:1 relationship between "page" and "content"." |
| 71 | 21:17:24 <James_F> Err. Away. |
| 72 | 21:17:24 <DanielK_WMDE> TimStarling: that "unclear" bit is the problem i have with discussing the "store in dedicated table" option. how will the content be versioned? |
| 73 | 21:17:47 <DanielK_WMDE> bblack: https://www.mediawiki.org/wiki/Multi-Content_Revisions#Use_Cases |
| 74 | 21:17:48 <TimStarling> DanielK_WMDE: it would be linked to page and have its own timestamp |
| 75 | 21:17:59 <TimStarling> like a clone of revision |
| 76 | 21:18:08 <DanielK_WMDE> TimStarling: and it's own edit comment, reference to user, and so on? |
| 77 | 21:18:09 <James_F> TimStarling: So we'd JOIN on string-matched timestamps? |
| 78 | 21:18:13 <TimStarling> yes |
| 79 | 21:18:15 <James_F> Eww. |
| 80 | 21:18:22 <TimStarling> no |
| 81 | 21:18:30 <TimStarling> yes to DanielK_WMDE, no to James_F |
| 82 | 21:18:34 <James_F> Ah. |
| 83 | 21:18:36 <brion> a related alternative would be to have each 'slot' live in a separate table, but all use the same revision key with metadata in revision. thus text edits would (or could) live in a separate table from revision too |
| 84 | 21:18:37 <DanielK_WMDE> TimStarling: so we would dublicate the revision table for each kind of content, and use unions everywhere we want to list revisions? |
| 85 | 21:18:39 <James_F> So it would have the revision_id in it? |
| 86 | 21:18:46 <brion> but you'd have a consistent revision_id and place to search on |
| 87 | 21:19:16 <brion> but there's some benefit in consistency and normalization, especially when we need to bulk-fetch data for dumps or otherwise handle them opaquely |
| 88 | 21:19:22 <TimStarling> at the SQL level you'd have several totally distinct revision concepts, like how oldimage and revision are separate now |
| 89 | 21:19:34 <DanielK_WMDE> TimStarling: i can't see that working, it sounds hideously complex to me. but maybe i'm just not seeing the elegance of it all. |
| 90 | 21:19:35 <robla> #chair robla brion DanielK_WMDE TimStarling |
| 91 | 21:19:35 <wm-labs-meetbot> Current chairs: DanielK_WMDE TimStarling brion robla |
| 92 | 21:19:35 <wm-labs-meetbot`> Current chairs: DanielK_WMDE TimStarling brion robla |
| 93 | 21:19:37 <TimStarling> at the application layer these may optionally be merged by a UNION |
| 94 | 21:19:42 <Scott_WUaS> (what are the implications for multiple languages and translation here in Multi-Content Revisions, if any?) |
| 95 | 21:20:01 <DanielK_WMDE> brion: so, have one revision table, but basically one "content" table per slot? |
| 96 | 21:20:02 * robla steps afk for 2 minutes |
| 97 | 21:20:18 <brion> Scott_WUaS: interesting question. one _could_ store multiple wikitext Content items as well, one per language |
| 98 | 21:20:19 <DanielK_WMDE> brion: that's more doable, but still needs big jons or unions. |
| 99 | 21:20:23 <James_F> Scott_WUaS: "Complicated". There are options to fundamentally re-work Translate and parallel translation based on MCR, but this is a bit out of scope. |
| 100 | 21:20:29 <brion> though i'm not sure it's ideal for the way translations get versioned |
| 101 | 21:20:33 <James_F> brion: *cough*DOM-based translation*cough* |
| 102 | 21:20:36 <bblack> FWIW, I think most of those use-cases sound like metadata more than parallel alternative content, except for the ones that seem like they could just be separate objects (e.g. template+css), or embedded documentation |
| 103 | 21:20:48 <Scott_WUaS> thanks |
| 104 | 21:20:59 <brion> bblack: the big reason i want MCR for 'separate objects' is atomic versioning |
| 105 | 21:21:11 <TimStarling> having a high-level abstraction in MW around several similar tables is an idea that was mentioned in that book jynus was passing around |
| 106 | 21:21:13 <brion> template + css, gadget js+css, etc |
| 107 | 21:21:20 <TimStarling> you know, feature table and bug table |
| 108 | 21:21:26 <James_F> bblack: File description (wikitext), meta-data (JSON), and file (pointer to the BLOB) versioned together is the ambition. |
| 109 | 21:21:30 <DanielK_WMDE> TimStarling, brion: can we assume that the revision or content tables that would exist per slot would all contain *exactly* the same fields? |
| 110 | 21:21:42 <TimStarling> no |
| 111 | 21:22:05 <brion> i think if we had separate tables they'd explicitly want to be different, otherwise it's only a partitioning mechanism |
| 112 | 21:22:15 <brion> but that changes the interfaces |
| 113 | 21:22:17 <DanielK_WMDE> brion: that's what i'm thinking |
| 114 | 21:22:19 <TimStarling> if they're exactly the same then you have sharding, and jynus doesn't really seem keen on sharding |
| 115 | 21:22:22 <DanielK_WMDE> i just don't see how they would be different |
| 116 | 21:22:30 <TimStarling> I'll switch back from being pseudo-jynus to TimStarling for a second |
| 117 | 21:22:30 <brion> and for data where the structured data would go straight into a table that makes sense |
| 118 | 21:22:37 <TimStarling> let's do sharding, I like sharding |
| 119 | 21:22:38 <brion> for where everything's a big blob, i don't see the benefit of splitting |
| 120 | 21:22:39 <brion> :) |
| 121 | 21:22:52 <brion> what's your preferred axis to shard on here tim? |
| 122 | 21:22:58 <James_F> TimStarling: Do we have a plan for stopping the current tables from getting "too long" other than sharding? (Ignoring this change, which might make the rate of growth faster.) |
| 123 | 21:23:18 <DanielK_WMDE> TimStarling: yes, +1 for sharding/partitioning. let's have an RFC about that |
| 124 | 21:23:33 <brion> yups |
| 125 | 21:24:23 <TimStarling> well, the existing recentchanges partitioning hack splits on user ID |
| 126 | 21:24:27 <bblack> brion: to level do you expect it to be atomic? you'd still be fetching js+css as 2x http fetches, right? it seems like there are ways to solve the problem of always fetching synced revs of such things simpler... |
| 127 | 21:24:31 <brion> (i like the idea of a 'hot'/'cold' separation with a union-like interface, with a consistent revision id lineage so most things won't notice the difference other than potentially issuing two queries and combining them) |
| 128 | 21:24:35 <robla> #info discussion of sharding for much of the first part of the meeting |
| 129 | 21:24:40 <TimStarling> which optimises for contributions queries |
| 130 | 21:24:43 <DanielK_WMDE> brion: re "everythign is a big blob": if we want to move away from that, we need a document oriented db. the content models we have would be a pain to model on an rdbms. not to mention that they would create absolutely humangous tables. |
| 131 | 21:24:45 <James_F> I've been lazily assuming that at some point we'd shard revision based on something (modulo the page_id?) but I don't know what's ideal. |
| 132 | 21:24:52 <brion> bblack: http? oh no i mean inside, like the parser |
| 133 | 21:25:02 <brion> or the html that specified which js/css to load |
| 134 | 21:25:36 <brion> anyway i think we should address sharding/partitioning later, more explicitly |
| 135 | 21:25:37 <DanielK_WMDE> i would prefer to shard by mod(page_id). or timestamp blocks. |
| 136 | 21:25:56 <James_F> Yeah, let's fork that to another RfC. |
| 137 | 21:26:12 <TimStarling> one possibility is to duplicate the revision table: once with user-based sharding (for contributions), and again with page/timestamp sharding (for history) |
| 138 | 21:26:22 <TimStarling> denormalize the revision table, in other words |
| 139 | 21:26:22 <DanielK_WMDE> so, if that's for another rfc, can we move forward with this one? |
| 140 | 21:26:44 <brion> bblack: so the alternative to atomic updates of multiple content blobs in one revision is to build another versioning abstraction on top of multiple pages |
| 141 | 21:26:59 <brion> bblack: which is certainly possible too |
| 142 | 21:27:08 <DanielK_WMDE> TimStarling: basically, duplicate it. yea. |
| 143 | 21:27:23 <DanielK_WMDE> so, key question: is is ok to maintain the meta-data for all slot content in a single table? |
| 144 | 21:27:35 <DanielK_WMDE> with sharding to be descussed? |
| 145 | 21:27:44 <TimStarling> I think the key question is project order: does sharding/partitioning block MCR? |
| 146 | 21:27:45 <bblack> brion: or question why we're trying to version-sync css+js inside wiki articles in the first place... |
| 147 | 21:27:51 <brion> DanielK_WMDE: i say yes, as long as we keep it compact and have a future plan to shard that won't explode based on our changes :D |
| 148 | 21:28:12 <brion> bblack: well "scratch mediawiki, just use github" is a third option ;) |
| 149 | 21:28:30 <DanielK_WMDE> TimStarling: that's also an important question, yes, though i think we can decide on the schema without knowing whether implementation is blocked on sharding |
| 150 | 21:28:34 <TimStarling> I suspect jynus is on the verge of vetoing MCR until we have better scalability |
| 151 | 21:28:43 <brion> it seems to be ok to have _lots of rows_ (tall tables) as long as those table rows are small (narrow) |
| 152 | 21:29:09 <TimStarling> data size is a relevant metric, yes |
| 153 | 21:29:16 <DanielK_WMDE> TimStarling: i'm fine with him vetoing implementation on this grounds. but i need to know whether and how i should change the design. |
| 154 | 21:29:36 <DanielK_WMDE> implementaion o nthe cluster = deployment |
| 155 | 21:29:38 <TimStarling> for example, you have to copy all the data in a table during ALTER TABLE, and that is becoming a problem |
| 156 | 21:29:54 <TimStarling> remember it was a problem in the olden days too |
| 157 | 21:30:09 <bblack> brion: or any of the thousands of saner ways to develop->deploy css and js than "do it inside the wiki it's meant to operate on, shoe-horning it in as if it's like article content, and then remodel the wiki software to support that use case poorly" |
| 158 | 21:31:01 <DanielK_WMDE> bblack: if you want it to be user-maintained, i don't really see an alternative. but the css/js use case isn't really at the focus of this. |
| 159 | 21:31:06 <bblack> (not entirely fair, but as fair as your github retort) |
| 160 | 21:31:22 <brion> bblack: oh sure, you're not wrong. :) there's tradeoffs in all these directions |
| 161 | 21:31:41 <brion> and honestly using a git-oriented backend for code? not an awful ideal at all |
| 162 | 21:31:53 <DanielK_WMDE> i'm stilly trying to find out whether i can go ahead with implementing the revision<-slot->content schema |
| 163 | 21:32:04 <James_F> brion: It's on the backlog. Let's not get further distracted from the RfC. ;-) |
| 164 | 21:32:07 <brion> but even if we broke out gadgets/userscripts we've got these on-wiki data objects :D |
| 165 | 21:32:10 <brion> yep |
| 166 | 21:32:12 <DanielK_WMDE> or whether all work on this needs to rest until we have an rfc on optimizing revision storage & sharding |
| 167 | 21:32:24 <TimStarling> I don't see how you can implement it if you can't deploy it |
| 168 | 21:32:25 <DanielK_WMDE> or whether there is a concrete request to change the db schema i propos |
| 169 | 21:32:35 <SMalyshev> I get an impression that jynus has to answer that :) |
| 170 | 21:33:17 <brion> jynus is always reluctant to use the veto power we keep wanting to give him :) |
| 171 | 21:33:18 <brion> be gentle |
| 172 | 21:33:27 <DanielK_WMDE> TimStarling: we can get the code ready for deployment while we are also working on, or deciding on, optimization strategies for revision storage. |
| 173 | 21:33:37 <TimStarling> I don't think we're going to get on board with jynus's idea of splitting the revision concept |
| 174 | 21:33:54 <TimStarling> but I think we should work by consensus |
| 175 | 21:34:03 <brion> *nod* |
| 176 | 21:34:24 <robla> is jynus's idea spelled out somewhere? |
| 177 | 21:34:36 <DanielK_WMDE> so if we want consensus but won't get on board with his idea, then we need to convince him?... |
| 178 | 21:34:45 <brion> we've got some bits of discussions, no concrete alt proposal |
| 179 | 21:34:59 <TimStarling> robla: no, not really, he was reluctant to dive in and do fully worked schema |
| 180 | 21:35:12 <TimStarling> DanielK_WMDE: right |
| 181 | 21:35:35 <DanielK_WMDE> i have tried and failed |
| 182 | 21:36:38 <robla> DanielK_WMDE: I think one thing that may be slowing this conversation down is it getting too bogged down in details |
| 183 | 21:36:59 <robla> there's a *lot* to sort through here: https://www.mediawiki.org/wiki/Multi-Content_Revisions |
| 184 | 21:37:04 <TimStarling> I don't want to get into detail about tactics in this discussion |
| 185 | 21:37:40 <TimStarling> how would it work to implement it but not deploy it? would you be able to have a feature flag in MW? or would it have to be a branch? |
| 186 | 21:37:52 <DanielK_WMDE> robla: yes, that's why I announced only the schema bit as today's topic: https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data |
| 187 | 21:37:58 <James_F> Branch or unmerged commit. |
| 188 | 21:38:21 <DanielK_WMDE> robla: that's already quite a bit, but I think it is managable. |
| 189 | 21:38:32 <bblack> where are we meant to have the bigger discussion? I just don't get artchitecting the details before having some consensus that this is the right model for some real use-cases. The use-cases section mentions its own speculative nature, many of them are more metadata than parallel separate content, which is an entirely simpler case to handle. the rest are questionable, IMHO... |
| 190 | 21:38:58 <bblack> maybe that's for my lack of information, but still |
| 191 | 21:39:01 <DanielK_WMDE> TimStarling: We will need feature flags for the migration/transition anyway. So, yes. |
| 192 | 21:39:22 <TimStarling> it would be nice to have say two initial use cases which will be initially implemented |
| 193 | 21:39:31 <DanielK_WMDE> TimStarling: hopefully, if/the/else cruft can be kept to a minimum be swapping in alternative implementation of the relevant components. |
| 194 | 21:40:13 * brion hmms |
| 195 | 21:40:18 <DanielK_WMDE> TimStarling: thw first two in the list: MediaInfo and PageAssessments. |
| 196 | 21:40:29 <James_F> That could work. |
| 197 | 21:40:39 <DanielK_WMDE> MassMessage is also a hot candidate I think |
| 198 | 21:40:44 <James_F> And TemplateData. ;-) |
| 199 | 21:40:55 <James_F> (As it's so simple.) |
| 200 | 21:41:00 <brion> ok i think i'm going to try fleshing out an alt proposal along some, but not all, of jynus and surrogate-jynus's lines, and we can just compare that |
| 201 | 21:41:06 <TimStarling> presumably we will have an MCR-aware API, and all the if/else will be in the implementation of that API |
| 202 | 21:41:08 <brion> it'll be good to have some key use cases to go along with that |
| 203 | 21:41:29 <TimStarling> RevisionLookup |
| 204 | 21:41:33 <DanielK_WMDE> bblack: if it's editable and versioned, it's not meta-data |
| 205 | 21:41:50 <brion> cause if we do concentrate on cases where the secondary slots are special kinds of data, maybe extra tables aren't too awful. but maybe they are ;) |
| 206 | 21:42:02 <DanielK_WMDE> TimStarling: yes, exactly |
| 207 | 21:42:34 <TimStarling> maybe we should start moving towards rev_id being opaque rather than an auto-increment integer |
| 208 | 21:42:55 <AaronSchulz> but still an integer? |
| 209 | 21:42:57 <DanielK_WMDE> brion: i'm not thinking of secondary (derived) slots any more. just primary user editable content. |
| 210 | 21:43:06 <TimStarling> a UUID might make more sense if it is sharded |
| 211 | 21:43:07 <brion> right, sorry wrong term :) |
| 212 | 21:43:13 <brion> i mean non-main-wikitext slots |
| 213 | 21:43:25 <TimStarling> but yes, still an integer initially |
| 214 | 21:43:30 <TimStarling> but maybe type-hinted as a string |
| 215 | 21:43:34 <DanielK_WMDE> TimStarling: or a time-uuid. gabriel loves those. |
| 216 | 21:43:38 <brion> TimStarling: for multi-master insert that can be important yes |
| 217 | 21:43:48 <DanielK_WMDE> But they are big. We are trying to make that table smaller, right? |
| 218 | 21:43:58 <brion> bigints are smaller :) |
| 219 | 21:44:00 <DanielK_WMDE> (but we are discussing the revision table again) |
| 220 | 21:44:02 <brion> i will just warn about Bigints and the JavaScript/node 53-bit limit though |
| 221 | 21:44:16 <tgr> brion: extra tables would need some PHP-layer abstraction on top of our current DB abstraction, for all code that needs to search or iterate all content. That seems scary. |
| 222 | 21:44:29 <DanielK_WMDE> tgr: very. |
| 223 | 21:44:30 <AaronSchulz> reminds me of https://gerrit.wikimedia.org/r/#/c/16696/20/includes/rdbstore/RDBStore.php |
| 224 | 21:44:45 * AaronSchulz almost forgot about that, haha |
| 225 | 21:44:49 <brion> tgr: yeah, at least some would need to add to the tables joined on things. others would not actually need to touch those tables, though, and would only care about what's in revision i think |
| 226 | 21:44:50 <TimStarling> brion: you are worried that we will exceed 2^53 rows in a table? ;) |
| 227 | 21:45:07 <AaronSchulz> (of course half of that was wild experimentation that would never be used) |
| 228 | 21:45:12 <brion> depends how fine-grained we make editing ;) |
| 229 | 21:45:16 <James_F> AaronSchulz: PTSD flashbacks to that code? ;-) |
| 230 | 21:45:17 <DanielK_WMDE> AaronSchulz: that's basically home grown partitioning, right? |
| 231 | 21:45:26 <SMalyshev> non-sequential revids may be problematic as it'd be impossible to know the order |
| 232 | 21:45:37 <brion> mianly i was thinking if we do something clever like a 64-bit mini uuid |
| 233 | 21:46:02 <DanielK_WMDE> i'm getting worried that I'm stranded with this with no way to actively move forward. |
| 234 | 21:46:11 <brion> yeah :( |
| 235 | 21:47:00 <DanielK_WMDE> can i at least get some feedback on "super tall content table" vs "not-so-tall content table + super tall slots table"? |
| 236 | 21:47:17 <brion> i am strongly in favor of super tall slots table |
| 237 | 21:47:22 <SMalyshev> I like the second one better |
| 238 | 21:47:23 <DanielK_WMDE> as in https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Re-using_Content_Rows |
| 239 | 21:47:24 <brion> lets us keep the content table much smaller |
| 240 | 21:47:26 <TimStarling> bblack: maybe you can discuss your concerns on https://www.mediawiki.org/wiki/Talk:Multi-Content_Revisions ? |
| 241 | 21:47:43 <SMalyshev> if we're going to have huge table, it's better to have it as "narrow" as possible |
| 242 | 21:47:46 <tgr> basically this proposal is blocked on deciding how to handle very tall tables, which is something that needs to be decided soon anyway, right? |
| 243 | 21:48:00 <tgr> so maybe just give up for now and make that decision happen as soon as possible? |
| 244 | 21:48:01 <DanielK_WMDE> ok. how about I work on some strawman code that allows us to look at the schema with some data in it, maybe on labs? |
| 245 | 21:48:13 <DanielK_WMDE> would that help, or would it be a waste of time? |
| 246 | 21:48:14 <subbu> is there a wiki page / talk page / phab task that discusses ops concerns with the MCR proposal? |
| 247 | 21:48:30 <robla> subbu: I'm not aware of any |
| 248 | 21:48:43 <DanielK_WMDE> tgr: i'm not sure there is a generic answer to that question. it may very much depend on the table. |
| 249 | 21:49:41 <DanielK_WMDE> subbu: there is one comment by jynus: https://www.mediawiki.org/wiki/Topic:Tb6fok3z43ar16fe |
| 250 | 21:49:58 <DanielK_WMDE> i frankly can't extract much guidance from it |
| 251 | 21:50:56 <TimStarling> "I will create an alternative one" -- maybe we just need to nag jynus to write that |
| 252 | 21:51:03 <robla> thanks for the refresher about the link, DanielK_WMDE |
| 253 | 21:51:16 <DanielK_WMDE> #link https://www.mediawiki.org/wiki/Topic:Tb6fok3z43ar16fe |
| 254 | 21:51:27 <DanielK_WMDE> TimStarling: please do, i'm quite curious |
| 255 | 21:52:01 <robla> well, on the tactics front, I'm hoping that ArchCom doesn't become NagCom ;-) |
| 256 | 21:52:08 <brion> heh |
| 257 | 21:52:21 <DanielK_WMDE> or ArgCom... |
| 258 | 21:52:47 <robla> I think it may be a useful conversation starter to *attempt* to come up with what jynus is shooting for |
| 259 | 21:53:00 <AaronSchulz> DanielK_WMDE: I'm up for discussing partitioning, since I still remember thinking about that a lot in the past. My inclination is tall-and-narrow metadata => sharded blobs though |
| 260 | 21:53:21 <DanielK_WMDE> robla: i honestly can't imagine how it would work. if i could, i would have propsoed it. |
| 261 | 21:53:50 <DanielK_WMDE> AaronSchulz: yes, i'm with you there. And I also think we should discuss sharding. |
| 262 | 21:53:53 <TimStarling> I mentioned some ideas about making revision narrower, jynus was receptive to those |
| 263 | 21:54:02 <DanielK_WMDE> AaronSchulz: who's going to drive that conversation? |
| 264 | 21:54:40 <TimStarling> like splitting out rev_comment, you know we have a bug to make rev_comment be larger than 255 bytes |
| 265 | 21:54:56 * AaronSchulz shrugs...probably would be good to know about what parameters jynus wants |
| 266 | 21:55:00 <brion> yeah rev_comment and rev_user_text are easy wins |
| 267 | 21:55:03 <Scott_WUaS> (Hoping all can keep this helpful conversation going - https://www.mediawiki.org/wiki/Topic:Tb6fok3z43ar16fe ) |
| 268 | 21:55:43 <DanielK_WMDE> AaronSchulz: yes, that would be good |
| 269 | 21:56:04 <DanielK_WMDE> TimStarling, brion: so, who's going to write an rfc about optimizing row size in rrevision? |
| 270 | 21:56:18 <James_F> TimStarling: Would we make rev_comment just another slot? |
| 271 | 21:56:41 <brion> i can do that if TimStarling isn't excited about it, we have some good ideas from last week's offline discussion |
| 272 | 21:57:13 * brion compacts ALL the rows! |
| 273 | 21:57:19 <DanielK_WMDE> brion: yay :) |
| 274 | 21:57:27 <TimStarling> ok brion, compact away, I will comment on it |
| 275 | 21:57:31 <DanielK_WMDE> i'm happy to help and give input, but i don't see me driving this |
| 276 | 21:57:33 <DanielK_WMDE> too much on my plate |
| 277 | 21:57:51 <brion> no worries |
| 278 | 21:57:55 <DanielK_WMDE> the problem with pausing MCR is: i have mde room for this in my schedule *now* |
| 279 | 21:57:56 <robla> #info 14:55:00 <brion> yeah rev_comment and rev_user_text are easy wins |
| 280 | 21:58:06 <DanielK_WMDE> if we drop this for 3 months, I have *no* idea when i can get back on working on it |
| 281 | 21:58:11 <brion> great i'll write those up next couple days |
| 282 | 21:58:15 <DanielK_WMDE> it also pushes back the sche4dule for structured commons |
| 283 | 21:58:31 <SMalyshev> do we need MCR for structured commons? |
| 284 | 21:58:39 <Scott_WUaS> (Thanks, All!) |
| 285 | 21:58:51 <SMalyshev> I mean need like "no way we can do structured commons without it"? |
| 286 | 21:58:56 <marktraceur> It would be super if we could not delay that again... |
| 287 | 21:59:06 <DanielK_WMDE> #info re "super tall content table" vs "not-so-tall content table + super tall slots table": <brion> i am strongly in favor of super tall slots table <SMalyshev> I like the second one better |
| 288 | 21:59:43 <brion> #info brion to write up additional RfC on compacting rows in revision table (should apply with or without MCR) |
| 289 | 22:00:06 <robla> ok...should we end the official part of this meeting on that? |
| 290 | 22:00:09 <DanielK_WMDE> brion: will partitioning be part of that? |
| 291 | 22:00:23 * robla plans to hit #endmeeting in 120 seconds |
| 292 | 22:00:36 <DanielK_WMDE> SMalyshev: pretty much, yes. |
| 293 | 22:00:41 <brion> DanielK_WMDE: not explicitly but i'll mention some related concerns |
| 294 | 22:01:06 <brion> can expand to that if we decide we must super-prioritize it |
| 295 | 22:01:12 <DanielK_WMDE> SMalyshev: at least if we want to stick to the product requirements as set out by the WMF back in the day. |
| 296 | 22:01:17 <robla> brion, thanks for taking that on! |
| 297 | 22:01:36 <TimStarling> DanielK_WMDE: well, you say you can implement it with a feature switch, which should be relatively uncontroversial |
| 298 | 22:01:36 <brion> :D |
| 299 | 22:01:36 <subbu> so, reading that talk page topic, iiuc, jynus is objecting to using a single unified table for all slots and prefers different tables for different slots? |
| 300 | 22:02:09 <robla> we can continue the conversation in #wikimedia-tech for those that want to |
| 301 | 22:02:27 <robla> thanks all! |
| 302 | 22:02:32 <robla> #endmeeting |
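
On brion's warning in the log about bigints and the JavaScript/node 53-bit limit: JavaScript numbers are IEEE 754 doubles, which represent integers exactly only up to 2^53. Python floats use the same format, so the effect can be sketched in Python. The helper name `survives_double` is hypothetical and only for illustration.

```python
# Same value as JavaScript's Number.MAX_SAFE_INTEGER.
MAX_SAFE = 2**53 - 1


def survives_double(n: int) -> bool:
    """Return True if n is unchanged after a round trip through a double."""
    return int(float(n)) == n
```

An auto-increment `rev_id` stays far below this limit, so `survives_double(MAX_SAFE)` is true. A full 64-bit identifier such as `2**63 - 1` does not survive the round trip, which is why a client that parses IDs as plain numbers would need them passed as strings.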
Other meetings
| Architecture meetings | ||
|---|---|---|
| 13:00 PT ArchCom Planning Meetings | upcoming | all since 2016-03-30 |
| 14:00 PT ArchCom-RFC Meetings | upcoming | all since 2015-09-09 |