Agenda
- Location: #wikimedia-office IRC channel
- Meeting type: Consensus for final comment (on the three options).
- Time: Weekly, Wednesday 21:00 UTC (2pm PDT, 23:00 CEST)
- Topic: T105652 Content model storage
In this meeting, we discussed the storage of content model and format in the database. A proposal (T105652) was approved a year ago, but never implemented. Three options were discussed:
- Implement T105652: RfC: Content model storage as originally approved a year ago, with new columns in the page, revision, and archive tables.
- Implement a modified version, T142980: RFC: Create a content meta-data table, with one extra table but no new columns, which also caters to the needs of multi-content revisions.
- Take a step back and reconsider.
We agreed that Daniel should continue to develop the plan articulated in T142980.
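As a rough illustration of the shape of that plan, the sketch below uses Python's `sqlite3` module to build a single content meta-data table with one row per (revision, slot). Only `cont_revision`, `cont_role` and `cont_address` are names that came up in the discussion. `cont_id`, `cont_model`, `cont_format` and the helper functions are hypothetical, and SQLite merely stands in for the production database. This is not the schema proposed in T142980.

```python
import sqlite3

# Hypothetical schema: integer model/format ids live in one extra table,
# so page, revision and archive need no new columns.
SCHEMA = """
CREATE TABLE content (
    cont_id       INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (assumed name)
    cont_revision INTEGER NOT NULL,                   -- revision this content belongs to
    cont_role     INTEGER NOT NULL DEFAULT 1,         -- slot; always 1 until MCR exists
    cont_model    INTEGER NOT NULL,                   -- integer id of the content model
    cont_format   INTEGER NOT NULL,                   -- integer id of the content format
    cont_address  TEXT,                               -- where the blob is stored
    UNIQUE (cont_revision, cont_role)
);
"""


def open_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_slot(conn, revision, model, fmt, role=1, address=None):
    """Insert one content row and return its surrogate id."""
    cur = conn.execute(
        "INSERT INTO content "
        "(cont_revision, cont_role, cont_model, cont_format, cont_address) "
        "VALUES (?, ?, ?, ?, ?)",
        (revision, role, model, fmt, address),
    )
    conn.commit()
    return cur.lastrowid


def slots_for(conn, revision):
    """Return (role, model) pairs for a revision, ordered by role."""
    rows = conn.execute(
        "SELECT cont_role, cont_model FROM content "
        "WHERE cont_revision = ? ORDER BY cont_role",
        (revision,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_db()
    add_slot(db, revision=10, model=1, fmt=1)          # main slot
    add_slot(db, revision=10, model=2, fmt=1, role=2)  # a second slot, once MCR exists
    print(slots_for(db, 10))
```

Putting the unique key on (`cont_revision`, `cont_role`) from the start means a second slot can be added later without changing the key of an already large table, which is the concern raised in the discussion.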
Meeting summary
- Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ (robla, 21:03:15)
- LINK: https://phabricator.wikimedia.org/E261 (robla, 21:03:31)
- LINK: https://phabricator.wikimedia.org/T105652 (robla, 21:03:47)
- LINK: https://phabricator.wikimedia.org/T105652 (DanielK_WMDE_, 21:03:48)
- LINK: https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_, 21:03:54)
- LINK: https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_ 's revised proposal) (robla, 21:04:16)
- primary question to resolve: do a) legoktm 's original T105652 b) DanielK_WMDE_ 's modification T142980 c) none of the above (stay with status quo) (robla, 21:06:41)
- 14:08:34 <legoktm> my original plan wasn't to do joins, but to store the id => string mapping in a cache like APC since it would be mostly static once initialized (robla, 21:09:24)
- ar_rev_id is not fully populated on enwiki. we can assign fresh revision ids though (and bump rev_id accordingly) (DanielK_WMDE_, 21:15:31)
- <DanielK_WMDE_> we will need to construct legacy rows eventually, when we move the blob address into the content table. (brion, 21:19:19)
- 14:16:31 <jynus> the main issues happen when dataset doesn't fit into memory, which is exactly what I blocked (as the initial rolling in was going to do) (robla, 21:19:28)
- Discussion of DanielK_WMDE_'s question: "can I assume that there is still consensus on representing content model and format as integers, and have a mappoing in the db and in memory?" (robla, 21:23:20)
- <brion> jynus: I always hear enums are cheap to change. Lies? :) <jynus> they are cheap to add [...] but if you want to delete, it would be one of our most complex changes (robla, 21:26:14)
- re managing ids for content models etc: on a cache miss, check the db. if the db doesn't have it, add it. (DanielK_WMDE_, 21:26:26)
- tentative agreement to content model and format as int; we rule out option (c) then (DanielK_WMDE_, 21:28:38)
- <jynus> I think a) -> b) is easy to do, why do we want to do b directly (genune question) (robla, 21:30:18)
- <DanielK_WMDE_> we have not discussed whether the new table should just have the minimum fields for now, or the full set needed for MCR <jynus> adding new columns on a small table with low traffic is easy (robla, 21:53:42)
- 14:58:08 <TimStarling> can I just repeat that I am putting my 2c in for MCR fields in the initial content table, with slot=1 always (robla, 21:59:12)
Meeting ended at 22:04:31 UTC.
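The id management noted in the summary (keep the id => string mapping in a cache; on a cache miss, check the db; if the db doesn't have it, add it) might look roughly like the sketch below. It is an illustration under assumptions, not code from either RFC: the `NameIdMap` class, the table layout and the table name are hypothetical, a plain dict stands in for a cache such as APC or memcached, and SQLite stands in for the production database.

```python
import sqlite3


class NameIdMap:
    """Maps names (e.g. content model names) to small integer ids.

    Hypothetical helper: ids are assigned by an auto-increment column,
    so nobody has to reserve or manage them by hand.
    """

    def __init__(self, conn, table):
        # `table` is interpolated into SQL, so it must be a trusted identifier.
        self._conn = conn
        self._table = table
        self._cache = {}  # stand-in for a shared cache
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "name TEXT NOT NULL UNIQUE)"
        )
        conn.commit()

    def _select(self, name):
        row = self._conn.execute(
            f"SELECT id FROM {self._table} WHERE name = ?", (name,)
        ).fetchone()
        return None if row is None else row[0]

    def get_id(self, name):
        # 1. Cache hit: no database access at all.
        cached = self._cache.get(name)
        if cached is not None:
            return cached
        # 2. Cache miss: check the database.
        found = self._select(name)
        if found is None:
            # 3. Not in the database either: add it. OR IGNORE tolerates
            #    another writer having inserted the same name first.
            self._conn.execute(
                f"INSERT OR IGNORE INTO {self._table} (name) VALUES (?)",
                (name,),
            )
            self._conn.commit()
            found = self._select(name)
        self._cache[name] = found
        return found


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    models = NameIdMap(db, "content_models")
    print(models.get_id("wikitext"), models.get_id("json"))
```

Because the mapping is persisted in a table, a lost cache only costs one extra lookup per name, and anyone working with raw database replicas can still resolve the integers.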
People present (lines said)
- DanielK_WMDE_ (98)
- jynus (87)
- brion (45)
- TimStarling (33)
- robla (27)
- gwicke (26)
- anomie (20)
- legoktm (13)
- stashbot (9)
- James_F (4)
- SMalyshev (3)
- wm-labs-meetbot` (3)
- Scott_WUaS (3)
- ori (2)
- tgr (1)
- aude (1)
Full log
1 | 21:03:05 <robla> #startmeeting ArchCom content model storage (T105652) |
---|---|
2 | 21:03:05 <wm-labs-meetbot`> Meeting started Wed Aug 17 21:03:05 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot. |
3 | 21:03:05 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. |
4 | 21:03:05 <wm-labs-meetbot`> The meeting name has been set to 'archcom_content_model_storage__t105652_' |
5 | 21:03:05 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652 |
6 | 21:03:15 <robla> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ |
7 | 21:03:31 <robla> #link https://phabricator.wikimedia.org/E261 |
8 | 21:03:47 <robla> #link https://phabricator.wikimedia.org/T105652 |
9 | 21:03:48 <DanielK_WMDE_> #link https://phabricator.wikimedia.org/T105652 |
10 | 21:03:54 <DanielK_WMDE_> #link https://phabricator.wikimedia.org/T142980 |
11 | 21:03:58 <DanielK_WMDE_> oops :) |
12 | 21:04:08 * aude waves |
13 | 21:04:16 <robla> #link https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_ 's revised proposal) |
14 | 21:04:33 <TimStarling> smallint = 16 bits? |
15 | 21:04:46 <DanielK_WMDE_> So, today I would like to discuss if and how we want to modify the way we store meta-data about revision content, in particular, how and where we store the content model and format. |
16 | 21:04:56 <DanielK_WMDE_> The RFC legoktm proposed last year aimed to make the storage of content model and format more efficient (T105652). I'm concerned that the solution that was approved then would need to be reverted to add support for multiple content slots per revision (T107595). |
17 | 21:04:57 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652 |
18 | 21:04:57 <stashbot> T107595: [RFC] Multi-Content Revisions - https://phabricator.wikimedia.org/T107595 |
19 | 21:05:06 <James_F> TimStarling: Yes, per https://dev.mysql.com/doc/refman/5.5/en/integer-types.html |
20 | 21:05:23 <DanielK_WMDE_> While the problems are unrelated, the solutions overlap. So I propose to kill two birds with one stone, and add a new table for content meta-data that will use the new efficient way to represent content model anf format (T142980). |
21 | 21:05:24 <stashbot> T142980: RFC: Create a content meta-data table - https://phabricator.wikimedia.org/T142980 |
22 | 21:05:28 <DanielK_WMDE_> The new table will also allow us to have more than one content "slot" per revision. And we won't have to add any columns to page, revision, or archive. |
23 | 21:05:51 <DanielK_WMDE_> My goal for this meeting is to decide whether we want to implement legoktm's original proposal, or the modified one with the extra table. |
24 | 21:05:59 <gwicke> is jynus around? |
25 | 21:06:13 <jynus> I am |
26 | 21:06:22 <gwicke> ah, great! |
27 | 21:06:41 <robla> #info primary question to resolve: do a) legoktm 's original T105652 b) DanielK_WMDE_ 's modification T142980 c) none of the above (stay with status quo) |
28 | 21:06:41 <brion> So I'm not sure we'd need to add anything to page anyway (can join to revision) but the duplication between archive and revision is annoying and I like removing it by partially normalizing |
29 | 21:06:42 <DanielK_WMDE_> do we have jcrespo here? |
30 | 21:06:43 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652 |
31 | 21:06:43 <stashbot> T142980: RFC: Create a content meta-data table - https://phabricator.wikimedia.org/T142980 |
32 | 21:06:57 <jynus> I am that that you call jcrespo |
33 | 21:07:08 <gwicke> jynus: do you expect us to use compression in the foreseeable future, and do you expect the extra joins to be faster than decompression? |
34 | 21:07:21 <jynus> compression? |
35 | 21:07:22 <DanielK_WMDE_> jynus: ah, sorry. good to have you here! |
36 | 21:07:30 <gwicke> oh, right... /me is clearly consused |
37 | 21:07:32 <gwicke> ;) |
38 | 21:07:40 <Scott_WUaS> :) |
39 | 21:07:45 <gwicke> confused, even |
40 | 21:07:59 <gwicke> anyway, that would be my main question about either proposal |
41 | 21:08:21 <robla> jynus: thanks for being here this evening! |
42 | 21:08:27 <jynus> do you mean normalization? |
43 | 21:08:34 <legoktm> my original plan wasn't to do joins, but to store the id => string mapping in a cache like APC since it would be mostly static once initialized |
44 | 21:08:50 <TimStarling> there are a few things that concern me |
45 | 21:09:05 <DanielK_WMDE_> yea, i also forsee no joins for resolving the numeric ids. |
46 | 21:09:22 <DanielK_WMDE_> if we have an extra content table, we'd have an extra join though, for many use cases. |
47 | 21:09:24 <robla> #info 14:08:34 <legoktm> my original plan wasn't to do joins, but to store the id => string mapping in a cache like APC since it would be mostly static once initialized |
48 | 21:09:26 <gwicke> without joins, it sounds like we would need to manually maintain some mapping |
49 | 21:09:32 <brion> So just joins on (rvision, content) |
50 | 21:09:33 <TimStarling> one is whether adding a new table with hundreds of millions of rows is justified, considering the performance implications due to loss of locality, compared to the proposal on the wiki page |
51 | 21:09:48 <gwicke> how would this work for extensibility? |
52 | 21:09:57 <DanielK_WMDE_> TimStarling: if we want MCR, we will have to do that anyway, sooner or later |
53 | 21:10:03 <TimStarling> another is the fact that ar_rev_id is not fully populated, there are 500k rows in enwiki.archive with ar_rev_id=null |
54 | 21:10:07 <gwicke> i.e, custom models etc |
55 | 21:10:16 <brion> gwicke: similar to namespace ids, though hopefully more consistently managed |
56 | 21:10:31 <brion> Or else some config map |
57 | 21:10:40 <gwicke> so we'd reserve numeric ranges to avoid conflicts? |
58 | 21:10:42 <jynus> whatever that doesn't require storing usless strings will be faster |
59 | 21:10:52 <TimStarling> some of those rows were created before the MW 1.5 and so there was never any rev_id for them in the first place |
60 | 21:10:53 <brion> TimStarling: good point - but those are all legacy rows with default content model |
61 | 21:11:01 <brion> So perhaps ok to have no matching row |
62 | 21:11:01 <DanielK_WMDE_> TimStarling: could we just assign them an unused rev_id? |
63 | 21:11:25 <brion> Heh |
64 | 21:11:28 <TimStarling> yes |
65 | 21:11:42 <DanielK_WMDE_> i'd vote for that, then |
66 | 21:12:05 <brion> That raises the specter perhaps of killing the demoralization between archive and revision by folding archive into rvision. That's a bigger issue tho |
67 | 21:12:24 <TimStarling> excellent freduian slip there |
68 | 21:12:28 <jynus> brion, I am all for that, but maybe out of the scope of this RFC |
69 | 21:12:29 <DanielK_WMDE_> brion: yea, i didn't really want to touch that today |
70 | 21:12:30 <robla> :-) |
71 | 21:12:34 <brion> Yeah |
72 | 21:12:40 <gwicke> could we test the performance options ahead of time, to validate our intuition? |
73 | 21:12:41 <brion> Baby steps! |
74 | 21:12:41 <legoktm> (there's a different RfC for that! https://www.mediawiki.org/wiki/Requests_for_comment/Page_deletion) |
75 | 21:13:19 <jynus> gwicke, sure, I can setup a demo if you need it, if you provide the code |
76 | 21:13:26 <brion> Gwicke good idea perhaps, since we expect to need this second table for multi content revisions in future even if we don't start itf off now |
77 | 21:13:40 <brion> Would be a similar but slightly different join |
78 | 21:13:49 <brion> On th rev id and the role |
79 | 21:14:03 <gwicke> do we have a db with a large-enough dataset that we could play with? |
80 | 21:14:16 <tgr> DanielK_WMDE_: unused or reserved? setting ar_rev_id to something that later gets assigned to an unrelated revision seems like asking for trouble |
81 | 21:14:22 <jynus> but, please do not fear joins without reasoning (even if your favorite storage system does not support them) |
82 | 21:14:28 <DanielK_WMDE_> in some cases, we will be able to join page directly to content, and skip the revision table. in such cases, we'd not even add a join. |
83 | 21:14:45 <TimStarling> just insert rows into revision and immediately delete them |
84 | 21:14:57 <TimStarling> or maybe even insert and rollback, maybe that works |
85 | 21:15:01 <gwicke> jynus: not fear, but it's good to measure before making decisions |
86 | 21:15:04 <DanielK_WMDE_> tgr: rev_id would have to be bumpt to something greater than any of the ids we used for ar_rev_id. |
87 | 21:15:21 <brion> Aaaaanyway were not doing that bit yet ;) |
88 | 21:15:31 <DanielK_WMDE_> #info ar_rev_id is not fully populated on enwiki. we can assign fresh revision ids though (and bump rev_id accordingly) |
89 | 21:15:39 <jynus> gwicke, sure, although I can guarantee no slowdown, not a huge improvement right now |
90 | 21:16:17 <gwicke> I'm especially curious if generic compression would achieve the same effect with less effort |
91 | 21:16:18 <DanielK_WMDE_> TimStarling: i think you can even just insert the number you want, and the auto.increment keeps going after that. so just insert & delete one row. |
92 | 21:16:31 <jynus> the main issues happen when dataset doesn't fit into memory, which is exactly what I blocked (as the initial rolling in was going to do) |
93 | 21:16:55 <brion> Ah DanielK_WMDE_ that reminds me |
94 | 21:16:57 <gwicke> memory being the page cache? |
95 | 21:17:08 <gwicke> or result set held in memory? |
96 | 21:17:20 <brion> Do we need legacy rows or are empty left joins assumed to mean namespace default? |
97 | 21:17:36 * jynus recommends gabriel reading about InnoDB buffer pool |
98 | 21:18:10 <gwicke> jynus: I am aware of that, but am not sure if we configure that to use most memory, or rely on page cache |
99 | 21:18:17 <DanielK_WMDE_> brion: we will need to construct legacy rows eventually, when we move the blob address into the content table. |
100 | 21:18:18 <gwicke> also, does it hold compressed pages, or decompressed ones? |
101 | 21:18:28 <DanielK_WMDE_> brion: whether we ewant/need it right away is up for discussion. |
102 | 21:18:40 <brion> Ah good |
103 | 21:18:54 <jynus> yes, this is the only thing I can give to this discussion |
104 | 21:19:08 <jynus> doing the schema change |
105 | 21:19:19 <brion> #info <DanielK_WMDE_> we will need to construct legacy rows eventually, when we move the blob address into the content table. |
106 | 21:19:20 <TimStarling> seems like this content table including MCR will be functionally equivalent to the text table |
107 | 21:19:27 <jynus> from one version to the other |
108 | 21:19:28 <robla> #info 14:16:31 <jynus> the main issues happen when dataset doesn't fit into memory, which is exactly what I blocked (as the initial rolling in was going to do) |
109 | 21:19:36 <jynus> is trivial |
110 | 21:19:41 <brion> Hmmmm |
111 | 21:19:42 <jynus> if |
112 | 21:19:42 <TimStarling> except with a one-to-many mapping from revision to content |
113 | 21:19:48 <jynus> a) the table is small |
114 | 21:19:53 <jynus> which I think it will be |
115 | 21:19:55 <James_F> TimStarling: Yes. |
116 | 21:20:04 <brion> TimStarling: do you envision changing text table instead? |
117 | 21:20:07 <jynus> b) the transactions that use them are small |
118 | 21:20:20 <DanielK_WMDE_> ok... can I assume that there is still consensus on representing content model and format as integers, and have a mappoing in the db and in memory? This was approved last year with legoktm's original proposal. |
119 | 21:20:28 <jynus> this is problematic, for example, for tables such as revision or commons image |
120 | 21:20:37 <brion> DanielK_WMDE_: yes I've been assuming that holds :) |
121 | 21:20:44 <anomie> TimStarling: Well, the content table doesn't allow storing the revision data directly. It's either revision→content→text, or revision→content→(external something) |
122 | 21:20:54 <jynus> this is the only thing that I can say that may be relatively helpful, the rest is for you to decide |
123 | 21:20:57 <TimStarling> brion: no |
124 | 21:21:09 <brion> ok |
125 | 21:21:32 <gwicke> DanielK_WMDE_: I'm still wishing we had performance data to back up the assertion that this is the most efficient way to improve performance, and is worth the hassle of managing ids manually |
126 | 21:21:35 <TimStarling> anomie: right, so you still need the text table, but do you keep text_flags etc.? |
127 | 21:21:57 <DanielK_WMDE_> gwicke: nobody is suggesting to manage them manually. |
128 | 21:22:04 <TimStarling> does ExternalStore continue to work only with text rows, or can it also work with content rows? |
129 | 21:22:18 <DanielK_WMDE_> gwicke: if no mapping is found in the in-memory mapping, you assign a new one from an auto-increment field. done. |
130 | 21:22:19 <brion> gwicke: so the alternatives are probably enums or another join to a table of mappings. They're all roughly equivalent theoretically but may perform different dunno |
131 | 21:22:26 <gwicke> DanielKWMDE: previous suggestion was to set them up in the config like namespaces |
132 | 21:22:44 <jynus> say no to enums on a table as large as revision |
133 | 21:22:50 <jynus> ok in other cases |
134 | 21:23:12 <DanielK_WMDE_> gwicke: that's not how i understand the approved rfc. let me check again... |
135 | 21:23:14 <brion> jynus: I always hear enums are cheap to change. Lies? :) |
136 | 21:23:20 <robla> #info Discussion of DanielK_WMDE_'s question: "can I assume that there is still consensus on representing content model and format as integers, and have a mappoing in the db and in memory?" |
137 | 21:23:23 <anomie> TimStarling: Is text_flags used for anything when text.old_text actually contains the content instead of an external-store address? If so, then it would still be used for that purpose when the text table is used at all. When external-store is used for everything, the address would be in cont_address and the text table shouldn't be needed at all. |
138 | 21:23:24 <jynus> they are cheap to add, brion |
139 | 21:23:24 <gwicke> so compression is definitely not achieving similar savings? |
140 | 21:23:31 <jynus> (items) |
141 | 21:23:53 <brion> Heh |
142 | 21:23:54 <jynus> but if you want to delete, it would be one of our most complex changes |
143 | 21:23:57 <anomie> (although "external store" is now named "blob store", I think) |
144 | 21:23:58 <brion> Yikes |
145 | 21:24:17 <legoktm> gwicke: well, yeah, storing them in config like namespaces would be more performant. but then you end up with a wikipage where extensions write down the ids their content models use, and you just cross your fingers and hope you don't conflict with anyone. Anyways, we discussed and rejected that last year.... |
146 | 21:24:25 <TimStarling> anomie: 1. yes, compression and legacy charset mapping 2. good question |
147 | 21:24:54 <DanielK_WMDE_> gwicke: hm, the old rfc doesn't really say. but it would be easy to do, i already wrote some code for this. no need for manually managing ids. |
148 | 21:25:07 <gwicke> DanielKWMDE, legoktm: okay, so it would be stored in the db, but some background task would stash the db data into some cache & update that when needed? |
149 | 21:25:35 <jynus> so, I said this many times: do not fear joins just for the sake of an extra table (but I do *not* care, config/table/whatever) |
150 | 21:25:35 <DanielK_WMDE_> gwicke: no background task. on a cache miss, check the db. if the db doesn't have it, add it. done. |
151 | 21:25:40 <brion> Why join when you can put a blob in memcached :) |
152 | 21:25:58 <legoktm> what DanielK_WMDE_ said. |
153 | 21:25:59 <brion> Quite effective for small sets like this yeah |
154 | 21:26:01 <gwicke> DanielK_WMDE_: that's "when needed" ;) |
155 | 21:26:14 <robla> #info <brion> jynus: I always hear enums are cheap to change. Lies? :) <jynus> they are cheap to add [...] but if you want to delete, it would be one of our most complex changes |
156 | 21:26:15 <gwicke> but yeah, that sounds doable |
157 | 21:26:26 <DanielK_WMDE_> #info re managing ids for content models etc: on a cache miss, check the db. if the db doesn't have it, add it. |
158 | 21:27:14 <jynus> for example, 2 selects, I guarantee you will be slower than 1 single query with a join |
159 | 21:27:20 <brion> I think people working with raw db replicas would appreciate having the mapping in a db table even if it's not used for joins in production |
160 | 21:27:38 <jynus> but again, not agains memcache/config/whatever |
161 | 21:27:52 <DanielK_WMDE_> brion: also, you need to persist the mapping, in case you lose the cache |
162 | 21:27:54 <DanielK_WMDE_> so... tentative agreement to content model and format as int? can we rule out option (c) then, and look at (a) vs (b)? |
163 | 21:27:58 <brion> Yeah |
164 | 21:28:09 <brion> Definitely |
165 | 21:28:12 <ori> well, ... |
166 | 21:28:15 <ori> (just kidding.) |
167 | 21:28:17 <brion> Lol |
168 | 21:28:20 <DanielK_WMDE_> heh :P |
169 | 21:28:38 <DanielK_WMDE_> #info tentative agreement to content model and format as int; we rule out option (c) then |
170 | 21:28:39 <jynus> which link has a b and c, sorry? |
171 | 21:28:55 <DanielK_WMDE_> jynus: <robla> #info primary question to resolve: do a) legoktm 's original T105652 b) DanielK_WMDE_ 's modification T142980 c) none of the above (stay with status quo) |
172 | 21:29:03 <gwicke> at this point I'm assuming that compression was tested & found to not be competitive |
173 | 21:29:04 <TimStarling> if you remember, the use of strings in the first place was quite a fraught compromise |
174 | 21:29:17 <legoktm> Yes, let's rule out c) |
175 | 21:29:19 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652 |
176 | 21:29:23 <TimStarling> I asked for integers from the outset, and Daniel bluntly refused |
177 | 21:29:36 <TimStarling> so obviously yes, I am still in favour of using integers |
178 | 21:29:39 <stashbot> T142980: RFC: Create a content meta-data table - https://phabricator.wikimedia.org/T142980 |
179 | 21:29:45 <jynus> I think a) -> b) is easy to do, why do we want to do b directly (genune question) |
180 | 21:29:59 <DanielK_WMDE_> TimStarling: o_O i actually had that implemented, and changed it upon request... from... uh... i don't recall. |
181 | 21:30:12 <DanielK_WMDE_> TimStarling: my implementation did expose the in ids though. not as nice as we are planning for now |
182 | 21:30:18 <robla> #info <jynus> I think a) -> b) is easy to do, why do we want to do b directly (genune question) |
183 | 21:30:19 <jynus> may main concern is to get too blocked on a relatively complex feature |
184 | 21:30:21 <TimStarling> yeah, you started with integers, switched to strings, then came to me and I said "use integers" |
185 | 21:31:04 <TimStarling> and you said it was some reviewer who told you to use strings, and I said go back to them and tell them I said use integers |
186 | 21:31:09 <anomie> jynus: To avoid having to change the primary key on the table from (cont_revision) to (cont_revision,cont_role) once the table has millions of rows. You'd probably be in the best position to say whether that'd be a big deal or not. |
187 | 21:31:26 <TimStarling> and you said you didn't really like integers in the first place |
188 | 21:31:28 <jynus> are we talking about the small table? |
189 | 21:31:37 <DanielK_WMDE_> TimStarling: i kind of remember it the other way... but whatever. no need to fight about that now. let's get rid of them. |
190 | 21:31:43 <jynus> the one that contains formats? |
191 | 21:31:50 <brion> Ok so something like b will be needed later for multi content, but not strictly yet. If transitions from current to a, and a to b-prime are easy, then there's not a huge need to start with b |
192 | 21:31:55 <DanielK_WMDE_> yea, i didn't like them to be exposed... |
193 | 21:32:07 <anomie> jynus: Oh, wait. Actually, (a) is putting rev_format_id and rev_model_id into revision, while (b) is making the content table. |
194 | 21:32:38 <jynus> then I vote for d |
195 | 21:32:51 * anomie got confused between (a) and (b)'s basic versus medium versions |
196 | 21:33:01 <DanielK_WMDE_> jynus: what's (d)? |
197 | 21:33:22 <jynus> stick to the original plan, then introduce the slot stuff afterwards |
198 | 21:33:52 <jynus> I do not think a and be are exclusive? |
199 | 21:33:58 <DanielK_WMDE_> jynus: a) -> b) involved assing columns to page, revision, and archive, and then removing the same columns again later. i assumed that's not something you like. |
200 | 21:34:09 <anomie> jynus: So that's adding rev_format_id/rev_model_id (and similar to other tables) now, then later on make the content table and migrate to that? |
201 | 21:34:11 <DanielK_WMDE_> it's defintly disruptive for tools on labs, and for extensions |
202 | 21:34:27 <jynus> WHAT? |
203 | 21:34:46 <jynus> ah, the extension thing could be a valid reason |
204 | 21:35:04 <legoktm> er, why is it disruptive for extensions? |
205 | 21:35:05 <jynus> finally I get a good answer to my original question |
206 | 21:35:20 <jynus> I do not know if it is valid, the rest can tell me |
207 | 21:35:23 <DanielK_WMDE_> it also means messing with the same code again, once changing it to use a sifferent column, then changing it to use a different table |
208 | 21:35:55 <DanielK_WMDE_> legoktm: for extensions that look at the database directly. rare, i agree. not rare for tools on labs. |
209 | 21:35:57 <jynus> wait, but the original change doesn't involve a schema change, right? |
210 | 21:36:01 <DanielK_WMDE_> actually, it's the reason we have labs |
211 | 21:36:07 <jynus> just a new table? |
212 | 21:36:31 <DanielK_WMDE_> jynus: the original change calls for columns to be added to page, revision, and archive. two columns each. |
213 | 21:36:37 <DanielK_WMDE_> to the largest tables we have |
214 | 21:36:40 <legoktm> DanielK_WMDE_: when I grepped a year ago, the only places that queried those columns directly was in core |
215 | 21:36:54 <DanielK_WMDE_> jynus: which would then be removed again, when we have the content table. |
216 | 21:37:12 <jynus> weren't we going to reuse the existing columns and redefine its meaning? |
217 | 21:37:16 <DanielK_WMDE_> legoktm: good to know, yea. extensions probably wouldn't |
218 | 21:37:20 <DanielK_WMDE_> jynus: no. |
219 | 21:37:25 <SMalyshev> if we plan MCRs anytime soon, I think doing work that will have to be redone with MCRs now is not smart |
220 | 21:37:34 <DanielK_WMDE_> jynus: that would be even more disruptive |
221 | 21:37:45 <James_F> SMalyshev: We do. |
222 | 21:37:45 <jynus> but I will do that work on production (?) |
223 | 21:37:55 <jynus> and labs |
224 | 21:38:12 <jynus> obiously you are the code magicians |
225 | 21:38:35 <DanielK_WMDE_> jynus: the problem with labs is not applying the schema change. the problem is braking the tools that use the schema. |
226 | 21:38:51 <DanielK_WMDE_> more changes -> more breakage |
227 | 21:39:00 <brion> SMalyshev: I think transition work would be similar difficulty in either case on the mw internals |
228 | 21:39:05 <jynus> I call you this now, labs is not an issue |
229 | 21:39:14 <jynus> and MCR will break it anyway |
230 | 21:39:19 <DanielK_WMDE_> James_F: we do what? |
231 | 21:39:29 <legoktm> DanielK_WMDE_: I don't think we've ever promised database stability for labs. and I don't think we should worry about it tbh. |
232 | 21:39:35 <legoktm> database schema stability* |
233 | 21:39:36 <jynus> +1 |
234 | 21:39:54 <jynus> labs is too broken anyway (do not quote me on that) |
235 | 21:39:56 <SMalyshev> brion: can't we make a model now that will make it easier? I mean sure we'll have to do work, but we could prepare for it |
236 | 21:40:02 <jynus> (it is all DBA's fault) |
237 | 21:40:25 <DanielK_WMDE_> i'd like incremental steps, instead of doing one step, then undoing half of it again to do the second step. |
238 | 21:40:34 <jynus> I think the important question is |
239 | 21:40:36 <anomie> jynus: All the options add two small tables to map int->string for model and for format. Then option (a) adds new int columns to several tables, populates them based on the existing string columns, then drops the string columns. Option (b) puts the new int columns into a "content" table with a FK back to the revision/archive tables, populates it from the existing string columns, then drops the string columns. Eventually we'll need to do option |
240 | 21:40:36 <anomie> (b) for the multi-content revisions anyway. |
241 | 21:40:52 <jynus> independetly of what could be done |
242 | 21:41:06 <jynus> who is willing to work on this? |
243 | 21:41:11 <jynus> on both cases |
244 | 21:41:13 <jynus> ? |
245 | 21:41:35 <jynus> or on either case, I mean |
246 | 21:41:41 <DanielK_WMDE_> me for option (b). not sure i can justify spending time on option (a), but i might. |
247 | 21:41:48 <SMalyshev> willing as in "interested" or "has time" or 1&2? ;) |
248 | 21:41:51 <DanielK_WMDE_> i can definitly help with reviewing in either case |
249 | 21:42:00 <brion> :) |
250 | 21:42:01 <jynus> (note that I will do what devs told me, no matter what, infrastructure wise) |
251 | 21:42:10 <anomie> There's a side question as to whether option (b)'s primary key should start out as just (cont_revision), or if we should make it (cont_revision,cont_role) right away and just let cont_role always be 1 until we make the infrastructure for different values. |
252 | 21:42:28 <brion> I'll pitch in if needed either way, but I've got other projects backing up :) |
253 | 21:42:36 <legoktm> I am volunteering to do the implementation for (a), and can help review (b)/whatever |
254 | 21:42:49 <DanielK_WMDE_> anomie: actually, i updated that - (cont_revision,cont_role) would be a unique key, we'd add an auto-increment field as a primary, for later use |
255 | 21:43:24 <DanielK_WMDE_> legoktm: would you also be willing to do the bit that is the same for (a) and (b), namely the actual mapping stuff? |
256 | 21:43:25 <anomie> DanielK_WMDE_: Any particular reason to add an id field instead of a two-int PK? |
257 | 21:44:06 <DanielK_WMDE_> anomie: yes, we can re-use content rows for multiple revisions that way. some slots will update only rarely. it's another bit of normalization. |
258 | 21:44:10 <jynus> give your thoughts, I do not have much to add, I personally incline towards small incremental changes if it was possible, but I cannot fairly enter here if I am not going to work on the code |
259 | 21:44:15 <DanielK_WMDE_> anomie: we can also add that later, but the table is big. |
260 | 21:44:22 <brion> So as long as we have realistic transition plan, and there's no major db-related reason to prefer one or the other, my main concern is we get the updates done. |
261 | 21:44:49 <brion> If we make the content table in a way that we don't have to change it when adding multi content then that is a certain niceness |
262 | 21:44:58 <DanielK_WMDE_> jynus: i'm with you regarding incremental changes. i just feel this isn't incremental, but two forward, one back, two forward... that's what i'm trying to avoid |
263 | 21:45:05 <legoktm> DanielK_WMDE_: yeah. |
264 | 21:45:08 <anomie> DanielK_WMDE_: How would that (re-using content rows) work? |
265 | 21:45:10 <TimStarling> small changes would be fine except that DanielK_WMDE_ is working on this because he is interested in MCR |
266 | 21:45:34 <TimStarling> putting the MCR fields in the table from the outset would better respect that motivation |
267 | 21:45:35 <jynus> TimStarling, but he just said he wouldn't help on b, only lego would on a |
268 | 21:45:39 <DanielK_WMDE_> anomie: another table, relating revisions to content. it's an option for later though. |
269 | 21:46:02 <jynus> sorry, I got confused, but I hope I got understood |
270 | 21:46:20 <DanielK_WMDE_> jynus: i would help with a, but i probably can't drive it. |
271 | 21:46:22 <TimStarling> <DanielK_WMDE_> me for option (b). not sure i can justify spending time on option (a), but i might. |
272 | 21:46:25 <brion> legoktm: any concerns on later difficulties if we do two steps? (A, then later b when we need multi)? |
273 | 21:46:34 <anomie> DanielK_WMDE_: All problems can be solved by adding another layer of indirection? ;) |
274 | 21:46:38 <TimStarling> Daniel will work on option b because it is a step towards MCR, which is fair enough |
275 | 21:46:50 <jynus> DanielK_WMDE_, but does b include all a functionality, would you work on a's functionality on b? |
276 | 21:46:51 <James_F> It's the Java way™! Oh, wait. ;-) |
277 | 21:46:51 <legoktm> brion: not off the top of my head |
278 | 21:46:54 <DanielK_WMDE_> anomie: and then you have a problem pointer... |
279 | 21:46:57 <brion> Ok |
280 | 21:47:29 <DanielK_WMDE_> jynus: i would, but i'd be grateful if legoktm would help |
281 | 21:47:41 <jynus> ok, what does lego have to say about that? |
282 | 21:47:44 * robla plans to move this RFC (T105652) to the "in progress" column on the ArchCom-RFC board at the conclusion of this meeting, pointing to this meeting (E261) |
283 | 21:47:51 * anomie wonders whether the complexity of revision→revision_content_mapping→content would be worth the savings in not duplicating cont_address and other fields. |
284 | 21:48:12 <DanielK_WMDE_> anomie: yes, i wonder about that too. it's an option, not a plan |
285 | 21:48:31 <legoktm> me helping in b)? I think so yeah |
286 | 21:48:38 <DanielK_WMDE_> jynus: <legoktm> I am volunteering to do the implementation for (a), and can help review (b)/whatever |
287 | 21:48:48 <DanielK_WMDE_> \o/ |
288 | 21:48:53 <brion> Woo |
289 | 21:48:58 <robla> it seems to me there is very cautious consensus around option "b", with some skepticism about who will do the work |
290 | 21:49:01 <jynus> ok, we still have anomie's issue |
291 | 21:49:06 <jynus> what about that |
292 | 21:49:17 <DanielK_WMDE_> robla: the tricky part is the migration code |
293 | 21:49:21 <jynus> can the scope be reduced to contemplate that? |
294 | 21:49:24 <DanielK_WMDE_> the rest should be pretty easy |
295 | 21:49:43 <DanielK_WMDE_> jynus: which issue? |
296 | 21:49:55 <jynus> anomie wonders whether the complexity of revision→revision_content_mapping→content would be worth the savings in not duplicating cont_address and other fields. |
297 | 21:49:59 <robla> DanielK_WMDE_: that's why I'm proposing "in progress" as a state, rather than "approved" |
298 | 21:50:41 <DanielK_WMDE_> jynus: the current proposal is revision -> content. revision -> r-c-mapping -> content is a possibility for later |
299 | 21:50:45 <DanielK_WMDE_> we are not committing to that |
300 | 21:50:48 <TimStarling> I don't know what revision_content_mapping is |
301 | 21:51:25 <DanielK_WMDE_> TimStarling: a way to re-use content entries for multiple revisions. outside the scope of this rfc. |
302 | 21:51:40 <anomie> For the record, I'm sure I'll do some stuff on this, although whether that stuff involves writing code or just reviewing it I don't know at this point. Too many people all writing code can get in each others' way. |
303 | 21:51:43 <DanielK_WMDE_> it might be nice, or it might be horrible to go that way, not sure yet. |
304 | 21:51:48 <TimStarling> ah right |
305 | 21:51:53 <DanielK_WMDE_> anomie: thanks! |
306 | 21:52:03 <jynus> ok, we need more "buts", anyone? |
307 | 21:52:12 <TimStarling> well, the existing rev_text_id allows that, it is used for null revisions |
308 | 21:52:19 <DanielK_WMDE_> we have not discussed whether the new table should just have the minimum fields for now, or the full set needed for MCR |
309 | 21:52:34 <DanielK_WMDE_> but I guess I can sort that out with jynus later. or we have another session on that |
310 | 21:52:46 <jynus> adding new columns on a small table with low traffic is easy |
311 | 21:53:16 <TimStarling> it will be a small table? with left joins? |
312 | 21:53:18 <DanielK_WMDE_> TimStarling: that allows the re-use of blobs for multiple content entries. not quite the same. but yea - re-using content meta-data may not be worth the trouble. |
313 | 21:53:30 <jynus> (it is a bit of a simplification, go to https://wikitech.wikimedia.org/wiki/Schema_changes for the full version) |
314 | 21:53:39 <DanielK_WMDE_> jynus: the content table is going to be LARGE! |
315 | 21:53:42 <anomie> jynus: What about on a table with individually small rows, but with as many rows as the revision table? |
316 | 21:53:42 <robla> #info <DanielK_WMDE_> we have not discussed whether the new table should just have the minimum fields for now, or the full set needed for MCR <jynus> adding new columns on a small table with low traffic is easy |
317 | 21:53:43 <brion> It'll have lots of entries but small rows |
318 | 21:53:45 <TimStarling> for MCR it presumably needs to be fully populated |
319 | 21:53:47 <DanielK_WMDE_> jynus: larger than revision. that's the point! |
320 | 21:54:02 <TimStarling> small per row though |
321 | 21:54:07 <DanielK_WMDE_> yes, true |
322 | 21:54:10 <jynus> shouldn't we have visible code at that point? |
323 | 21:54:17 <anomie> And as active as the revision table. |
324 | 21:54:44 <DanielK_WMDE_> jynus: for the database? sure, there's a patch on gerrit, and i send you sample db dumps. they are not exactly like the proposal, but quite close. |
325 | 21:54:53 <jynus> no, no |
326 | 21:54:56 <jynus> mediawiki code |
327 | 21:55:11 <Scott_WUaS> Great All - thanks! |
328 | 21:55:34 <DanielK_WMDE_> jynus: i have been working on that with brion. lots of moving parts. it's a bit of a hen-and-egg thing |
329 | 21:55:41 <jynus> we will deploy whatever we have, I suppose, and whatever is easy to migrate? |
330 | 21:55:41 <brion> :) |
331 | 21:55:41 <DanielK_WMDE_> you can't write the code before the db schema is final. |
332 | 21:55:55 <jynus> the migration part is the key |
333 | 21:55:57 <DanielK_WMDE_> you can't decide on a schema if it's not clear how it will impact the php code |
334 | 21:56:32 <jynus> I am all for converting tables rather than migrating them (I think we discussed that on the image issue) |
335 | 21:56:34 <DanielK_WMDE_> jynus: yes, i agree. that's the tricky part. luckily, i have some experience with that, but i will be needing your help. |
336 | 21:57:04 <jynus> I do not think we can give a decision with a plan(?) |
337 | 21:57:12 <jynus> *without |
338 | 21:57:38 <DanielK_WMDE_> jynus: well you can't make a plan when there is no decision on the goal :) |
339 | 21:58:07 <TimStarling> can I just repeat that I am putting my 2c in for MCR fields in the initial content table, with slot=1 always |
340 | 21:58:17 <jynus> I am not blocking that, I am asking what is the general mood, I said I will not take part on this decision, db is not a blocker here |
341 | 21:58:17 <DanielK_WMDE_> jynus: writing complete migration code before it's even clear whether the migration is wanted is not a thing i like to do |
342 | 21:58:20 <TimStarling> because I think that will help motivate Daniel to actually write MCR |
343 | 21:58:31 <brion> :)))) |
344 | 21:58:42 <DanielK_WMDE_> TimStarling is a smart guy :) |
345 | 21:59:05 * anomie likes TimStarling's reasoning |
346 | 21:59:12 <robla> #info 14:58:08 <TimStarling> can I just repeat that I am putting my 2c in for MCR fields in the initial content table, with slot=1 always |
347 | 21:59:13 <Scott_WUaS> Aye :) |
348 | 21:59:16 <jynus> decision on the last question, not on the whole issue |
349 | 21:59:26 <DanielK_WMDE_> so, i'll work on finalizing the schema, and propose migration code |
350 | 21:59:32 <jynus> (I was only talking about the last question, I cannot answer that) |
351 | 21:59:34 <DanielK_WMDE_> then we talk again, here or on wikitech-l |
352 | 21:59:54 <jynus> because I genuinely do not know |
353 | 22:00:30 <DanielK_WMDE_> sorry, which last question? |
354 | 22:00:43 <jynus> the one about the extra columns |
355 | 22:00:53 <robla> ok, I think this was a good meeting; we'll keep using T105652 and wikitech-l for followup. sound good? |
356 | 22:00:58 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652 |
357 | 22:01:05 <brion> Yay |
358 | 22:01:05 <DanielK_WMDE_> robla: yep :) |
359 | 22:01:23 <DanielK_WMDE_> jynus: yea... i think i'll shoot for the "medium" proposal for now. |
360 | 22:01:32 <jynus> so can we deploy the schema change soon? |
361 | 22:02:05 <DanielK_WMDE_> jynus: can you confirm that adding fields to a table that is much like the revision table isn't a problem? |
362 | 22:02:19 <robla> jynus: if/when you understand and agree; we don't need to try to force that now |
363 | 22:02:34 <jynus> I cannot confirm it will not be a problem |
364 | 22:02:43 <jynus> schema changes on revision are hard |
365 | 22:02:53 <anomie> DanielK_WMDE_: Can we qualify that? revision has large rows and many rows. The new table will have small(er) rows, but still many. |
366 | 22:02:55 <robla> I think we've settled that it's worth it for DanielK_WMDE_ to keep working on a proposal and some code/etc |
367 | 22:03:09 <DanielK_WMDE_> jynus: ok. the content table will have roughly the same dimensions as revision. that's why i want to add all fields right away, instead of incrementing. |
368 | 22:03:11 <robla> (with legoktm , et al) |
369 | 22:03:15 <DanielK_WMDE_> let's discuss that on the list |
370 | 22:03:26 * robla will hit #endmeeting in 60 seconds |
371 | 22:03:39 <anomie> (it might be easier to answer that question with a straw schema) |
372 | 22:03:54 <robla> feel free to continue discussion on #wikimedia-tech |
373 | 22:04:24 <robla> thanks everyone! |
374 | 22:04:26 <robla> o/ |
375 | 22:04:31 <robla> #endmeeting |
Other meetings
Architecture meetings | ||
---|---|---|
13:00 PT ArchCom Planning Meetings | upcoming | all since 2016-03-30 |
14:00 PT ArchCom-RFC Meetings | upcoming | all since 2015-09-09 |