ArchCom RFC Meeting W33: content model storage (2016-08-17, #wikimedia-office)
ActivePublic
Actions

Hosted by daniel on Aug 17 2016, 9:00 PM - 10:00 PM.

Description

Agenda

Location: #wikimedia-office IRC channel
Meeting type: Consensus for final comment (on the three options).
Time: Weekly, Wednesday 21:00 UTC (2pm PDT, 23:00 CEST)
Topic: T105652 Content model storage

In this meeting, we discussed the storage of content model and format in the database. A proposal (T105652) was approved a year ago, but never implemented. Three options were discussed:

Implement T105652: RfC: Content model storage as originally approved a year ago, with new columns in the page, revision, and archive tables.
Implement a modified version, T142980: RFC: Create a content meta-data table, with one extra table but no new columns, to also cater to the needs of multi content revisions.
take a step back and reconsider.

We agreed that Daniel should continue to develop the plan articulated in T142980.

Meeting summary

Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ (robla, 21:03:15)
- LINK: https://phabricator.wikimedia.org/E261 (robla, 21:03:31)
- LINK: https://phabricator.wikimedia.org/T105652 (robla, 21:03:47)
- LINK: https://phabricator.wikimedia.org/T105652 (DanielK_WMDE_, 21:03:48)
- LINK: https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_, 21:03:54)
- LINK: https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_ 's revised proposal) (robla, 21:04:16)
- primary question to resolve: do a) legoktm 's original T105652 b) DanielK_WMDE_ 's modification T142980 c) none of the above (stay with status quo) (robla, 21:06:41)
- 14:08:34Â <legoktm>Â my original plan wasn't to do joins, but to store the id => string mapping in a cache like APC since it would be mostly static once initialized (robla, 21:09:24)
- ar_rev_id is not fully populated on enwiki. we can assign fresh revision ids though (and bump rev_id accordingly) (DanielK_WMDE_, 21:15:31)
- <DanielK_WMDE_> we will need to construct legacy rows eventually, when we move the blob address into the content table. (brion, 21:19:19)
- 14:16:31Â <jynus>Â the main issues happen when dataset doesn't fit into memory, which is exactly what I blocked (as the initial rolling in was going to do) (robla, 21:19:28)
- Discussion of DanielK_WMDE_'s question: "can I assume that there is still consensus on representing content model and format as integers, and have a mappoing in the db and in memory?" (robla, 21:23:20)
- <brion>Â jynus: I always hear enums are cheap to change. Lies? :) <jynus>Â they are cheap to add [...]Â but if you want to delete, it would be one of our most complex changes (robla, 21:26:14)
- re managing ids for content models etc: on a cache miss, check the db. if the db doesn't have it, add it. (DanielK_WMDE_, 21:26:26)
- tentative agreement to content model and format as int; we rule out option (c) then (DanielK_WMDE_, 21:28:38)
- <jynus>Â I think a) -> b) is easy to do, why do we want to do b directly (genune question) (robla, 21:30:18)
- <DanielK_WMDE_>Â we have not discussed whether the new table should just have the minimum fields for now, or the full set needed for MCR <jynus>Â adding new columns on a small table with low traffic is easy (robla, 21:53:42)
- 14:58:08Â <TimStarling>Â can I just repeat that I am putting my 2c in for MCR fields in the initial content table, with slot=1 always (robla, 21:59:12)

Meeting ended at 22:04:31 UTC.

People present (lines said)

DanielK_WMDE_ (98)
jynus (87)
brion (45)
TimStarling (33)
robla (27)
gwicke (26)
anomie (20)
legoktm (13)
stashbot (9)
James_F (4)
SMalyshev (3)
wm-labs-meetbot` (3)
Scott_WUaS (3)
ori (2)
tgr (1)
aude (1)

Full log

P3846 ArchCom-RFC-2016W33-irc-E261.txt

1	21:03:05 <robla> #startmeeting ArchCom content model storage (T105652)
2	21:03:05 <wm-labs-meetbot`> Meeting started Wed Aug 17 21:03:05 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
3	21:03:05 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
4	21:03:05 <wm-labs-meetbot`> The meeting name has been set to 'archcom_content_model_storage__t105652_'
5	21:03:05 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652
6	21:03:15 <robla> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) \| Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
7	21:03:31 <robla> #link https://phabricator.wikimedia.org/E261
8	21:03:47 <robla> #link https://phabricator.wikimedia.org/T105652
9	21:03:48 <DanielK_WMDE_> #link https://phabricator.wikimedia.org/T105652
10	21:03:54 <DanielK_WMDE_> #link https://phabricator.wikimedia.org/T142980
11	21:03:58 <DanielK_WMDE_> oops :)
12	21:04:08 * aude waves
13	21:04:16 <robla> #link https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_ 's revised proposal)
14	21:04:33 <TimStarling> smallint = 16 bits?
15	21:04:46 <DanielK_WMDE_> So, today I would like to discuss if and how we want to modify the way we store meta-data about revision content, in particular, how and where we store the content model and format.
16	21:04:56 <DanielK_WMDE_> The RFC legoktm proposed last year aimed to make the storage of content model and format more efficient (T105652). I'm concerned that the solution that was approved then would need to be reverted to add support for multiple content slots per revision (T107595).
17	21:04:57 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652
18	21:04:57 <stashbot> T107595: [RFC] Multi-Content Revisions - https://phabricator.wikimedia.org/T107595
19	21:05:06 <James_F> TimStarling: Yes, per https://dev.mysql.com/doc/refman/5.5/en/integer-types.html
20	21:05:23 <DanielK_WMDE_> While the problems are unrelated, the solutions overlap. So I propose to kill two birds with one stone, and add a new table for content meta-data that will use the new efficient way to represent content model anf format (T142980).
21	21:05:24 <stashbot> T142980: RFC: Create a content meta-data table - https://phabricator.wikimedia.org/T142980
22	21:05:28 <DanielK_WMDE_> The new table will also allow us to have more than one content "slot" per revision. And we won't have to add any columns to page, revision, or archive.
23	21:05:51 <DanielK_WMDE_> My goal for this meeting is to decide whether we want to implement legoktm's original proposal, or the modified one with the extra table.
24	21:05:59 <gwicke> is jynus around?
25	21:06:13 <jynus> I am
26	21:06:22 <gwicke> ah, great!
27	21:06:41 <robla> #info primary question to resolve: do a) legoktm 's original T105652 b) DanielK_WMDE_ 's modification T142980 c) none of the above (stay with status quo)
28	21:06:41 <brion> So I'm not sure we'd need to add anything to page anyway (can join to revision) but the duplication between archive and revision is annoying and I like removing it by partially normalizing
29	21:06:42 <DanielK_WMDE_> do we have jcrespo here?
30	21:06:43 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652
31	21:06:43 <stashbot> T142980: RFC: Create a content meta-data table - https://phabricator.wikimedia.org/T142980
32	21:06:57 <jynus> I am that that you call jcrespo
33	21:07:08 <gwicke> jynus: do you expect us to use compression in the foreseeable future, and do you expect the extra joins to be faster than decompression?
34	21:07:21 <jynus> compression?
35	21:07:22 <DanielK_WMDE_> jynus: ah, sorry. good to have you here!
36	21:07:30 <gwicke> oh, right... /me is clearly consused
37	21:07:32 <gwicke> ;)
38	21:07:40 <Scott_WUaS> :)
39	21:07:45 <gwicke> confused, even
40	21:07:59 <gwicke> anyway, that would be my main question about either proposal
41	21:08:21 <robla> jynus: thanks for being here this evening!
42	21:08:27 <jynus> do you mean normalization?
43	21:08:34 <legoktm> my original plan wasn't to do joins, but to store the id => string mapping in a cache like APC since it would be mostly static once initialized
44	21:08:50 <TimStarling> there are a few things that concern me
45	21:09:05 <DanielK_WMDE_> yea, i also forsee no joins for resolving the numeric ids.
46	21:09:22 <DanielK_WMDE_> if we have an extra content table, we'd have an extra join though, for many use cases.
47	21:09:24 <robla> #info 14:08:34 <legoktm> my original plan wasn't to do joins, but to store the id => string mapping in a cache like APC since it would be mostly static once initialized
48	21:09:26 <gwicke> without joins, it sounds like we would need to manually maintain some mapping
49	21:09:32 <brion> So just joins on (rvision, content)
50	21:09:33 <TimStarling> one is whether adding a new table with hundreds of millions of rows is justified, considering the performance implications due to loss of locality, compared to the proposal on the wiki page
51	21:09:48 <gwicke> how would this work for extensibility?
52	21:09:57 <DanielK_WMDE_> TimStarling: if we want MCR, we will have to do that anyway, sooner or later
53	21:10:03 <TimStarling> another is the fact that ar_rev_id is not fully populated, there are 500k rows in enwiki.archive with ar_rev_id=null
54	21:10:07 <gwicke> i.e, custom models etc
55	21:10:16 <brion> gwicke: similar to namespace ids, though hopefully more consistently managed
56	21:10:31 <brion> Or else some config map
57	21:10:40 <gwicke> so we'd reserve numeric ranges to avoid conflicts?
58	21:10:42 <jynus> whatever that doesn't require storing usless strings will be faster
59	21:10:52 <TimStarling> some of those rows were created before the MW 1.5 and so there was never any rev_id for them in the first place
60	21:10:53 <brion> TimStarling: good point - but those are all legacy rows with default content model
61	21:11:01 <brion> So perhaps ok to have no matching row
62	21:11:01 <DanielK_WMDE_> TimStarling: could we just assign them an unused rev_id?
63	21:11:25 <brion> Heh
64	21:11:28 <TimStarling> yes
65	21:11:42 <DanielK_WMDE_> i'd vote for that, then
66	21:12:05 <brion> That raises the specter perhaps of killing the demoralization between archive and revision by folding archive into rvision. That's a bigger issue tho
67	21:12:24 <TimStarling> excellent freduian slip there
68	21:12:28 <jynus> brion, I am all for that, but maybe out of the scope of this RFC
69	21:12:29 <DanielK_WMDE_> brion: yea, i didn't really want to touch that today
70	21:12:30 <robla> :-)
71	21:12:34 <brion> Yeah
72	21:12:40 <gwicke> could we test the performance options ahead of time, to validate our intuition?
73	21:12:41 <brion> Baby steps!
74	21:12:41 <legoktm> (there's a different RfC for that! https://www.mediawiki.org/wiki/Requests_for_comment/Page_deletion)
75	21:13:19 <jynus> gwicke, sure, I can setup a demo if you need it, if you provide the code
76	21:13:26 <brion> Gwicke good idea perhaps, since we expect to need this second table for multi content revisions in future even if we don't start itf off now
77	21:13:40 <brion> Would be a similar but slightly different join
78	21:13:49 <brion> On th rev id and the role
79	21:14:03 <gwicke> do we have a db with a large-enough dataset that we could play with?
80	21:14:16 <tgr> DanielK_WMDE_: unused or reserved? setting ar_rev_id to something that later gets assigned to an unrelated revision seems like asking for trouble
81	21:14:22 <jynus> but, please do not fear joins without reasoning (even if your favorite storage system does not support them)
82	21:14:28 <DanielK_WMDE_> in some cases, we will be able to join page directly to content, and skip the revision table. in such cases, we'd not even add a join.
83	21:14:45 <TimStarling> just insert rows into revision and immediately delete them
84	21:14:57 <TimStarling> or maybe even insert and rollback, maybe that works
85	21:15:01 <gwicke> jynus: not fear, but it's good to measure before making decisions
86	21:15:04 <DanielK_WMDE_> tgr: rev_id would have to be bumpt to something greater than any of the ids we used for ar_rev_id.
87	21:15:21 <brion> Aaaaanyway were not doing that bit yet ;)
88	21:15:31 <DanielK_WMDE_> #info ar_rev_id is not fully populated on enwiki. we can assign fresh revision ids though (and bump rev_id accordingly)
89	21:15:39 <jynus> gwicke, sure, although I can guarantee no slowdown, not a huge improvement right now
90	21:16:17 <gwicke> I'm especially curious if generic compression would achieve the same effect with less effort
91	21:16:18 <DanielK_WMDE_> TimStarling: i think you can even just insert the number you want, and the auto.increment keeps going after that. so just insert & delete one row.
92	21:16:31 <jynus> the main issues happen when dataset doesn't fit into memory, which is exactly what I blocked (as the initial rolling in was going to do)
93	21:16:55 <brion> Ah DanielK_WMDE_ that reminds me
94	21:16:57 <gwicke> memory being the page cache?
95	21:17:08 <gwicke> or result set held in memory?
96	21:17:20 <brion> Do we need legacy rows or are empty left joins assumed to mean namespace default?
97	21:17:36 * jynus recommends gabriel reading about InnoDB buffer pool
98	21:18:10 <gwicke> jynus: I am aware of that, but am not sure if we configure that to use most memory, or rely on page cache
99	21:18:17 <DanielK_WMDE_> brion: we will need to construct legacy rows eventually, when we move the blob address into the content table.
100	21:18:18 <gwicke> also, does it hold compressed pages, or decompressed ones?
101	21:18:28 <DanielK_WMDE_> brion: whether we ewant/need it right away is up for discussion.
102	21:18:40 <brion> Ah good
103	21:18:54 <jynus> yes, this is the only thing I can give to this discussion
104	21:19:08 <jynus> doing the schema change
105	21:19:19 <brion> #info <DanielK_WMDE_> we will need to construct legacy rows eventually, when we move the blob address into the content table.
106	21:19:20 <TimStarling> seems like this content table including MCR will be functionally equivalent to the text table
107	21:19:27 <jynus> from one version to the other
108	21:19:28 <robla> #info 14:16:31 <jynus> the main issues happen when dataset doesn't fit into memory, which is exactly what I blocked (as the initial rolling in was going to do)
109	21:19:36 <jynus> is trivial
110	21:19:41 <brion> Hmmmm
111	21:19:42 <jynus> if
112	21:19:42 <TimStarling> except with a one-to-many mapping from revision to content
113	21:19:48 <jynus> a) the table is small
114	21:19:53 <jynus> which I think it will be
115	21:19:55 <James_F> TimStarling: Yes.
116	21:20:04 <brion> TimStarling: do you envision changing text table instead?
117	21:20:07 <jynus> b) the transactions that use them are small
118	21:20:20 <DanielK_WMDE_> ok... can I assume that there is still consensus on representing content model and format as integers, and have a mappoing in the db and in memory? This was approved last year with legoktm's original proposal.
119	21:20:28 <jynus> this is problematic, for example, for tables such as revision or commons image
120	21:20:37 <brion> DanielK_WMDE_: yes I've been assuming that holds :)
121	21:20:44 <anomie> TimStarling: Well, the content table doesn't allow storing the revision data directly. It's either revision→content→text, or revision→content→(external something)
122	21:20:54 <jynus> this is the only thing that I can say that may be relatively helpful, the rest is for you to decide
123	21:20:57 <TimStarling> brion: no
124	21:21:09 <brion> ok
125	21:21:32 <gwicke> DanielK_WMDE_: I'm still wishing we had performance data to back up the assertion that this is the most efficient way to improve performance, and is worth the hassle of managing ids manually
126	21:21:35 <TimStarling> anomie: right, so you still need the text table, but do you keep text_flags etc.?
127	21:21:57 <DanielK_WMDE_> gwicke: nobody is suggesting to manage them manually.
128	21:22:04 <TimStarling> does ExternalStore continue to work only with text rows, or can it also work with content rows?
129	21:22:18 <DanielK_WMDE_> gwicke: if no mapping is found in the in-memory mapping, you assign a new one from an auto-increment field. done.
130	21:22:19 <brion> gwicke: so the alternatives are probably enums or another join to a table of mappings. They're all roughly equivalent theoretically but may perform different dunno
131	21:22:26 <gwicke> DanielKWMDE: previous suggestion was to set them up in the config like namespaces
132	21:22:44 <jynus> say no to enums on a table as large as revision
133	21:22:50 <jynus> ok in other cases
134	21:23:12 <DanielK_WMDE_> gwicke: that's not how i understand the approved rfc. let me check again...
135	21:23:14 <brion> jynus: I always hear enums are cheap to change. Lies? :)
136	21:23:20 <robla> #info Discussion of DanielK_WMDE_'s question: "can I assume that there is still consensus on representing content model and format as integers, and have a mappoing in the db and in memory?"
137	21:23:23 <anomie> TimStarling: Is text_flags used for anything when text.old_text actually contains the content instead of an external-store address? If so, then it would still be used for that purpose when the text table is used at all. When external-store is used for everything, the address would be in cont_address and the text table shouldn't be needed at all.
138	21:23:24 <jynus> they are cheap to add, brion
139	21:23:24 <gwicke> so compression is definitely not achieving similar savings?
140	21:23:31 <jynus> (items)
141	21:23:53 <brion> Heh
142	21:23:54 <jynus> but if you want to delete, it would be one of our most complex changes
143	21:23:57 <anomie> (although "external store" is now named "blob store", I think)
144	21:23:58 <brion> Yikes
145	21:24:17 <legoktm> gwicke: well, yeah, storing them in config like namespaces would be more performant. but then you end up with a wikipage where extensions write down the ids their content models use, and you just cross your fingers and hope you don't conflict with anyone. Anyways, we discussed and rejected that last year....
146	21:24:25 <TimStarling> anomie: 1. yes, compression and legacy charset mapping 2. good question
147	21:24:54 <DanielK_WMDE_> gwicke: hm, the old rfc doesn't really say. but it would be easy to do, i already wrote some code for this. no need for manually managing ids.
148	21:25:07 <gwicke> DanielKWMDE, legoktm: okay, so it would be stored in the db, but some background task would stash the db data into some cache & update that when needed?
149	21:25:35 <jynus> so, I said this many times: do not fear joins just for the sake of an extra table (but I do not care, config/table/whatever)
150	21:25:35 <DanielK_WMDE_> gwicke: no background task. on a cache miss, check the db. if the db doesn't have it, add it. done.
151	21:25:40 <brion> Why join when you can put a blob in memcached :)
152	21:25:58 <legoktm> what DanielK_WMDE_ said.
153	21:25:59 <brion> Quite effective for small sets like this yeah
154	21:26:01 <gwicke> DanielK_WMDE_: that's "when needed" ;)
155	21:26:14 <robla> #info <brion> jynus: I always hear enums are cheap to change. Lies? :) <jynus> they are cheap to add [...] but if you want to delete, it would be one of our most complex changes
156	21:26:15 <gwicke> but yeah, that sounds doable
157	21:26:26 <DanielK_WMDE_> #info re managing ids for content models etc: on a cache miss, check the db. if the db doesn't have it, add it.
158	21:27:14 <jynus> for example, 2 selects, I guarantee you will be slower than 1 single query with a join
159	21:27:20 <brion> I think people working with raw db replicas would appreciate having the mapping in a db table even if it's not used for joins in production
160	21:27:38 <jynus> but again, not agains memcache/config/whatever
161	21:27:52 <DanielK_WMDE_> brion: also, you need to persist the mapping, in case you lose the cache
162	21:27:54 <DanielK_WMDE_> so... tentative agreement to content model and format as int? can we rule out option (c) then, and look at (a) vs (b)?
163	21:27:58 <brion> Yeah
164	21:28:09 <brion> Definitely
165	21:28:12 <ori> well, ...
166	21:28:15 <ori> (just kidding.)
167	21:28:17 <brion> Lol
168	21:28:20 <DanielK_WMDE_> heh :P
169	21:28:38 <DanielK_WMDE_> #info tentative agreement to content model and format as int; we rule out option (c) then
170	21:28:39 <jynus> which link has a b and c, sorry?
171	21:28:55 <DanielK_WMDE_> jynus: <robla> #info primary question to resolve: do a) legoktm 's original T105652 b) DanielK_WMDE_ 's modification T142980 c) none of the above (stay with status quo)
172	21:29:03 <gwicke> at this point I'm assuming that compression was tested & found to not be competitive
173	21:29:04 <TimStarling> if you remember, the use of strings in the first place was quite a fraught compromise
174	21:29:17 <legoktm> Yes, let's rule out c)
175	21:29:19 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652
176	21:29:23 <TimStarling> I asked for integers from the outset, and Daniel bluntly refused
177	21:29:36 <TimStarling> so obviously yes, I am still in favour of using integers
178	21:29:39 <stashbot> T142980: RFC: Create a content meta-data table - https://phabricator.wikimedia.org/T142980
179	21:29:45 <jynus> I think a) -> b) is easy to do, why do we want to do b directly (genune question)
180	21:29:59 <DanielK_WMDE_> TimStarling: o_O i actually had that implemented, and changed it upon request... from... uh... i don't recall.
181	21:30:12 <DanielK_WMDE_> TimStarling: my implementation did expose the in ids though. not as nice as we are planning for now
182	21:30:18 <robla> #info <jynus> I think a) -> b) is easy to do, why do we want to do b directly (genune question)
183	21:30:19 <jynus> may main concern is to get too blocked on a relatively complex feature
184	21:30:21 <TimStarling> yeah, you started with integers, switched to strings, then came to me and I said "use integers"
185	21:31:04 <TimStarling> and you said it was some reviewer who told you to use strings, and I said go back to them and tell them I said use integers
186	21:31:09 <anomie> jynus: To avoid having to change the primary key on the table from (cont_revision) to (cont_revision,cont_role) once the table has millions of rows. You'd probably be in the best position to say whether that'd be a big deal or not.
187	21:31:26 <TimStarling> and you said you didn't really like integers in the first place
188	21:31:28 <jynus> are we talking about the small table?
189	21:31:37 <DanielK_WMDE_> TimStarling: i kind of remember it the other way... but whatever. no need to fight about that now. let's get rid of them.
190	21:31:43 <jynus> the one that contains formats?
191	21:31:50 <brion> Ok so something like b will be needed later for multi content, but not strictly yet. If transitions from current to a, and a to b-prime are easy, then there's not a huge need to start with b
192	21:31:55 <DanielK_WMDE_> yea, i didn't like them to be exposed...
193	21:32:07 <anomie> jynus: Oh, wait. Actually, (a) is putting rev_format_id and rev_model_id into revision, while (b) is making the content table.
194	21:32:38 <jynus> then I vote for d
195	21:32:51 * anomie got confused between (a) and (b)'s basic versus medium versions
196	21:33:01 <DanielK_WMDE_> jynus: what's (d)?
197	21:33:22 <jynus> stick to the original plan, then introduce the slot stuff afterwards
198	21:33:52 <jynus> I do not think a and be are exclusive?
199	21:33:58 <DanielK_WMDE_> jynus: a) -> b) involved assing columns to page, revision, and archive, and then removing the same columns again later. i assumed that's not something you like.
200	21:34:09 <anomie> jynus: So that's adding rev_format_id/rev_model_id (and similar to other tables) now, then later on make the content table and migrate to that?
201	21:34:11 <DanielK_WMDE_> it's defintly disruptive for tools on labs, and for extensions
202	21:34:27 <jynus> WHAT?
203	21:34:46 <jynus> ah, the extension thing could be a valid reason
204	21:35:04 <legoktm> er, why is it disruptive for extensions?
205	21:35:05 <jynus> finally I get a good answer to my original question
206	21:35:20 <jynus> I do not know if it is valid, the rest can tell me
207	21:35:23 <DanielK_WMDE_> it also means messing with the same code again, once changing it to use a sifferent column, then changing it to use a different table
208	21:35:55 <DanielK_WMDE_> legoktm: for extensions that look at the database directly. rare, i agree. not rare for tools on labs.
209	21:35:57 <jynus> wait, but the original change doesn't involve a schema change, right?
210	21:36:01 <DanielK_WMDE_> actually, it's the reason we have labs
211	21:36:07 <jynus> just a new table?
212	21:36:31 <DanielK_WMDE_> jynus: the original change calls for columns to be added to page, revision, and archive. two columns each.
213	21:36:37 <DanielK_WMDE_> to the largest tables we have
214	21:36:40 <legoktm> DanielK_WMDE_: when I grepped a year ago, the only places that queried those columns directly was in core
215	21:36:54 <DanielK_WMDE_> jynus: which would then be removed again, when we have the content table.
216	21:37:12 <jynus> weren't we going to reuse the existing columns and redefine its meaning?
217	21:37:16 <DanielK_WMDE_> legoktm: good to know, yea. extensions probably wouldn't
218	21:37:20 <DanielK_WMDE_> jynus: no.
219	21:37:25 <SMalyshev> if we plan MCRs anytime soon, I think doing work that will have to be redone with MCRs now is not smart
220	21:37:34 <DanielK_WMDE_> jynus: that would be even more disruptive
221	21:37:45 <James_F> SMalyshev: We do.
222	21:37:45 <jynus> but I will do that work on production (?)
223	21:37:55 <jynus> and labs
224	21:38:12 <jynus> obiously you are the code magicians
225	21:38:35 <DanielK_WMDE_> jynus: the problem with labs is not applying the schema change. the problem is braking the tools that use the schema.
226	21:38:51 <DanielK_WMDE_> more changes -> more breakage
227	21:39:00 <brion> SMalyshev: I think transition work would be similar difficulty in either case on the mw internals
228	21:39:05 <jynus> I call you this now, labs is not an issue
229	21:39:14 <jynus> and MCR will break it anyway
230	21:39:19 <DanielK_WMDE_> James_F: we do what?
231	21:39:29 <legoktm> DanielK_WMDE_: I don't think we've ever promised database stability for labs. and I don't think we should worry about it tbh.
232	21:39:35 <legoktm> database schema stability*
233	21:39:36 <jynus> +1
234	21:39:54 <jynus> labs is too broken anyway (do not quote me on that)
235	21:39:56 <SMalyshev> brion: can't we make a model now that will make it easier? I mean sure we'll have to do work, but we could prepare for it
236	21:40:02 <jynus> (it is all DBA's fault)
237	21:40:25 <DanielK_WMDE_> i'd like incremental steps, instead of doing one step, then undoing half of it again to do the second step.
238	21:40:34 <jynus> I think the important question is
239	21:40:36 <anomie> jynus: All the options add two small tables to map int->string for model and for format. Then option (a) adds new int columns to several tables, populates them based on the existing string columns, then drops the string columns. Option (b) puts the new int columns into a "content" table with a FK back to the revision/archive tables, populates it from the existing string columns, then drops the string columns. Eventually we'll need to do option
240	21:40:36 <anomie> (b) for the multi-content revisions anyway.
241	21:40:52 <jynus> independetly of what could be done
242	21:41:06 <jynus> who is willing to work on this?
243	21:41:11 <jynus> on both cases
244	21:41:13 <jynus> ?
245	21:41:35 <jynus> or on either case, I mean
246	21:41:41 <DanielK_WMDE_> me for option (b). not sure i can justify spending time on option (a), but i might.
247	21:41:48 <SMalyshev> willing as in "interested" or "has time" or 1&2? ;)
248	21:41:51 <DanielK_WMDE_> i can definitly help with reviewing in either case
249	21:42:00 <brion> :)
250	21:42:01 <jynus> (note that I will do what devs told me, no matter what, infrastructure wise)
251	21:42:10 <anomie> There's a side question as to whether option (b)'s primary key should start out as just (cont_revision), or if we should make it (cont_revision,cont_role) right away and just let cont_role always be 1 until we make the infrastructure for different values.
252	21:42:28 <brion> I'll pitch in if needed either way, but I've got other projects backing up :)
253	21:42:36 <legoktm> I am volunteering to do the implementation for (a), and can help review (b)/whatever
254	21:42:49 <DanielK_WMDE_> anomie: actually, i updated that - (cont_revision,cont_role) would be a unique key, we'd add an auto-increment field as a primary, for later use
255	21:43:24 <DanielK_WMDE_> legoktm: would you also be willing to do the bit that is the same for (a) and (b), namely the actual mapping stuff?
256	21:43:25 <anomie> DanielK_WMDE_: Any particular reason to add an id field instead of a two-int PK?
257	21:44:06 <DanielK_WMDE_> anomie: yes, we can re-use content rows for multiple revisions that way. some slots will update only rarely. it's another bit of normalization.
258	21:44:10 <jynus> give your thoughts, I do not have much to add, I personally incline towards small incremental changes if it was possible, but I cannot fairly enter here if I am not going to work on the code
259	21:44:15 <DanielK_WMDE_> anomie: we can also add that later, but the table is big.
260	21:44:22 <brion> So as long as we have realistic transition plan, and there's no major db-related reason to prefer one or the other, my main concern is we get the updates done.
261	21:44:49 <brion> If we make the content table in a way that we don't have to change it when adding multi content then that is a certain niceness
262	21:44:58 <DanielK_WMDE_> jynus: i'm with you regarding incremental changes. i just feel this isn't incremental, but two forward, one back, two forward... that's what i'm trying to avoid
263	21:45:05 <legoktm> DanielK_WMDE_: yeah.
264	21:45:08 <anomie> DanielK_WMDE_: How would that (re-using content rows) work?
265	21:45:10 <TimStarling> small changes would be fine except that DanielK_WMDE_ is working on this because he is interested in MCR
266	21:45:34 <TimStarling> putting the MCR fields in the table from the outset would better respect that motivation
267	21:45:35 <jynus> TimStarling, but he just said he wouldn't help on b, only lego would on a
268	21:45:39 <DanielK_WMDE_> anomie: another table, relating revisions to content. it's an option for later though.
269	21:46:02 <jynus> sorry, I got confused, but I hope I got understood
270	21:46:20 <DanielK_WMDE_> jynus: i would help with a, but i probably can't drive it.
271	21:46:22 <TimStarling> <DanielK_WMDE_> me for option (b). not sure i can justify spending time on option (a), but i might.
272	21:46:25 <brion> legoktm: any concerns on later difficulties if we do two steps? (A, then later b when we need multi)?
273	21:46:34 <anomie> DanielK_WMDE_: All problems can be solved by adding another layer of indirection? ;)
274	21:46:38 <TimStarling> Daniel will work on option b because it is a step towards MCR, which is fair enough
275	21:46:50 <jynus> DanielK_WMDE_, but does b include all a functionality, would you work on a's functionality on b=
276	21:46:51 <James_F> It's the Java way™! Oh, wait. ;-)
277	21:46:51 <legoktm> brion: not off the top of my head
278	21:46:54 <DanielK_WMDE_> anomie: and then you have a problem pointer...
279	21:46:57 <brion> Ok
280	21:47:29 <DanielK_WMDE_> jynus: i would, but i'd be greatful if legoktm would help
281	21:47:41 <jynus> ok, what does lego have to say about that?
282	21:47:44 * robla plans to move this RFC (T105652) to the "in progress" column on the ArchCom-RFC board at the conclusion of this meeting, pointing to this meeting (E261)
283	21:47:51 * anomie wonders whether the complexity of revision→revision_content_mapping→content would be worth the savings in not duplicating cont_address and other fields.
284	21:48:12 <DanielK_WMDE_> anomie: yes, i wonder about that too. it's an option, not a plan
285	21:48:31 <legoktm> me helping in b)? I think so yeah
286	21:48:38 <DanielK_WMDE_> jynus: <legoktm> I am volunteering to do the implementation for (a), and can help review (b)/whatever
287	21:48:48 <DanielK_WMDE_> \o/
288	21:48:53 <brion> Woo
289	21:48:58 <robla> it seems to me there is very cautious consensus around option "b", with some skepticism about who will do the work
290	21:49:01 <jynus> ok, we still have anomies issue
291	21:49:06 <jynus> what about that
292	21:49:17 <DanielK_WMDE_> robla: the tricky part is the migration code
293	21:49:21 <jynus> can the scope be reduced to comtemplate that?
294	21:49:24 <DanielK_WMDE_> the rest should be pretty easy
295	21:49:43 <DanielK_WMDE_> jynus: which issue?
296	21:49:55 <jynus> anomie wonders whether the complexity of revision→revision_content_mapping→content would be worth the savings in not duplicating cont_address and other fields.
297	21:49:59 <robla> DanielK_WMDE_: that's why I'm proposing "in progress" as a state, rather than "approved"
298	21:50:41 <DanielK_WMDE_> jynus: the current proposal is revision -> content. revision -> r-c-mapping -> content is a possibility for later
299	21:50:45 <DanielK_WMDE_> we are not committing to that
300	21:50:48 <TimStarling> I don't know what revision_content_mapping is
301	21:51:25 <DanielK_WMDE_> TimStarling: a way to re-use content entries for multiple revisions. outside the scope of this rfc.
302	21:51:40 <anomie> For the record, I'm sure I'll do some stuff on this, although whether that stuff involves writing code or just reviewing it I don't know at this point. Too many people all writing code can get in each others' way.
303	21:51:43 <DanielK_WMDE_> it might be nice, or it might be horrible to go that way, not sure yet.
304	21:51:48 <TimStarling> ah right
305	21:51:53 <DanielK_WMDE_> anomie: thanks!
306	21:52:03 <jynus> ok, we need more "buts", anyone?
307	21:52:12 <TimStarling> well, the existing rev_text_id allows that, it is used for null revisions
308	21:52:19 <DanielK_WMDE_> we have not discussed whether the new table should just have the minimum fields for now, or the full set needed for MCR
309	21:52:34 <DanielK_WMDE_> but I guess I can sort that out with jynus later. or we have another session on that
310	21:52:46 <jynus> adding new columns on a small table with low traffic is easy
311	21:53:16 <TimStarling> it will be a small table? with left joins?
312	21:53:18 <DanielK_WMDE_> TimStarling: that allows the re-used of blobs for multiple content entries. not quite the same. but yea - re-using content meta-data may not be wirth the trouble.
313	21:53:30 <jynus> (it is a bit of a simplificatin, go to https://wikitech.wikimedia.org/wiki/Schema_changes for the full version
314	21:53:39 <DanielK_WMDE_> jynus: the content table is going to be LARGE!
315	21:53:42 <anomie> jynus: What about on a table with individually small rows, but with as many rows as the revision table?
316	21:53:42 <robla> #info <DanielK_WMDE_> we have not discussed whether the new table should just have the minimum fields for now, or the full set needed for MCR <jynus> adding new columns on a small table with low traffic is easy
317	21:53:43 <brion> It I'll have lots of entries but small rows
318	21:53:45 <TimStarling> for MCR it presumably needs to be fully populated
319	21:53:47 <DanielK_WMDE_> jynus: larger than revision. that's the point!
320	21:54:02 <TimStarling> small per row though
321	21:54:07 <DanielK_WMDE_> yes, true
322	21:54:10 <jynus> shouldn't we have visible code at that point?
323	21:54:17 <anomie> And as active as the revision table.
324	21:54:44 <DanielK_WMDE_> jynus: for the database? sure, there's a patch on gerrit, and i send you sample db dumps. they are not exactly like the proposal, but quite close.
325	21:54:53 <jynus> no, no
326	21:54:56 <jynus> mediawiki code
327	21:55:11 <Scott_WUaS> Great All - thanks!
328	21:55:34 <DanielK_WMDE_> jynus: i have been working on that with brion. lots of moving parts. it's a bit of a hen-and-egg thing
329	21:55:41 <jynus> we will deploy whatever we have, I suppose, and whatever is easy to migrate?
330	21:55:41 <brion> :)
331	21:55:41 <DanielK_WMDE_> you can't write the code before the db schema is final.
332	21:55:55 <jynus> the migration part is the key
333	21:55:57 <DanielK_WMDE_> you can't decide on a schema if it's not clear how it will impact the php code
334	21:56:32 <jynus> I am all for converting tables rather than migrating them (I think we discussed that on the image issue)
335	21:56:34 <DanielK_WMDE_> jynus: yes, i agree. that'S the tricky part. luckily, i have some experience with that, but i will be needing your help.
336	21:57:04 <jynus> I do not think we can give a decision with a plan(?)
337	21:57:12 <jynus> *without
338	21:57:38 <DanielK_WMDE_> jynus: well you can't make a plan when there is no decision on the goal :)
339	21:58:07 <TimStarling> can I just repeat that I am putting my 2c in for MCR fields in the initial content table, with slot=1 always
340	21:58:17 <jynus> I am not blocking that, I am asking what is the general mood, I said I will not take part on this decision, db is not a blocker here
341	21:58:17 <DanielK_WMDE_> jynus: writing complete migration code before it's even clear whether the migration is wanted is not a thing i like to do
342	21:58:20 <TimStarling> because I think that will help motivate Daniel to actually write MCR
343	21:58:31 <brion> :))))
344	21:58:42 <DanielK_WMDE_> TimStarling is a smart guy :)
345	21:59:05 * anomie likes TimStarling's reasoning
346	21:59:12 <robla> #info 14:58:08 <TimStarling> can I just repeat that I am putting my 2c in for MCR fields in the initial content table, with slot=1 always
347	21:59:13 <Scott_WUaS> Aye :)
348	21:59:16 <jynus> decision on the last question, not on the whole issue
349	21:59:26 <DanielK_WMDE_> so, i'll work on finalizing the schema, and propose migration code
350	21:59:32 <jynus> (I was only talking about the last questio, I cannot answer that)
351	21:59:34 <DanielK_WMDE_> then we talk again, where or on wikitech-l
352	21:59:54 <jynus> because I genuinly do not know
353	22:00:30 <DanielK_WMDE_> sorry, which last question?
354	22:00:43 <jynus> the one about the extra columns
355	22:00:53 <robla> ok, I think this was a good meeting; we'll keep using T105652 and wikitech-l for followup. sound good?
356	22:00:58 <stashbot> T105652: RfC: Content model storage - https://phabricator.wikimedia.org/T105652
357	22:01:05 <brion> Yay
358	22:01:05 <DanielK_WMDE_> robla: yep :)
359	22:01:23 <DanielK_WMDE_> jynus: yea... i think i'll shoot for the "medium" proposal for now.
360	22:01:32 <jynus> so can we deploy the schema change soon?
361	22:02:05 <DanielK_WMDE_> jynus: can you confirm that adding fields to a table that is much like the revision table isn't a problem?
362	22:02:19 <robla> jynus: if/when you understand and agree; we don't need to try to force that now
363	22:02:34 <jynus> I cannot confirm it will not be a problem
364	22:02:43 <jynus> schema changes on revision are hard
365	22:02:53 <anomie> DanielK_WMDE_: Can we qualify that? revision has large rows and many rows. The new table will have small(er) rows, but still many.
366	22:02:55 <robla> I think we've settled that it's worth DanielK_WMDE_ to keep working on a proposal and some code/etc
367	22:03:09 <DanielK_WMDE_> jynus: ok. the content table will have roughly the same dimensions as revision. that's why i want to add all fields right away, instead of incrementing.
368	22:03:11 <robla> (with legoktm , et al)
369	22:03:15 <DanielK_WMDE_> let's discuss that on the listz
370	22:03:26 * robla will hit #endmeeting in 60 seconds
371	22:03:39 <anomie> (it might be easier to answer that question with a straw schema)
372	22:03:54 <robla> feel free to continue discussion on #wikimedia-tech
373	22:04:24 <robla> thanks everyone!
374	22:04:26 <robla> o/
375	22:04:31 <robla> #endmeeting

Other meetings

Architecture meetings
13:00 PT ArchCom Planning Meetings	upcoming	all since 2016-03-30
14:00 PT ArchCom-RFC Meetings	upcoming	all since 2015-09-09

Recurring Event
Next
Previous

Event Series: This event is an instance of E66: ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office), and repeats every week.

Details

Invitees

TechCom
• RobLa-WMF
JanZerebecki
MarkAHershberger
daniel

Event Timeline

daniel renamed this event from ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting W33: content model storage (2016-08-17, #wikimedia-office).Aug 13 2016, 7:35 PM

daniel invited: ; uninvited: .

daniel updated the event description. (Show Details)

daniel updated the event description. (Show Details)Aug 15 2016, 11:58 AM

• RobLa-WMF mentioned this in T69223: Schema change for page content language.Aug 17 2016, 7:14 PM

Scott_WUaS changed the event icon from fa-history to Pet Activity.Aug 17 2016, 9:03 PM

Scott_WUaS updated the event description. (Show Details)

Scott_WUaS subscribed.

15:04:32 <wm-labs-meetbot`> Meeting ended Wed Aug 17 22:04:31 2016 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
15:04:32 <wm-labs-meetbot`> Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-17-21.03.html
15:04:32 <wm-labs-meetbot`> Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-17-21.03.txt
15:04:32 <wm-labs-meetbot`> Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-17-21.03.wiki
15:04:32 <wm-labs-meetbot`> Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-08-17-21.03.log.html

• RobLa-WMF mentioned this in P3846 ArchCom-RFC-2016W33-irc-E261.txt.Aug 17 2016, 10:08 PM

• RobLa-WMF updated the event description. (Show Details)Aug 17 2016, 10:28 PM

• RobLa-WMF updated the event description. (Show Details)Aug 17 2016, 10:30 PM

• RobLa-WMF mentioned this in T105652: RfC: Content model storage.Aug 17 2016, 11:19 PM

daniel mentioned this in T142980: RFC: Create a content meta-data table.Aug 18 2016, 5:19 PM

• RobLa-WMF updated the event description. (Show Details)Aug 19 2016, 3:14 PM

• RobLa-WMF mentioned this in T146749: Consider using Phabricator Calendar events to schedule Wikimedia Developer Summit sessions.Oct 6 2016, 5:57 AM

daniel renamed this event from ArchCom RFC Meeting W33: content model storage (2016-08-17, #wikimedia-office) to ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office).Nov 21 2016, 6:11 PM

daniel changed the host of this event from • RobLa-WMF to daniel.

daniel uninvited: • brooke, Legoktm, jcrespo.

daniel updated the event description. (Show Details)

daniel changed the event icon from Pet Activity to fa-history.

daniel updated the event description. (Show Details)Dec 9 2016, 7:41 AM

daniel renamed this event from ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting W33: content model storage (2016-08-17, #wikimedia-office).

LucasWerkmeister mentioned this in R2494:12b552d98284: Don’t ignore Flake8 E261 rule.Feb 20 2020, 11:34 PM

Restricted Repository Identity mentioned this in rLTSF734dcc84888e: Merge "tox: ignore E261".Apr 2 2021, 4:59 PM

Urbanecm mentioned this in rLTSFe65962987b58: tox: ignore E261.Apr 2 2021, 4:59 PM

LucasWerkmeister mentioned this in R2469:6129b8754910: Fix flake8 E261.Oct 29 2022, 6:21 PM

ArchCom RFC Meeting W33: content model storage (2016-08-17, #wikimedia-office)ActivePublicActions