ArchCom RFC Meeting: Technical aspects of Data namespace blob storage (2016-06-15, #wikimedia-office)

Hosted by daniel on Jun 15 2016, 9:00 PM - 10:00 PM.



Recurring Event

Event Series
This event is an instance of E66: ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office), and repeats every week.

Event Timeline

@Yurik recently posted to wikitech-l:

We have had some good feedback for the new shared tabular data feature, and we are getting ready to deploy it in production. It would be amazing if you can give it a final look-over to see if there are any blockers left.

The first stage will be to enable Data:*.tab pages on Commons, and allow all other wikis direct access to it via Lua code and Graph extension. All data at this point must be licensed under CC0. More licensing options are still under discussion, and can be easily added later.

In line with the "release early, release often", we will not have any elaborate data editing interface beyond the raw JSON code editor for the first release. Our initial target audience is the more experienced users who will evaluate and test the new technology. Once the underlying tech is stable and proven, we will work on making it more accessible to the general audience.

This was followed by a series of links, including T134426, http://data.wmflabs.org, mw:Extension:JsonConfig/Tabular, and a Commons discussion.

The subsequent discussion made clear these are (some of?) the relevant tasks:

...and that there are many other tasks in Commons-Datasets. TechCom agreed this would be a good topic for this week's RFC meeting, if @Yurik is available.

I forgot to mention T124569 as an important related RFC, which is what TechCom used as the discussion task in E202 last week. We tentatively agreed to have T124569 as the discussion topic for this week's meeting.

RobLa-WMF renamed this event from ArchCom RFC Meeting: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting: Technical aspects of Data namespace blob storage (2016-06-15, #wikimedia-office). Jun 13 2016, 6:07 AM
RobLa-WMF updated the event description.
RobLa-WMF updated the event description. Jun 13 2016, 6:10 AM

#wikimedia-office: T120452 Technical aspects of Data namespace blob storage

Meeting started by robla at 21:02:08 UTC (full logs).

Meeting summary

Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ (robla, 21:02:35)
https://phabricator.wikimedia.org/T114454 (cscott, 21:10:28)
much of the first half of the discussion was about defining datatypes (robla, 21:30:21)
DanielK_WMDE and yurik agree to try to avoid naming conflicts (e.g. with wikibase types) (robla, 21:32:11)
yurik agrees to create a Phab task for use as an ArchCom-RFC (robla, 21:47:13)
https://phabricator.wikimedia.org/T137929 (yurik, 22:02:52)
conversation will continue at https://phabricator.wikimedia.org/T137929 (robla, 22:03:22)
https://phabricator.wikimedia.org/T137930 (yurik, 22:03:24)
https://phabricator.wikimedia.org/T137930 geojson (robla, 22:03:56)

Meeting ended at 22:05:08 UTC (full logs).

Action items


People present (lines said)

yurik (125)
DanielK_WMDE (66)
cscott (51)
robla (35)
SMalyshev (29)
brion (28)
aude (23)
stashbot (9)
TimStarling (4)
wm-labs-meetbot` (4)
MaxSem (2)
YairRand_ (1)
Krinkle (0)

Generated by MeetBot 0.1.4.


121:02:08 <robla> #startmeeting T120452 Technical aspects of Data namespace blob storage
221:02:08 <wm-labs-meetbot`> Meeting started Wed Jun 15 21:02:08 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
321:02:08 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
421:02:08 <wm-labs-meetbot`> The meeting name has been set to 't120452_technical_aspects_of_data_namespace_blob_storage'
521:02:08 <stashbot> T120452: Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML) - https://phabricator.wikimedia.org/T120452
621:02:09 <yurik> hi everyone, thx for making it to the discussion that shall change the face of the earth... again
721:02:35 <robla> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
821:02:45 <cscott> let's change all the things!
921:02:58 <yurik> cscott, oh no, not again!
1021:04:12 <SMalyshev> question: so this thing is using content handler from JsonConfig?
1121:04:38 <robla> so, Yuri, things ArchCom just discussed in our last meeting (E212) is "is this a question for ArchCom?" and "should this be a formal RFC?"
1221:04:56 <yurik> SMalyshev, correct
1321:05:34 <DanielK_WMDE> I propose that we don't discuss "on which domain shall it live" today. that's a community/product question which should be discussed elsewhere, i think
1421:05:53 <yurik> robla, there are clearly two topics, just as brion said. The social aspect - should this be on commons - should probably be discussed with the community, and so far the majority of commons seems to be in favour. What I do want this meeting to address is the technical aspects
1521:06:00 <DanielK_WMDE> that is to say, that's the part of this that i think is not a question for archcom
1621:06:21 <yurik> ok, seems like we are in an agreement on that one
1721:06:21 <SMalyshev> yurik: ok. It may be worth it to make it more search-friendly, but I guess that is generic for all JsonConfig things then
1821:06:49 <yurik> SMalyshev, sure, search friendliness is definitely on the todo list
1921:06:51 <DanielK_WMDE> SMalyshev: how well do you think will the structured search interface you are working on work with tabular data?
2021:07:20 <DanielK_WMDE> would it be possible to search for values in specific columns, for example? if the content handler does it right, i mean
2121:07:23 <SMalyshev> DanielK_WMDE: it might, but I wouldn't go that far for starters. I'd just start with something like being able to search the description of the dataset
2221:07:28 <cscott> (i personally like the idea of making a namespace available from every wiki, then letting social pressures move stuff around as necessary. eg if i want to index the winners of the US "Dancing with the Stars" maybe that live on enwiki in Data:DancingWithTheStars.json, since that's US-specific and there's a different "dancing with the stars" in basically every country.)
2321:07:36 <SMalyshev> DanielK_WMDE: search inside the data is much bigger fish to fry
2421:08:12 <cscott> hm. i agree that search can be done later.
2521:08:20 <DanielK_WMDE> it actually shouldn't be much harder than making the description searchable. but yea, let's start small
2621:08:21 <SMalyshev> yurik: my next question: is it attached to specific namespace or can be on any namespace?
2721:08:27 <cscott> like search of wiki articles is essentially separate from storage/viewing those articles
2821:08:31 <brion> of course we'll have articles on several different countries' shows in each language...
2921:08:49 <SMalyshev> cscott: not anymore :)
3021:09:06 <cscott> brion: right, at that point it starts to make sense to move it to commons. but the community can decide this themselves organically.
3121:09:12 <SMalyshev> cscott: at least not completely. I'm working on a patch that makes ContentHandler handle both
3221:09:14 <brion> hmmm, interesting
3321:09:35 <cscott> SMalyshev: "not anymore" re search? not sure what you were responding to.
3421:09:36 <DanielK_WMDE> SMalyshev: i suggest to rely on the content model, not the namespace, for any special processing. currently, content models are mostly bound to namespaces, but that can change
3521:09:38 <brion> cscott: so put the system on *every* wiki that doesn't opt-out, plus a common backing, like File: uploads?
3621:09:50 <SMalyshev> cscott: yes, re search. But let's not get too offtopic :)
3721:09:58 <cscott> brion: that would be my suggestion. it also ensures we're running the same software everywhere.
3821:10:12 <cscott> i have a proposal for a new template system that could really use data blob storage. for example.
3921:10:13 <SMalyshev> DanielK_WMDE: yes, I agree. But how you specify that you want to create page in certain content model?
4021:10:19 <brion> *nod* sensible :D
4121:10:27 <yurik> cscott, i am not too happy about allowing it everywhere. We already got burnt by it waaay too many times - i would much rather keep it in one place, allow multiple wikis to reuse it, and name it accordingly - "dancing with the stars US 2015"
4221:10:28 <cscott> https://phabricator.wikimedia.org/T114454
4321:10:50 <DanielK_WMDE> SMalyshev: currently, via the namespace or a title suffix (like .css or .js)
4421:10:56 <cscott> yurik: sure, default should be commons, but that should be social pressure not technical restriction.
4521:11:02 <SMalyshev> DanielK_WMDE: ok, got it
4621:11:11 <SMalyshev> sounds reasonable
4721:11:17 <DanielK_WMDE> SMalyshev: but that's just at the time of creation. after that, we should just look at the model associated with the page
4821:11:26 <robla> #chair DanielK_WMDE brion robla TimStarling Krinkle
4921:11:26 <wm-labs-meetbot`> Current chairs: DanielK_WMDE Krinkle TimStarling brion robla
5021:11:32 <DanielK_WMDE> it's actually stored separately for every revision
5121:11:41 <yurik> cscott, i would wait until community explicitly demands that feature. It is much easier to enable than to disable
5221:11:55 <yurik> we can always enable it later if needed, but disabling is next to impossible
5321:12:05 <yurik> so for the step 1 i would really like just 1 wiki
5421:12:09 <DanielK_WMDE> yurik: T120452 says "CSV, TSV, JSON, XML". But we are down to just a specific JSON format now, right?
5521:12:10 <stashbot> T120452: Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML) - https://phabricator.wikimedia.org/T120452
5621:12:11 <yurik> and wait for the feedback
5721:12:18 <SMalyshev> only thing is that now we are using model that's called JsonConfig for some things that are definitely not config... but I guess it's too late to rename...
5821:12:30 <yurik> DanielK_WMDE, my first goal is to get two types of data out : tabular and maps
5921:12:41 <cscott> yurik: will https://phabricator.wikimedia.org/T91162 let me refer to the data store on commons as if it were local? (like instantcommons does)?
6021:12:50 <yurik> maps would allow geojson storage (so that all maps can overlay with extra stuff), and tabular - .... is tabular :)
6121:13:00 <DanielK_WMDE> yurik: maps as in geo-shapes?
6221:13:05 <yurik> DanielK_WMDE, yes
6321:13:06 <DanielK_WMDE> or full maps?
6421:13:08 * robla will brb
6521:13:08 <cscott> yurik: one issue might be (for example) country-specific data, which then gets hung up on different countries/wiki's ideas of which are valid countries
6621:13:09 <yurik> and pushpins
6721:13:12 <cscott> ie, is "taiwan" a country?
6821:13:37 <cscott> yurik: allowing zhwiki to override the table from commons with a zhwiki-specific table is a valuable way to defuse that situation
6921:13:50 <cscott> and i think wikidata already has something like this, where certain facts are only true for certain wikis?
7021:13:59 <DanielK_WMDE> cscott: they can just choose to use a different table. much simpler solution.
7121:14:13 <cscott> DanielK_WMDE: how does wikidata handle this?
7221:14:15 <yurik> cscott, no, not via shadow. The current implementation will only target Lua and Graph users at first, which means Lua will simply say mw.data.get('Page.tab') - and use that data
7321:14:23 <SMalyshev> yurik: I wonder if we need a higher level API to operate with such data. I.e. if I want to store a tabular data set, I don't want to need to know specific JSON scheme (which could also change)...
7421:14:23 <DanielK_WMDE> but i think the "where does it live" question is out of scope here.
7521:14:28 <yurik> we can add shadow later if requested
7621:15:10 <SMalyshev> yurik: and by operate I mean not just read (Lua probably covers that) but also write
7721:15:15 <TimStarling> presumably Lua will see the decoded object, not the JSON-encoded format?
7821:15:22 <cscott> i'm fine with deploying first without instantcommons/shadow and on a single wiki, but i'd like to state for the record that, if this functionality turns out to be useful, we'll eventually need that functionality. we should ensure that we're not *foreclosing* that possibility, even if we're not initially enabling it.
7921:15:29 <DanielK_WMDE> cscott: in theory, wikis can pick specific statements over the default for a specific property, or they can filter by well-known authorities being cited as sources. i don't think this is actually being done, but the data model is specifically designed to allow this
8021:15:29 <yurik> cscott, the idea here is to provide the most basic usage that will cover 80%. If we get strong desire for 1) multi-wiki storage, or 2) multi-wiki overrides, we can always add that
8121:16:02 <yurik> TimStarling, correct
8221:16:17 <yurik> TimStarling, more specifically, Lua will get the json as table
8321:16:40 <DanielK_WMDE> yurik: i see some overlap between the use cases of wikidata queries and tabular data. it would be nice if the formats and interfaces would be very similar, if not identical.
8421:16:40 <yurik> so that it has access to all the meta fields
8521:16:47 <yurik> BUT, we can provide additional helpers to resolve the multi-lingual resolution
8621:16:47 <cscott> again, not many people have grokked T114454 yet, but the basic idea there is to separate code, content, and presentation, so every template potentially will have a "data" component, along with the "code" and the "presentation" component.
8721:16:47 <stashbot> T114454: [RFC] Visual Templates: Authoring templates with Visual Editor - https://phabricator.wikimedia.org/T114454
8821:17:02 <yurik> DanielK_WMDE, agree - we have discussed it briefly with Lydia_WMDE
8921:17:41 <cscott> assuming some basic separation of that happens in the future, we'll want the data namespace to be roughly on par with the template namespace. ie, shadowed from a default on commons, override-able from specific wikis.
9021:17:56 <YairRand_> maybe have data blobs be a wikidata datatype, like commons file
9121:18:05 <DanielK_WMDE> yurik: to be compatible with wikidata, the representation of data values would have to become more complex. but i'm not sure whether we should require that, or offer it as an optional feature.
9221:18:12 <DanielK_WMDE> yurik: btw, multilingual is pretty complex when it gets to the nitty gritty. do you think it's really needed from the start?
9321:18:14 <yurik> DanielK_WMDE, lets sync up afterwards to see if we can match wikidata api with this, or if they should go different routes
9421:18:46 <brion> could also do things like instead of multilingual text, refer to a wikidata item and then look up its data by name .... though that may have perf issues with large batches
9521:18:47 <brion> :D
9621:18:49 <DanielK_WMDE> yurik: not tonight, i'm going to bed after this :) will you be at wikimania?
9721:19:06 <yurik> DanielK_WMDE, absolutely - i really think simple "multi-lingual" feature that will allow a fallback is something we need from the start
9821:19:07 <TimStarling> ideally Lua would see a read-only wrapper object like what is returned from mw.loadData()
9921:19:12 <DanielK_WMDE> brion: yes, wikidata Q-id is a very useful datatype to have
10021:19:16 <yurik> DanielK_WMDE, sadly, no Wikimania for me - no budget :(
10121:19:28 <TimStarling> that way you don't have to clone it for each #invoke instance
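
The read-only wrapper idea raised here can be illustrated outside Scribunto. The following is only a rough Python sketch, not Lua and not the extension's code: `load_shared_data`, `_freeze` and the sample page are hypothetical names invented for illustration. It decodes a JSON blob once and hands every caller the same read-only view, so nothing needs to be cloned per caller.

```python
import json
from functools import lru_cache
from types import MappingProxyType

# Hypothetical stand-in for a stored Data: page (not a real page).
_PAGES = {"Example.tab": '{"license": "CC0", "rows": [["a", 1], ["b", 2]]}'}


def _freeze(value):
    """Recursively convert decoded JSON into read-only containers."""
    if isinstance(value, dict):
        return MappingProxyType({k: _freeze(v) for k, v in value.items()})
    if isinstance(value, list):
        return tuple(_freeze(v) for v in value)
    return value


@lru_cache(maxsize=None)
def load_shared_data(title):
    """Decode a page once; later calls return the same frozen object."""
    return _freeze(json.loads(_PAGES[title]))


first = load_shared_data("Example.tab")
second = load_shared_data("Example.tab")
```

Because the object is shared, any attempt to assign into it raises `TypeError` instead of silently changing what other callers see.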
10221:19:39 <DanielK_WMDE> yurik: sad. let's find another time and place then.
10321:20:01 * aude wavez
10421:20:10 <DanielK_WMDE> yurik: it would be nice to re-use the language-fallback stuff we have in wikidata. we should factor it out into a library, i guess.
10521:20:15 <yurik> DanielK_WMDE, indeed it is :( Yes, lets. I will schedule a hangout with you. Anyone else - pls poke me if you want to partake
10621:20:19 <DanielK_WMDE> hey aude
10721:20:33 <yurik> DanielK_WMDE, sure, but i already have something like that for the zero banners that i'm reusing
10821:20:35 <yurik> but sure
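
A simple multilingual fallback of the kind discussed here might look roughly like the sketch below. It is an assumption-laden illustration in Python, not the Wikidata or JsonConfig implementation: `resolve_text`, the `FALLBACKS` chains and the sample label are all hypothetical.

```python
def resolve_text(value, lang, fallbacks=None, default_lang="en"):
    """Pick the best string from a {language code: text} mapping.

    Plain strings are returned unchanged. Otherwise the requested
    language is tried first, then its fallback chain, then default_lang.
    """
    if isinstance(value, str):
        return value
    chain = [lang] + list((fallbacks or {}).get(lang, [])) + [default_lang]
    for code in chain:
        if code in value:
            return value[code]
    return None  # assumption: no match means "no text", not an error


# Illustrative fallback chains only; real chains would come from configuration.
FALLBACKS = {"de-at": ["de"], "pt-br": ["pt"]}
label = {"en": "Population", "de": "Bevölkerung"}
```

With these inputs, `resolve_text(label, "de-at", FALLBACKS)` falls back to the German text, and a language with no chain falls through to English.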
10921:20:58 * robla returns to the meeting he's allegedly chairing :-)
11021:21:11 <SMalyshev> yurik: so what do you think about having higher-level API to manipulate specifically the tabular data?
11121:21:17 * yurik thinks chairing !== participating ;)
11221:21:40 <yurik> SMalyshev, "manipulate" is out of the scope i think at this point. I'm all for it though :)
11321:21:54 <DanielK_WMDE> yurik: do you want to start out with your own (simpler) data types for now (and spend time to specify them properly)? or do you want to go with the representation that is used by the wikidata? that's already in reusable libraries
11421:21:55 <SMalyshev> yurik: thinking forward :)
11521:21:57 <yurik> especially because i can totally see some pages being custom-defined to store data in the backing SQL
11621:22:03 <yurik> SMalyshev, ^
11721:22:38 <yurik> DanielK_WMDE, i would like to match datatypes in wikidata as much as possible, but probably only provide a subset of them from the start
11821:23:04 <SMalyshev> yurik: well, there's two venues here: a) run external query, store results on wiki (I don't want to know too much details about how wiki stores it)
11921:23:14 <DanielK_WMDE> yurik: yea, an "table aware" storage backend is an interesting idea, and fits in with the blob-store refactoring i'm thinking about. but it's for later.
12021:23:14 <yurik> if that means re-implementing some of it first - lets, because otherwise we might spend years making it perfect only to realize that community needs are totally orthogonal to wikidata usage
12121:23:25 <SMalyshev> yurik: and b) represent internal query as data set on wiki (e.g. WDQS query)
12221:23:47 <yurik> SMalyshev, yep, that's what DanielK_WMDE is talking about i think. But again, lets not discuss it now :)
12321:23:49 <SMalyshev> all that needs clean API so that clients don't know too much
12421:23:57 <yurik> otherwise we might be redesigning SQL engine next ;)
12521:24:16 <SMalyshev> that's why I mention it - if we make it too specific now, it'd be hard to change it ater
12621:24:18 <SMalyshev> *later
12721:24:47 <yurik> SMalyshev, that i agree. But remember, the use case here is for Lua to GET EVERYTHING, and deal with it. If we say we want SQL-like GET EVERYTHING THAT MATCHES THE WHERE CLAUSE, we might get into all sorts of weird issues
12821:25:13 <yurik> especially because we might go the route that is not needed (yet or ever)
12921:25:26 <brion> huge-data-set needs are quite different yes
13021:25:27 <DanielK_WMDE> yurik: ok then. have you looked at how wikidata represents data values? e.g. look at https://www.wikidata.org/wiki/Special:EntityData/Q42.json
13121:25:38 <SMalyshev> yurik: Lua is a good enough API if we don't get too specific about the structure
13221:26:08 <yurik> exactly brion - that's what we actually discussed in the task earlier - dealing with large datasets is a very different beast, with a different reqs
13321:26:08 <aude> DanielK_WMDE: i'm not sure about duplicating all the json for each value
13421:26:12 <DanielK_WMDE> yurik: e.g. we have something like {"snaktype":"value","property":"P577","datavalue":{"value":{"time":"+2002-01-01T00:00:00Z","timezone":0,"before":0,"after":0,"precision":9,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"},"type":"time"},"datatype":"time"}
13521:26:13 <stashbot> P577 (An Untitled Masterwork) - https://phabricator.wikimedia.org/P577
13621:26:24 <aude> if all the values have the same calendar, e.g.
13721:26:30 <DanielK_WMDE> yurik: the "value" thing is what i think should be in your table fields.
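
To make the suggestion concrete, the snippet below takes the snak quoted in the log and pulls out the inner "value" object that would sit in a table field. It is only a sketch in Python; `cell_from_snak` is a hypothetical helper, not part of any Wikibase library.

```python
import json

# The snak quoted in the discussion, verbatim.
snak_json = (
    '{"snaktype":"value","property":"P577","datavalue":{"value":'
    '{"time":"+2002-01-01T00:00:00Z","timezone":0,"before":0,"after":0,'
    '"precision":9,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"},'
    '"type":"time"},"datatype":"time"}'
)


def cell_from_snak(snak):
    """Return (type, value) for a value snak, or None for other snak types."""
    if snak.get("snaktype") != "value":
        return None
    datavalue = snak["datavalue"]
    return datavalue["type"], datavalue["value"]


cell_type, cell_value = cell_from_snak(json.loads(snak_json))
```

Here `cell_value` is the six-key time object; the surrounding property and snak metadata are dropped, which is the part aude's verbosity concern is about.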
13821:26:54 <aude> then have calendar, before /after + array of timestamps?
13921:27:15 <DanielK_WMDE> yurik: the "time" data type is kind of a nice nasty example. its json representation really isn't too great, and i'd love to change it... we'll probably have to use a new type id for the new version, not sure yet
14021:27:24 <aude> maybe precision might vary though
14121:27:37 <yurik> DanielK_WMDE, i would prefer to go with the tabular data as defined by the industry for tabular data (see the bug), but for specific datatypes like time - sure
14221:28:01 <DanielK_WMDE> aude: yea, i at least wouldn't duplicate the time. we could have a "defaults" row, that gets merged into every value.
14321:28:11 <aude> the json is very verbose
14421:28:12 <DanielK_WMDE> a bit hacky, but would work...
14521:28:21 <aude> just saying...
14621:28:24 <yurik> DanielK_WMDE, btw, time is not part of this proposal just yet :)
14721:28:32 <yurik> too complex to have it in ver1
14821:28:42 * aude agrees with yurik
14921:28:49 <aude> start simple
15021:29:06 <yurik> it can be easily added later - simply add a new type, and make the value object mean as DanielK_WMDE described above
15121:29:22 <DanielK_WMDE> yurik: yea, sure. for string-based types, simple literals work fine. for numbers, too. once we get into measured quantities, things get more complex
15221:29:38 <yurik> DanielK_WMDE, default row is fairly complex - should it be the "null" that gets used as the default?
15321:29:41 <aude> counts could be ok
15421:29:45 <SMalyshev> oh let's not get into units :)
15521:30:07 <yurik> agree, lets get back to overall strategy :)
15621:30:13 <DanielK_WMDE> yurik: we should make sure to avoid naming conflicts. if you define a type name and use it with a different format than wikibase does, that will become annoying
15721:30:21 <robla> #info much of the first half of the discussion was about defining datatypes
15821:30:24 <yurik> DanielK_WMDE, agree
15921:31:14 <DanielK_WMDE> yurik: the idea with the defaults row was that you can e.g. say that all dates in a column use the same calendar, or all coordinates refer to earth, without having to repeat that info for every field. but that's an optimization that can be added later.
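
The defaults-row idea could work roughly as sketched below. This is a hypothetical Python illustration, since no such format was specified in the meeting: `apply_defaults` and the layout (one defaults entry per column, object-valued cells) are assumptions. Cell values override the column default, and non-object cells, including null, are left untouched.

```python
def apply_defaults(defaults, rows):
    """Merge per-column default attributes into every object-valued cell."""
    merged_rows = []
    for row in rows:
        merged = []
        for column_defaults, cell in zip(defaults, row):
            if isinstance(cell, dict) and column_defaults:
                # Keys present in the cell win over the column default.
                merged.append({**column_defaults, **cell})
            else:
                merged.append(cell)  # strings, numbers and None pass through
        merged_rows.append(merged)
    return merged_rows


# Assumed layout: no defaults for column 0, shared calendar for column 1.
defaults = [
    None,
    {"calendarmodel": "http://www.wikidata.org/entity/Q1985727", "precision": 9},
]
rows = [
    ["first", {"time": "+2002-01-01T00:00:00Z"}],
    ["second", {"time": "+2003-05-01T00:00:00Z", "precision": 10}],
]
full = apply_defaults(defaults, rows)
```

The second row keeps its own precision of 10 while still inheriting the calendar model, which is the saving the defaults row is meant to provide.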
16021:31:49 <yurik> ver1 datatypes: string, numbers, multilingual strings. I don't even know if i want to allow bools for now.
16121:32:11 <robla> #info DanielK_WMDE and yurik agree to try to avoid naming conflicts (e.g. with wikibase types)
16221:32:13 <aude> multilingual gets somewhat complex
16321:32:23 <yurik> these three types should cover almost 90% of the usecases from the start - simply because it will be Lua doing the processing and presenting of the dat
16421:32:24 <yurik> data
16521:32:25 <SMalyshev> yurik: bools are just numbers 1/0. Or strings yes/no :)
16621:32:33 <aude> and numbers (units? no units / counts?)
16721:32:34 <DanielK_WMDE> yurik: i think the way you represent multilingual is different from what the DataValues lib does. but wikidata doesn't use multilingual yet, so it can be changed
16821:32:44 <yurik> aude, simple JSON numbers
16921:32:55 <yurik> which means if you want units, you add a string column
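
The three ver1 datatypes named above (strings, plain JSON numbers, multilingual strings) can be checked with very little code. The sketch below is a Python illustration under stated assumptions: the `fields`/`rows` layout and the type names `string`, `number` and `localized` are hypothetical, not the actual JsonConfig schema. Units are carried in a separate string column, as suggested.

```python
def check_cell(kind, cell):
    """Return True if a cell matches one of the three assumed ver1 types."""
    if cell is None:
        return True  # assumption: empty cells are allowed
    if kind == "string":
        return isinstance(cell, str)
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(cell, (int, float)) and not isinstance(cell, bool)
    if kind == "localized":
        return isinstance(cell, dict) and all(
            isinstance(k, str) and isinstance(v, str) for k, v in cell.items()
        )
    return False


def validate(table):
    """Return a list of (row, column) positions that fail the type check."""
    kinds = [field["type"] for field in table["fields"]]
    errors = []
    for r, row in enumerate(table["rows"]):
        if len(row) != len(kinds):
            errors.append((r, None))  # wrong number of cells in this row
            continue
        errors.extend(
            (r, c) for c, (kind, cell) in enumerate(zip(kinds, row))
            if not check_cell(kind, cell)
        )
    return errors


table = {
    "fields": [
        {"name": "label", "type": "localized"},
        {"name": "length", "type": "number"},
        {"name": "unit", "type": "string"},  # units as a plain string column
    ],
    "rows": [
        [{"en": "Bridge", "de": "Brücke"}, 120.5, "m"],
        [{"en": "Tunnel"}, True, "m"],  # a bool is not a ver1 number
    ],
}
problems = validate(table)
```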
17021:32:56 <aude> yurik: like counts
17121:33:00 <SMalyshev> aude: I don't think we need units and associated headache. We haven't properly figured them on wikidata even
17221:33:14 <aude> SMalyshev: that's why i am asking :)
17321:33:16 <DanielK_WMDE> robla: i think data types are a crucial issue. but i agree that we should leave some room for other topics ;)
17421:33:41 <yurik> DanielK_WMDE, lets make this part of our wikidata-jsonconfig sync up meeting
17521:34:06 <yurik> are there any other issues that people are concerned about?
17621:34:22 <aude> yurik: btw, maybe you can visit berlin before SOTM in belgium?
17721:34:27 * DanielK_WMDE thinks that values from a query api will actually be full "snaks"...
17821:34:30 <SMalyshev> yurik: are there any limits on how big it can get?
17921:34:36 <aude> and we can talk more of the details
18021:34:44 <yurik> SMalyshev, 2mb - same as a wiki page
18121:34:50 <DanielK_WMDE> oh, a visit sounds nice!
18221:34:53 <SMalyshev> ok
18321:34:55 <yurik> because it uses storage engine
18421:35:36 <DanielK_WMDE> do you think we will want to expand to very large data sets later?
18521:35:37 <robla> yurik: can/should you formalize T120452 as an ArchCom-RFC?
18621:35:38 <stashbot> T120452: Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML) - https://phabricator.wikimedia.org/T120452
18721:35:53 <yurik> DanielK_WMDE, aude, do we really want to wait until september to deploy this? JsonConfig has been in production for the past 2 years, for all wikis (as part of the zero system)
18821:35:55 <robla> (at least the technical side)?
18921:36:28 <yurik> DanielK_WMDE, i don't want to tackle large datasets until after this thing has had some usage, e.g. half a year
19021:36:30 <robla> (perhaps T134426 is the right one to focus on)
19121:36:30 <stashbot> T134426: Review shared data namespace (tabular data) implementation - https://phabricator.wikimedia.org/T134426
19221:36:51 <DanielK_WMDE> yurik: that's a pretty brisk pace ;)
19321:37:04 <yurik> DanielK_WMDE, agree, i will wait a year until large datasets :D
19421:37:05 <DanielK_WMDE> if you really want to support data sets by then, you better start thinking about that early
19521:37:33 <yurik> but yes, it should be in the back of our minds, but shouldn't be fully speced until later
19621:38:03 <yurik> robla, i think there is another task that formalizes how the system works
19721:38:05 * yurik looks
19821:39:13 * robla waits patiently
19921:40:23 <DanielK_WMDE> how about directly transcluding a table into a wiki page? how would that work? do we need that? or do we rely on lua for that?
20021:40:29 <yurik> robla, i think its in https://www.mediawiki.org/wiki/Extension:JsonConfig/Tabular
20121:40:55 <yurik> DanielK_WMDE, even though i do have it implemented (as a template expansion), i don't think its a usecase
20221:41:02 <yurik> simply because there is really no big reason for it
20321:41:18 * robla notes that the Extension:JsonConfig points to T120452
20421:41:18 <yurik> it is always very usage dependent - e.g. show a list generated from a table
20521:41:36 <DanielK_WMDE> yea, you'd always want some custom stuff anyway
20621:41:39 <robla> yurik: is T120452 the right Phab task?
20721:41:40 <stashbot> T120452: Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML) - https://phabricator.wikimedia.org/T120452
20821:42:35 <DanielK_WMDE> can we remove the formats from the title? i think they are misleading now.
20921:43:09 <yurik> DanielK_WMDE, agree, but please keep in mind that as part of this discussion i would really like geojson (maps overlays) to be agreed on as well
21021:43:20 <yurik> actually geojson is much simpler than tabular
21121:43:33 <yurik> it is a well established format, and we are already heavily using it in maps
21221:43:36 <robla> yurik: what Phab task do you want to declare as an ArchCom-RFC?
21321:43:55 <yurik> robla, that one is fine i think - we need some refining to the title and description
21421:43:55 <cscott> I wonder if extensions are really the right way to select data type?
21521:44:06 <yurik> cscott, what do you mean?
21621:44:22 <cscott> since there is some discussion of types, for instance, it might be that we start with a very simple "json" but later have a more typeful "json" with the date-type figured out, etc.
21721:44:31 <cscott> .json is going to get overloaded quickly
21821:44:40 <cscott> mime types would be much nicer
21921:44:47 <cscott> but then that begs the question of where they get stored
22021:44:51 <yurik> cscott, i actually don't want .json - it will be heavily misused from the beginning, no?
22121:44:58 <cscott> still, storing the data type separately from the data/article name is not a bad thing.
22221:44:59 <yurik> and we won't be able to do some proper editing
22321:45:15 <DanielK_WMDE> cscott: internally, it will be represented as a content model id. the extension is one way to indicate that.
22421:45:17 <yurik> if we from the beginning define a rigid structure, we can add useful tools
22521:45:33 <yurik> so for tabular, VE can have a nice editor of values (like a spreadsheet)
22621:45:39 <SMalyshev> .json is way too generic... I think it'd be better if tables and geojson had their own spaces
22721:45:40 <yurik> actually it won't even be a VE on commons
22821:45:40 <cscott> DanielK_WMDE: will we be able to eventually just associate a mime type with the content?
22921:45:43 <DanielK_WMDE> hm... will geo-shapes and tables live in the same namespace? with different suffixes/extensions?
23021:45:53 <robla> yurik: If that becomes an ArchCom-RFC, then you won't be the assignee, and Danny Horn will be the author. Is that the desired outcome?
23121:45:56 <yurik> DanielK_WMDE, yes
23221:46:05 <cscott> i don't mind Data: as the namespace. i'd rather have that than GeoJson: Tables: etc etc
23321:46:14 <yurik> robla, i hear you, ok, i will create a new task
23421:46:37 <yurik> DanielK_WMDE, example: Data:Don Qixote Trip in Spain.geojson
23521:46:37 <DanielK_WMDE> cscott: a content model id, not a mime type. the mime type specifies a serialization format, like json or xml. that's also stored, but kind of redundant. the important info is what model/vocabulary/scheme the data is using.
23621:46:45 <cscott> fwiw Scribunto/JS has this same issue -- there's no way to specify which *language* the module is in, in the Module: namespace.
23721:46:49 <MaxSem> I agree with cscott
23821:46:51 <DanielK_WMDE> cscott: we already do that. that's how contenthandler works.
23921:46:57 <brion> my one concern about separating type is that if a table changes type, will that break usage? :)
24021:47:11 <yurik> that's why we from the beginning define extensions
24121:47:13 <robla> #info yurik agrees to create a Phab task for use as an ArchCom-RFC
24221:47:19 <brion> (eg if you change an image from .png to .svg you can still use it the same from wiki side, but for tables it may matter more)
24321:47:22 <cscott> brion: possibly, but that's no different from a rename breaking usage, or any other edit breaking usage.
24421:47:28 <yurik> JsonConfig will be set up to only allow pages that match REGEX
24521:47:32 <brion> *nod*
24621:47:36 <DanielK_WMDE> cscott: you could indeed use a file extension to indicate whether a module is JS or Lua. Just add .js or .lua
24721:47:40 <brion> and we really should rename JsonConfig ;)
24821:47:44 <yurik> so it will be Data:.*\.tab
24921:47:54 * cscott is not a fan of file name extensions
25021:47:56 <yurik> no other pages will be creatable in the data namespace
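
The title rule described here (only pages matching a pattern can be created, and the suffix picks the content model at creation time) can be sketched as follows. This is a Python illustration, not the JsonConfig configuration: the model ids `tabular` and `geojson` and the function `model_for_new_page` are hypothetical, and the patterns use `.+` rather than the `.*` quoted in the log so that an empty page name is rejected.

```python
import re

# Hypothetical mapping from title pattern to content model id.
MODEL_BY_PATTERN = [
    (re.compile(r"^Data:.+\.tab$"), "tabular"),
    (re.compile(r"^Data:.+\.geojson$"), "geojson"),
]


def model_for_new_page(title):
    """Return the content model for a new page, or None if not creatable."""
    for pattern, model in MODEL_BY_PATTERN:
        if pattern.match(title):
            return model
    return None


tab_model = model_for_new_page("Data:Population.tab")
json_model = model_for_new_page("Data:Notes.json")
```

As noted earlier in the discussion, the suffix would only matter when the page is created; afterwards the stored content model, not the title, decides how the page is handled.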
25121:48:00 <cscott> not i18n friendly
25221:48:04 <DanielK_WMDE> cscott: internally, that would just define the content model to use when creating the page
25321:48:06 <cscott> not human friendly, really
25421:48:18 <yurik> cscott, the only other option is to have multiple namespaces - and the community (and i personally) really hate that
25521:48:35 <brion> well, the other option is to have some sort of content model selection in the creation process
25621:48:40 <DanielK_WMDE> i kind of like to have that info in the title, cscott
25721:48:41 <brion> which implies UI etc
25821:48:44 <cscott> no, i'm just saying that the content model should be defined separately (as DanielK_WMDE indicates is already the case under the covers) and not rely on filename extensions
25921:48:45 <yurik> brion, sure, that can also work
26021:49:15 <cscott> DanielK_WMDE: but the info in the title doesn't mean anything unless you speak english -- or "hacker english" at least
26121:49:29 <cscott> and "geojson" doesn't really mean anything to even english speakers
26221:49:41 <DanielK_WMDE> cscott: i like to do both. we *can* handle different models without any indicator in the title, but it's *nice* to have that indicator there. we already do this for .css and .js in the MediaWiki and User namespaces
26321:49:56 <yurik> cscott, brion, we could create an elaborate system for model selection - is that an absolute blocker/requirement? I really feel that since data will be very technically oriented, people will actually find it better usable
26421:50:01 <cscott> DanielK_WMDE: that will probably have to be good enough for now.
26521:50:02 <yurik> just like we have File:Blah.json
26621:50:16 <yurik> exactly
26721:50:21 <yurik> i really like that indicator
26821:50:44 <TimStarling> "geojson" hopefully means something to the people who are writing lua modules
26921:50:45 <cscott> yurik: i'd just like it clear during the document/evangelization process that filename extensions may be a convenient *shortcut* for specifying the data type, they are only a stopgap and not strictly speaking required. (especially if your native language is not english)
27021:50:47 <brion> i'm happy enough with extensions given the existing ecosystem
27121:51:09 <cscott> hopefully we'll eventually have more robust article metadata editors, so you can just directly edit the content model
27221:51:12 <yurik> remember that we are targeting a very tech savvy community with this until a nice editor system is in place. And when it is, I wouldn't mind a VE to edit the data remotely, without even switching to commons (like we do in Wikidata)
27321:51:28 <brion> mmmm, spreadsheet editor
27421:51:30 <cscott> brion: and i'm lobbying against them based on where i'd like to see the ecosystem eventually go. ;)
27521:51:31 <robla> cscott: file extensions and file types are tied up with one another, despite years of standards bodies trying to make that not be true
27621:51:33 <yurik> brion, exactly
27721:51:37 <DanielK_WMDE> i notice we are getting close to the end of the meeting.
27821:51:46 <DanielK_WMDE> are there any thoughts or comments about geojson?
27921:51:53 <yurik> brion, T134618
28021:51:53 <stashbot> T134618: Implement spreadsheet-like cell editing for tabular data - https://phabricator.wikimedia.org/T134618
28121:51:54 <DanielK_WMDE> yurik: how do you render geo shapes?
28221:51:56 <cscott> "tech savvy community" == we systematically exclude potential community members who are not tech savvy
28321:51:59 <cscott> that's what i hear, at least
28421:52:04 <brion> and if you really want to have fun with file extensions <-> type, try dealing with video containers vs codecs! </runs away>
28521:52:19 <yurik> DanielK_WMDE, easy - you just put that geojson inside the <mapframe>...</> wikitext element :)
28621:52:34 <robla> brion: amen
28721:53:06 <brion> cscott: a legit concern, yes
28821:53:15 <DanielK_WMDE> yurik: so there is a hard dependency on the maps extension?
28921:53:16 <yurik> cscott, i am by no means trying to exclude them, but rather to understand the users. The non-tech-savvy community is the one that will provide the most value (simply because there is probably a bigger non-tech-savvy community there), but we should make it nicer and easier for them.
29021:53:17 <brion> usability will become a bigger concern once there are tools built up on top of this system
29121:53:26 <yurik> DanielK_WMDE, when supporting geojson as storage - yes
29221:53:29 <cscott> so long as the file extensions aren't baked hard into the design, i'm happy.
29321:53:32 <brion> eg if you already have graphing/table-formatting templates+lua modules ready to use
29421:53:34 <DanielK_WMDE> yurik: what do you do if it's not there? just show json as text?
29521:53:36 <brion> and a good editor
29621:53:41 <DanielK_WMDE> would be ok-ish, i guess
29721:53:42 <cscott> just like i'm happy so long as we can *eventually* enable shadow namespaces or instantcommons on this
29821:53:44 <yurik> DanielK_WMDE, we could - as a backup
29921:54:03 <yurik> cscott, i am having very big doubts about shadow namespaces to be honest
30021:54:05 <DanielK_WMDE> cscott: +1
30121:54:06 <robla> I think the file extension issue needs to go to wikitech-l
30221:54:09 <yurik> but that's a separate discussion :)
30321:54:22 <cscott> yurik: instantcommons then. or data: namespaces on every wiki. what you will.
30421:54:55 <yurik> cscott, i'm not against it, just doubting the long term viability of it ;)
30521:55:10 <cscott> i have faith in kartik ;)
30621:55:11 <yurik> but again, we can totally support it if we decide that's the way forward
30721:55:18 <DanielK_WMDE> can we confirm that geojson is good to go? i have no objection, but i also know next to nothing about it
30821:55:32 <robla> so yurik, thanks for bringing this conversation up on wikitech-l generally. I think there's a lot more to discuss here...and I'm not sure how to do it
30921:55:33 <DanielK_WMDE> is anyone around who actually knows something about geojson?
31021:55:37 <cscott> sure. that's all i'm lobbying for at the moment. leave space for the future, don't do anything that would make it impossible later.
31121:55:39 <yurik> DanielK_WMDE, https://www.mediawiki.org/wiki/Help:Extension:Kartographer
31221:55:40 <brion> so the alternative on the extension is probably "don't enforce an extension, have everything in the Data: namespace be this tabular format _for now_"
31321:55:42 <MaxSem> DanielK_WMDE, /me
31421:55:43 <yurik> it has a geojson sample
31521:55:43 <robla> DanielK_WMDE: I don't want to confirm anything in this meeting
31621:55:51 <brion> with an eventual UI/API extension for picking different content model
31721:56:14 <SMalyshev> DanielK_WMDE: well, I know a little about it... nothing that would prevent us from having it on wiki as format :)
31821:56:17 <DanielK_WMDE> robla: ok, check that there are no objections at this time ;)
31921:56:28 <yurik> brion, i'm not too happy about that - i would much rather say "for now, lets only allow pages in the Data: that match the extension"
32021:56:39 <yurik> this way we can put geojson there as well
32121:56:43 <yurik> and other formats
32221:56:49 <brion> is geojson ready to go?
32321:56:54 <yurik> brion, yep
32421:56:57 <brion> ah fun
32521:56:57 <yurik> it is much easier
32621:57:02 <DanielK_WMDE> robla: as in humming ;)
32721:57:06 * robla doesn't feel like he understands what's being proposed well enough to have had a chance to object
32821:57:08 * aude and some other people implemented geojson content handler in zurich
32921:57:14 <aude> 2 years ago?
33021:57:14 <yurik> geojson is very straightforward - we already have it as part of kartographer ext
33121:57:27 <cscott> i'd say "the content model of the page is defined at page creation time by the extension. but nothing after that point tries to parse the article title for an extension"
33221:57:37 <yurik> cscott, agree
33321:57:43 <aude> not sure exactly how it would work now, but think it's not too complex
33421:57:53 <DanielK_WMDE> cscott: yes, absolutely.
33521:57:58 <brion> cscott: that seems sensible yeah
33621:58:04 <cscott> it also potentially means you could work around the need for an extension by sneaky renames. ;)
33721:58:10 <brion> and allows for the future to drop the extension at creation time
33821:58:12 <yurik> cscott, the only limitation - jsonconfig will not allow renaming if the target page name does not match the original regex
33921:58:13 <DanielK_WMDE> that's how it works for .js and friends
34021:58:25 <cscott> in lieu of having a proper direct edit mechanism for the content model
34121:58:31 <yurik> yep
34221:58:55 <cscott> yurik: yeah, i'm okay with the rename limitation for now. i just don't want the code to have regexp matches against the page title scattered everywhere.
34321:58:59 <robla> we're running out of time. very good discussion; I think I know how to pull open questions out, but I'm not volunteering to do it.
34421:59:13 <aude> not sure we need hard dependency on kartographer
34521:59:15 <yurik> cscott, oh, thats not there . The content id is stored with the page
34621:59:18 <brion> for the .js/.css subpages we also just have predictable naming which is what the exts are for, something not relevant for the primary item
34721:59:20 <DanielK_WMDE> yurik: people got upset when they couldn't rename a misnamed foo.jd to foo.js, because the content model mismatched ;) so now they can re-declare the content. a bit scary...
34821:59:22 <yurik> aude, its a soft dep
34921:59:36 <aude> yurik: or some generic stuff could be separated and used by both things
35021:59:38 <yurik> just like many extensions depend on syntax highlighter
35121:59:38 <robla> yurik: where should people who are interested continue this discussion?
35222:00:47 <yurik> robla, i guess I should create a new task "deploy" ?
35322:00:52 * robla plans to type "#endmeeting" by 22:05 UTC
35422:00:54 <yurik> as we discussed earlier ?
35522:01:05 <cscott> (brion: .js/.css subpages are a bit weird since browsers and webservers still do content-type sniffing based on url extension and other factors; that shouldn't be relevant to the data namespace which is for internal mediawiki use, not for directly serving to web browsers)
35622:01:11 <robla> yurik: could you file a quick placeholder task?
35722:01:19 <yurik> sec
35822:01:32 * robla wishes Phab allowed reassigning the submitter
35922:01:51 <brion> cscott: when we serve them as JS/CSS content it's through RL's load.php; their URLs don't end in .js or .css at all :)
36022:02:52 <yurik> https://phabricator.wikimedia.org/T137929
36122:03:00 <robla> thanks!
36222:03:22 <robla> #info conversation will continue at https://phabricator.wikimedia.org/T137929
36322:03:24 <yurik> https://phabricator.wikimedia.org/T137930
36422:03:31 <yurik> robla, ^ geojson
36522:03:37 <yurik> should there be one common one?
36622:03:51 <yurik> that discusses the underlying tech? like extensions, etc
36722:03:56 <robla> #link https://phabricator.wikimedia.org/T137930 geojson
36822:04:08 <aude> thanks yurik :)
36922:04:22 <aude> suppose maybe we can also talk at SOTM US :)
37022:04:36 <yurik> ok, if needed, i will create another task later
37122:04:41 <robla> let's treat T137929 as the parent task
37222:04:41 <stashbot> T137929: Enable shared tabular data storage on a shared wiki - https://phabricator.wikimedia.org/T137929
37322:04:54 <robla> ok....let's end the meeting
37422:04:57 <yurik> :)
37522:05:04 <robla> thanks all!
37622:05:08 <yurik> thanks robla!
37722:05:08 <robla> #endmeeting
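
The rule cscott, yurik and DanielK_WMDE converge on in the log (the title pattern picks the content model once, at creation; the model is then stored with the page; renames are only allowed onto a title matching the original pattern) can be sketched roughly as below. This is a minimal illustration in Python, not JsonConfig's actual code. Only the Data:.*\.tab pattern comes from the discussion; the .geojson rule, the model names, and the names Page, create_page and rename_page are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical rules mapping a title pattern to a content model name.
# Only the Data:.*\.tab pattern is taken from the discussion; the
# .geojson rule and both model names are illustrative assumptions.
TITLE_RULES = [
    (re.compile(r"Data:.*\.tab"), "tabular"),
    (re.compile(r"Data:.*\.geojson"), "geojson"),
]


@dataclass
class Page:
    title: str
    content_model: str  # stored with the page, not re-derived on later reads


def _matching_rule(title):
    """Return the first (pattern, model) rule that fully matches the title, or None."""
    for pattern, model in TITLE_RULES:
        if pattern.fullmatch(title):
            return pattern, model
    return None


def create_page(title):
    """Pick the content model from the title once, at creation time."""
    rule = _matching_rule(title)
    if rule is None:
        raise ValueError(f"title is not creatable in the Data namespace: {title!r}")
    return Page(title=title, content_model=rule[1])


def rename_page(page, new_title):
    """Allow a rename only if the new title matches the page's original pattern."""
    rule = _matching_rule(page.title)
    if rule is None or not rule[0].fullmatch(new_title):
        raise ValueError(f"rename target does not match the original pattern: {new_title!r}")
    # The stored content model is carried over unchanged.
    return Page(title=new_title, content_model=page.content_model)
```

With these assumptions, create_page("Data:Population.tab") yields a page whose stored model is "tabular", and renaming it to "Data:Population.geojson" is rejected. Code that reads the page afterwards consults content_model rather than matching the title again, which is the property cscott asks for.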

daniel renamed this event from ArchCom RFC Meeting: Technical aspects of Data namespace blob storage (2016-06-15, #wikimedia-office) to ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office). Nov 21 2016, 6:11 PM
daniel changed the host of this event from RobLa-WMF to daniel.
daniel updated the event description.
daniel updated the event description. Dec 9 2016, 7:43 AM
daniel renamed this event from ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting: Technical aspects of Data namespace blob storage (2016-06-15, #wikimedia-office).