
Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...)
Closed, Resolved · Public

Description

It would be good to have somewhere central that tabular datasets could be uploaded to -- e.g. the data to be plotted by the Graph extension.

The data should be uploadable and exportable in a variety of standard formats -- e.g. CSV, TSV, JSON, XML -- and stored in a format-neutral way.

Wikidata is great for single values, or for values scattered across multiple items, but it's not so good for data in a systematic one-dimensional or two-dimensional form.

Jheald (talk) 13:24, 13 November 2015 (UTC)

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

This proposal received 4 support votes, and was ranked #88 out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Commons#Allow_tabular_datasets_on_Commons_.28or_some_similar_central_repository.29

Which wiki to use for storage is discussed at https://meta.wikimedia.org/wiki/User:Yurik/Storing_data

Related Objects

Event Timeline


We already have pretty robust support for pages containing text.

Not if your data set is 100 MB in size. I think this bug maybe needs some scope clarification before deciding which approach is best.

I think the best way forward is to post on commons proposing to add a new namespace there.

This might not be the easiest sell to the Commons community. It's possible they might like the idea, but I would recommend thinking hard about how this might impact them, in order to have good answers to the questions they might ask (e.g. things that would likely come up are how datasets are attributed, and whether the patrolling/anti-vandalism tools are sufficient to deal with this type of content).

ekkis added a comment.Mar 10 2016, 8:21 PM

I think there's a very valuable use case in data sets well below the 100 MB range, particularly all manner of reference data: think of a list of countries, cities within countries, telephone area codes, the names of HTML entities, colour names/codes, et cetera... or hierarchies used for classification purposes, as in banking.

The reason I made this request to begin with is that I took the trouble to extract data that was already available on Wikipedia, but presented as an HTML table. I formatted it and then wanted to share it as a TSV, which is a very light format (but JSON would be awesome too), so that it could be consumed easily by anyone -- e.g. a script can FTP the data right from the source and consume it.

That helps developers save each other time and a great deal of effort.

I think there's a very valuable use-case in data sets well below the 100mb range

That was just a general example, because it wasn't clear to me whether this was being sold as a general-purpose solution for all data sets. 4 MB is the actual limit for wiki pages, and honestly, things get kind of sucky once you go above 1 MB.

As an aside, I'd note that if we're making up our own custom data formats, Lua already has a load-data mechanism (mw.loadData) for loading data sets formatted as Lua tables.
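As a rough illustration of that approach (the module and page names below are hypothetical; mw.loadData itself is a real Scribunto function): a data page in the Module namespace just returns a plain Lua table, and any other module reads it via mw.loadData.

-- Module:Data/CountryCodes (hypothetical data page): it only returns a table.
return {
    { name = "France",  iso2 = "FR", callingCode = 33 },
    { name = "Germany", iso2 = "DE", callingCode = 49 },
}

-- Module:CountryList (hypothetical consumer):
local p = {}

function p.callingCode(frame)
    -- mw.loadData parses the data module once per page view and returns a
    -- read-only table, so repeated lookups stay cheap.
    local countries = mw.loadData('Module:Data/CountryCodes')
    for _, row in ipairs(countries) do
        if row.iso2 == frame.args.iso2 then
            return tostring(row.callingCode)
        end
    end
    return ''
end

return p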

ekkis added a comment.Mar 10 2016, 9:16 PM

Lua tables

I know nothing of that technology, but if it can be parsed readily in any environment, that would be fine; otherwise, TSV or JSON are very common formats.

Yurik added a comment.Mar 10 2016, 9:16 PM

@Bawolff Lua supports mw.text.jsonDecode(), which is great for these pages - if we keep the TSV-style data in a schema like

{
  "columns": [ ... ],
  "data": [
    [1, 2, 3],
    [4, 5, 6],
    ...
  ]
}
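A minimal sketch (not the extension's actual code) of consuming a page shaped like that from Lua: mw.text.jsonDecode() is a real Scribunto function, while fetching the raw JSON text is hand-waved here (mw.title.new( ... ):getContent() is one possibility).

-- Decode a JSON page shaped like the schema above and sum one column.
local function sumColumn(jsonText, columnIndex)
    local decoded = mw.text.jsonDecode(jsonText)
    local total = 0
    for _, row in ipairs(decoded.data) do
        total = total + (tonumber(row[columnIndex]) or 0)
    end
    return total
end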
-jem- added a subscriber: -jem-.Mar 25 2016, 11:48 PM

Change 281331 had a related patch set uploaded (by Yurik):
Implement tabular content support

https://gerrit.wikimedia.org/r/281331

Yurik added a comment.Apr 13 2016, 5:45 PM

The above patch allows tabular data storage with string, numeric, and "localized" column types. "Localized" is basically a language-code → string map. License and description metadata are also allowed (and more can easily be added).

Nuria edited projects, added Analytics; removed Analytics-Kanban.Apr 15 2016, 3:38 PM

Change 281331 merged by jenkins-bot:
Implement tabular content support

https://gerrit.wikimedia.org/r/281331

matmarex reassigned this task from Milimetric to Yurik.

So I guess Yurik is working on making this happen? I'm not sure what the JsonConfig extension is or why this got merged into it.

Yurik added a comment.EditedApr 19 2016, 7:48 PM

Merged; please take a look at https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Tabular_data_storage_for_Commons.21

Demo at http://data.wmflabs.org/wiki/Data:Sample.tabular?uselang=fr (try different languages)

@matmarex, JsonConfig is an extension I built a while ago for storing structured data on a wiki and making it available from another wiki. Basically it was a refactoring of my Zero code, to allow this functionality to be shared. It was already used for a Data namespace storing JSON (in a limited deployment), but I decided to make some more improvements before promoting it to a wider audience. JsonConfig is a misnomer, because it was initially intended for configurations rather than arbitrary data, even though the code is actually identical. We could of course rename it, but I don't think it is worth the hassle.

ThurnerRupert added a subscriber: ThurnerRupert.EditedApr 20 2016, 5:10 PM

Just to add another example, which might help with the zillions of sports results on Wikipedia. Take the example of Bayern Munich:

brion added a subscriber: brion.Apr 20 2016, 6:47 PM

Couple quick notes:

  • pretty cool. :)
  • I worry about efficiency of storage and queries; for small tables JSON blobs are fine, but for large data sets this'll get extremely verbose, and loading/saving small updates to a large table will get very slow. Consider providing query and update primitives that don't require the client (Lua module, wiki API client, editor form, etc.) to load and parse the entire blob, e.g. "gimme the rows with key values in this range" or "update columns A, B, and C with these values in rows with key values X, Y, and Z". This could then be extended in the future to a different storage backend, or a streaming JSON reader.
Yurik added a comment.Apr 21 2016, 2:53 AM

@brion, could you think of use cases for partial data reads? I think it will be mostly "draw data as a wiki list or a table with some magical highlighting/string concatenation/...", or "draw a graph with all the data". That's why at this point I simply provide Lua's mw.data.getData("page_name.tabular"), which returns the whole thing as JSON decoded into a Lua table structure. I see no reason not to provide mw.data.query("SELECT field1 FROM page_name WHERE field2 > 10"), but that's a whole other ball game :)

Also, adding JSON storage should provide an easy solution for lookups, e.g. mw.data.lookup("page_name.json", "top-level-field-value"), which could be very efficient for large but shallow JSON documents.
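To make the shape of that interface concrete, here is a hypothetical consumer of the proposed mw.data.getData() call -- the function name comes from the comment above and was a proposal at the time, not a shipped API, and the columns/data layout follows the schema sketched earlier in this thread.

-- Hypothetical: render a stored dataset as a wikitable.
local p = {}

function p.show(frame)
    local dataset = mw.data.getData('page_name.tabular')  -- proposed call, see above

    local out = { '{| class="wikitable"' }
    for _, column in ipairs(dataset.columns) do
        out[#out + 1] = '! ' .. column
    end
    for _, row in ipairs(dataset.data) do
        out[#out + 1] = '|-'
        for _, cell in ipairs(row) do
            out[#out + 1] = '| ' .. tostring(cell)
        end
    end
    out[#out + 1] = '|}'
    return table.concat(out, '\n')
end

return p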

brion added a comment.Apr 21 2016, 3:07 AM

Pulling individual data items out of large lists; pulling relevant columns in order to sum them; pulling or updating a small number of cells during editing; subsetting a large data set to graph the subset; subsetting a large data set to perform operations on it in a Lua module.

For a concrete example: say I have a data set that lists the voting breakdown during US primary elections for every county. That's roughly 3000 rows, each with data points for each candidate. Not too awful perhaps, so let's make a bigger table.

Population of every US census place for every 10-year census since 1790. That's probably a lot. Now add more columns for various breakdown information.

Subset queries on a large data set would allow for reading in data relevant to a map area, or a particular place, or a particular breakdown column, or a particular year range, or some combination of all these things, while having to pull in only the column list and do some sort of index scan on the key rows.

The alternative is to break down the giant table into lots of small tables, and have the client code in a Lua module or whatever 'stitch' together the multiple data sets when needing to cross boundaries.

You might say that no one should store that much data in one table, but I'm pretty sure people will push the limits of a system like this. :)
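For illustration, a rough sketch of the "stitching" workaround described above, again using the proposed (not final) mw.data.getData() call, the columns/data layout from earlier in the thread, and made-up page names.

-- Concatenate rows from several data pages that share one schema.
local function loadStitched(pageNames)
    local allRows, columns = {}, nil
    for _, pageName in ipairs(pageNames) do
        local part = mw.data.getData(pageName)  -- proposed API
        columns = columns or part.columns       -- assumes identical schemas
        for _, row in ipairs(part.data) do
            allRows[#allRows + 1] = row
        end
    end
    return { columns = columns, data = allRows }
end

-- Usage (hypothetical page names):
-- local census = loadStitched({ 'US census 1790-1890.tabular',
--                               'US census 1900-2010.tabular' })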

As I understand these are stored as regular MediaWiki pages now, so they have a maximum length of 2 MB. Even naive queries pulling the whole thing into memory would be fast enough at these scales. If we want to think about performance for large data, we should first think about overcoming the length limitation :)

brion added a comment.Apr 21 2016, 3:10 PM

As I understand these are stored as regular MediaWiki pages now, so they have a maximum length of 2 MB. Even naive queries pulling the whole thing into memory would be fast enough at these scales. If we want to think about performance for large data, we should first think about overcoming the length limitation :)

Ah, I forgot all about that. ;) That'll at least stop people from creating _super_ huge datasets for now... unless they break them into multiple files and create Lua modules to stitch them back together. :) That may be acceptable, however, and lets people prototype the kinds of crazy things they want until we hate it so much we decide we have to support them better.

I tried generating some random tables with this PHP script: https://gist.github.com/brion/46469ac2df31a8eb0e179f50b1967d20

I find I can only successfully save a file of under a megabyte even though the max is 2 megs; a 2 meg file gets rejected complaining that it's over 4 megabytes... I notice the output that comes back to me in the edit window is pretty-printed, which means a lot of indentation and newlines bulking up the storage requirements. That might be on purpose to provide a better diff view?

Seems to run reasonably fast to render locally, though it also produces a giant HTML table for the page view that's about 3.5 megs, which should at least compress ok. Editing also means resubmitting the entire data set at once with the edit form.

A partial-edit submission system similar to section editing on wiki pages might be nice to reduce submission latency and bandwidth consumption when editing a table, but that can probably wait until it needs to be paired with a spreadsheet-like UI.

brion added a comment.Apr 21 2016, 3:11 PM

Side note: headers are rejected if they contain spaces. That seems odd?

Population of every US census place for every 10-year census since 1790. That's probably a lot. Now add more columns for various breakdown information.

Argh. :-) The 'right' (but more complex) solution for that kind of problem is not to try and hack a multi-dimensional dataset into a 2D representation, but to represent it faithfully. However, AIUI that's outside the scope of this work — which I think is a pity. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/ is the work I nominally sponsored on this back in the day. It's actually an RDF encoding of the (non-SemWeb) https://sdmx.org/ standard (XML-based ISO standard, widely used in the stats community). More concretely, Excel's "PivotTables" feature is essentially mapping normalised nD data from a 2D format back into an (interactive) series of 2.5D cuts of said data. Ish.

Yurik added a subscriber: Eloy.Apr 22 2016, 3:21 PM

@brion, the pretty-printing is for diffs - it makes them much easier to digest. A while ago I also tried to make the actual storage use a non-pretty-printed version, but I guess I never finished that part.

@brion, re spaces in headers - I want header names to be identifiers - after all, the target audience is Lua scripts and Graphs, and both are programming languages. It makes them much easier to use, because I can simply say datum.buildingHeight instead of datum["building height"] in Vega or in Lua. @Eloy has made some suggestions on how to localize them, and I think we should go with the second option - "headers.i18n": [ { "en":"blah", "fr": "blah", ...}, ...] for header localization.
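As a small illustration of that localization shape (the "headers.i18n" key and its layout are just the proposal above, not a shipped format), a lookup with an English fallback might look like this:

-- Hypothetical: resolve a localized column title from the proposed
-- "headers.i18n" list, falling back to English, then to the raw header id.
local function localizedHeader(dataset, columnIndex, langCode)
    local i18n = dataset['headers.i18n']
    local titles = i18n and i18n[columnIndex]
    if titles then
        return titles[langCode] or titles.en or dataset.headers[columnIndex]
    end
    return dataset.headers[columnIndex]
end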

@matmarex, exactly - the page-size limitation restricts how these datasets are used. In a way, this is a much easier to implement and more flexible "multi-stream" proposal that will allow editors to create shared lists and tables. Giant dataset storage with a proper query interface is a separate task altogether. We want to have it, but it is a much bigger target with much bigger resource requirements.

P.S. For some reason I don't get email notifications from phab anymore :(

brion added a comment.Apr 22 2016, 3:25 PM

Re headers -- yeah, we need to distinguish between header labels (i18n-able text) and column ids (identifiers for programs). As long as the capability is there I don't mind the terms used; sounds like you're already working on that :)

Yurik added a comment.Apr 22 2016, 5:01 PM

I'm totally ok to bikeshed about the naming:

  • for ID, it will be a list of strings named: "headers", "ids", "columns", "header_id", ...
  • for localized column name, it's a list of objects, each object having (language id -> string). We can call it "columns", "labels", "headers", ... ?

I'm totally ok to bikeshed about the naming:

  • for ID, it will be a list of strings named: "headers", "ids", "columns", "header_id", ...
  • for localized column name, it's a list of objects, each object having (language id -> string). We can call it "columns", "labels", "headers", ... ?

I think "column_id" and "columns" for consistency would make sense, but I don't care much. ;)

brion added a comment.Apr 22 2016, 6:21 PM

@Yurik: The W3C CSV on the Web working group's metadata model recommendation refers to "columns" with attributes for "name" and "titles" (plural, allowing alternates or per-language variants), with similar recommended character restrictions on "name" for ease of programmatic use: https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dfn-column-description

So following the naming model would give us either separate "column_name" and "column_title" keys containing lists of strings/localized string map objs... or following the W3C data model a little more closely would give us a "column" key containing a list of objects with their own "name" and "title"(s?) props.

@Jdforrester-WMF data cube sounds awesome. :D someday later!

@Jdforrester-WMF data cube sounds awesome. :D someday later!

Yessir. :-)

Alkamid removed a subscriber: Alkamid.Apr 22 2016, 8:39 PM

More important semi-bikeshed questions:

  • How should we store licenses? Is there a license ID of any sort? I wouldn't want free-form license field text if possible.
  • Are there any other metadata fields required?
  • Should we support datetime fields in the MediaWiki datetime format? (I would love to reuse all the Wikidata data types, but apparently it is not easy to refactor them out)
TheDJ added a subscriber: TheDJ.Apr 22 2016, 11:38 PM

And how do you request deletion?

One idea is to use a multi-part ContentHandler. The Page namespace on Wikisource does this. That way, you can have a wikitext part and a JSON part. But it also potentially adds confusion.

Can the talk pages be used for deletion requests?

On Commons, usually a subpage of "Commons:Deletion requests/Page under discussion" is used for a deletion request, with a tag being placed on the page being discussed. Same on English Wikipedia, save for different names.

Yurik added a comment.EditedApr 23 2016, 12:07 PM

@JEumerus, adding a tag to the structured content (JSON) is not as obvious as it is for free-form wiki markup. We could interpret some meta fields as wiki markup...
So far this is the structure I'm going for:

{
  "license": "licence-id",  // do we have these IDs in core somewhere?
  "info": {
    "lang-id": "description in that language",  // plain text, localized
    ...
  },
  "wikitext": "Any wiki markup, including [[category:my category]] and {{delete|me}}",  // better name?
  "headers": [ "header1", ... ], // non-localized
  "titles": [
    { "en": "header one", "ru": "первая колонка", ... },  // plain text, localized
    ...
  ],
  "types": [ "string", "localized", "number", "boolean" ],
  "rows": [ [...], [...], [...] ]
}

An extra feature could be "headersRef": "Some other table.tabular" instead of "headers" and "titles", allowing headers to be defined in another table. This way many identically structured tables can benefit from the shared localization.
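A rough sketch of how a consumer might resolve that kind of metadata inheritance -- "headersRef" and the field names come from the proposal above, and mw.data.getData() is still the proposed, not final, call:

-- Hypothetical: fill in headers/titles from a referenced .tabular page when
-- the local dataset only carries a "headersRef" pointer.
local function resolveHeaders(dataset)
    if dataset.headersRef and not dataset.headers then
        local shared = mw.data.getData(dataset.headersRef)  -- proposed API
        dataset.headers = shared.headers
        dataset.titles  = dataset.titles or shared.titles
    end
    return dataset
end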

@Yurik I like that -- maybe generalize it as a metadata inheritance model; anything not filled out in the local JSON is taken from the referenced .tabular item.

Side note -- the referenced source Data: page should maybe get recorded as a template link in the link tables? Or a file link at least. Some kind of reference. :)

Name game: "inheritFrom", "deriveFrom", "ref", "link", "metadata", ...?
Still an open question: where to get the license IDs and their descriptions :)

I'm a fan of "inheritMetadata" :)

Yurik added a comment.Apr 23 2016, 4:51 PM

I found MediaWiki:Licenses - I am not sure what it is used for. There are translations in the subpages. Also, there is a list in preferences, and I don't know what code controls it. I don't want to create a separate licensing system for this, so I'm looking for ideas.

I found MediaWiki:Licenses - I am not sure what it is used for.

They're used for the licensing dropdown at Special:Upload (try loading the page with JS disabled, otherwise Commons-specific scripts will muddle it up).

Also, there is a list in preferences, and I don't know what code controls it.

This one is from UploadWizard (the whole preferences tab is specific to UW). It's controlled by $wgUploadWizardConfig['licensing'] (https://github.com/wikimedia/mediawiki-extensions-UploadWizard/blob/master/UploadWizard.config.php#L428).

I'm not sure if either of these is worth reusing. They were both created to match the already existing license templates at Commons. If you want machine-readable identifiers, try the SPDX ones (https://spdx.org/licenses/).

Yurik added a comment.Apr 25 2016, 4:48 PM

JsonConfig can easily allow us to store all allowed licenses as a "config" page - a JSON page with a custom schema for licenses. This way we could have a wiki page named Config:Licenses.json (bikeshedding is welcome):

{
  // License ID
  "CC0-1.0": {
    
    // Localized official name of the license. Do we need some 'short name'?
    "name": {
      "en": "Creative Commons Zero v1.0 Universal",
      ...
    },

    // Link to more information. Do we need multiple urls, e.g. FAQ vs Legal Code?
    "url": {
      "en": "http://creativecommons.org/publicdomain/zero/1.0/"
    },

    // https://commons.wikimedia.org/wiki/File:Cc-zero.svg
    "icon": "Cc-zero.svg",

    // https://commons.wikimedia.org/wiki/File:CC0_button.svg
    "button": "CC0_button.svg",

    // If set, this license will be shown at the top, before all other licenses.
    // The license with the lowest number will be shown as the default choice.
    "preferred": 1
  },
  ...
}
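To make the intent concrete, here is a hypothetical snippet that takes such a Config:Licenses.json structure (already decoded into a Lua table) and picks the default license -- nothing here is an existing API, it is only an illustration of how the schema above could be consumed.

-- Choose the default license from a decoded Licenses config.
-- "preferred" follows the schema sketch above: the lowest number wins.
local function defaultLicense(licenses)
    local bestId, bestRank = nil, math.huge
    for id, info in pairs(licenses) do
        if info.preferred and info.preferred < bestRank then
            bestId, bestRank = id, info.preferred
        end
    end
    return bestId  -- e.g. "CC0-1.0" for the example above
end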
Yurik added a comment.Apr 25 2016, 5:01 PM

Another alternative is to actually reuse .tabular for storing this data, instead of creating a custom format. Also, .tabular could benefit from an "ordered" meta tag to automatically re-sort values on save. No sorting is allowed on fields of the "localized" data type.

"ordered": "id"
// or
"ordered": ["preferred", "id"]
Nuria moved this task from Operational Excellence Future to Radar on the Analytics board.

I removed T124569 because with T134426 this task could be marked as done (unless we want to keep it open for non-tabular JSON/XML data).

Yurik moved this task from Backlog to Monitoring on the Commons-Datasets board.

@Yurik, would you be available to discuss this at the TechCom-RFC meeting on IRC this week? E213 has the details (date and time: Wednesday 2016-06-15, 21:00 UTC (2pm PDT, 23:00 CEST)).

Yep, sounds good.

Yurik renamed this task from Allow tabular datasets on Commons (or some similar central repository) (CSV, TSV, JSON, XML) to Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...).Jun 21 2016, 8:29 PM

@Yurik and all, I'm glad to see all this work going on. I was pointed to this after I made a comment on a Wikidata property proposal that I thought would be best addressed by somehow allowing a tabular data value rather than a single value. However, I'm wondering if this might be best driven by specific problem cases rather than trying to tackle generic "data" records.

One of the most common needs is for time-series data: population of a city vs. time, for instance, economic data by point in time, physical data like temperature vs. time, etc. The simplest extension beyond the single value allowed by Wikidata would be to allow a set of pairs defined by two Wikidata properties (e.g. P585 - "point in time", P1082 - "population"). The relation to Wikidata takes care of localization (those properties have labels in many different languages) and defines the value types (time and quantity in this case), and the dataset would somehow be a statement attached to a Wikidata item (e.g. a particular city), so that the item and the pair of properties fully define the meaning of the collection of pairs. The underlying structure of the pairs doesn't really matter much. But there seems to be something missing here - I think it might be best addressed in Wikidata itself...

Yurik added a subscriber: daniel.Sep 23 2016, 10:40 PM

@ArthurPSmith thanks, time series do represent a large portion of the common data needs. I have discussed reusing some of the Wikibase type-system code with @daniel, but apparently their code is fairly Wikibase-centric, especially in the front end, and it might be difficult to reuse. I certainly hope that at some point we will be able to do that. Until then, I am hoping to launch a generic tabular service with a limited subset of types, and later move on to integrating more of Wikibase's work into it.

Hmm, if that goes live where will the repository be?

Yurik added a comment.Nov 30 2016, 5:34 AM

@JEumerus sorry, I did not see your response. The storage repository will be on Commons as well, accessible from all other wikis. See the community discussion here.

MarkTraceur moved this task from Untriaged to Tracking on the Multimedia board.Dec 5 2016, 9:59 PM
Yurik closed this task as Resolved.Dec 9 2016, 12:34 AM

Enabled in https://gerrit.wikimedia.org/r/#/c/326046/ - T148745

I think we should create new tasks for anything that is left to do. Otherwise this task is not actionable - it's just a wish list of many things, which could just as well be individual actionable tasks in the Commons-Datasets.