
Review shared data namespace (tabular data) implementation
Closed, Resolved, Public

Description

Please review implementation details of the cross-wiki sharable data namespace implementation for tabular data:

Event Timeline

Surely this doesn't need all these tags and all these people. Please pay attention when creating subtasks :(

My apologies to anyone who is not interested in the implementation review of this feature - please unsubscribe. I am hoping to deploy this fairly soon, as there is clearly a huge demand for this functionality, so any feedback would be great. Thanks!

Please use class="wikitable" as default formatting.

Please use .tab extension instead of .tabular (more language independent).

I'm missing the way how to simply copy or paste in CSV/TSV.

T120452: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...) actually mentions it should be uploadable...
There should also be a way to export it, preferably in a selected format.

Please use class="wikitable" as default formatting.

Already done

Please use .tab extension instead of .tabular (more language independent).

Agree, .tab might sound better. As long as it's not .tsv, which has a very specific meaning.

I'm missing the way how to simply paste in CSV/TSV.

That's a UI todo - we need a much better editor that supports tsv/csv import/export, to be able to edit using a spreadsheet-like interface, and also to be able to edit directly from another wiki, without visiting commons (just like you can add an interwiki link without going to Wikidata). A download link for csv/tsv is also an option.
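As a rough illustration of the import/export idea (not the planned editor, just a sketch), the following Python assumes the "columns"/"rows" layout discussed in this task. The function names `delimited_to_table` and `table_to_delimited` are hypothetical, and every imported cell stays a string because no type information is available in CSV/TSV.

```python
import csv
import io


def delimited_to_table(text, delimiter="\t"):
    """Parse CSV/TSV text into {"columns": [...], "rows": [[...], ...]}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    records = [row for row in reader if row]  # skip blank lines
    if not records:
        return {"columns": [], "rows": []}
    # The first record is treated as the header (an assumption of this sketch).
    return {"columns": records[0], "rows": records[1:]}


def table_to_delimited(table, delimiter="\t"):
    """Serialize the same structure back to CSV/TSV text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(table["columns"])
    writer.writerows(table["rows"])
    return out.getvalue()


tsv = "label\tvalue\npeaches\t100\nplums\t32\n"
table = delimited_to_table(tsv)
round_trip = table_to_delimited(table)
```

Passing `delimiter=","` would handle CSV the same way; converting cells to numbers or booleans would need the column types, which this sketch leaves out.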

Please use class="wikitable" as default formatting.

Already done

I was pretty sure I demanded that already at the hackathon ;-), but the demo page still shows atypical formatting... (Update?)

I'm missing the way how to simply paste in CSV/TSV.

That's a UI todo - we need a much better editor that supports tsv/csv import/export, to be able to edit using a spreadsheet-like interface, and also to be able to edit directly from another wiki, without visiting commons (just like you can add an interwiki link without going to Wikidata). A download link for csv/tsv is also an option.

Is this (and other stuff) going to be tracked under JsonConfig or separate (new) project? (= Under which would you like people to report tasks?)

Ah, sorry, that's for transclusions that will only work on the same wiki. So are you proposing the whole table use wikitable style without any additional styles for the cells? or green cells are ok?

I will create a new project with all related tasks. Waiting for my approval to be able to create new projects myself - I seem to be doing it too often. Will post it here once it's created.

Ah, sorry, that's for transclusions that will only work on the same wiki. So are you proposing the whole table use wikitable style without any additional styles for the cells? or green cells are ok?

Use class="wikitable sortable datatable" (choose other name if you dislike datatable).
It will add built-in wikitable formatting, sortability and class to be used for overriding wikitable settings. You may add some default styling to "datatable", i.e. monospaced font for table cells, however, I would discourage the green background.
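To make the suggestion concrete, here is a minimal Python sketch of rendering a table with those classes. It is only an illustration: `render_table` is a hypothetical helper, not the extension's actual renderer, and the "datatable" class name is the placeholder proposed above.

```python
from html import escape


def render_table(columns, rows, css_class="wikitable sortable datatable"):
    """Render column names and rows as a single HTML table string."""
    head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in rows
    )
    # Cell values are escaped so that data cannot inject markup.
    return f'<table class="{escape(css_class)}"><tr>{head}</tr>{body}</table>'


markup = render_table(["label", "value"], [["peaches", 100], ["a<b", 32]])
```

Any extra default styling (for example a monospaced font) would then hang off the "datatable" class in CSS rather than inline cell styles.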

This is a really cool idea, Yuri! Done correctly, this has the potential for establishing a de facto standard for how one represents tabular data in JSON.

I'm assuming we're not the first people in the world that have thought of representing tabular data as JSON. Just based on my very quick set of searches, here's what I found:

That's not to say anything qualitative about those alternatives relative to the format you propose; just my first impression of "stuff I should read" based on a quick skim of this. Could you tell me more about your research?

The good news (for this) is that ".tab" doesn't have a mediatype associated with it in the Apache mime.types file (which is always the first place I look when I'm creating a new file extension). The bad news is that the enwiki list of file formats does show that ".tab" is currently associated with MapInfo files. I don't think there's any widely-recognized, open, official registry for file extensions (is there?), and your proposal seems to have a more intuitive claim to ".tab".

It would be neat to enlist the people who are writing specs in this area to help define the spec for our implementation. Is that something you can do, Yuri?

@RobLa-WMF, thanks for putting all the links together. I have seen some of them, and will email the authors.

Row objects vs arrays

I feel that other formats are optimized towards parsing and processing, rather than storage and ease of use. Repeating column names for each row seems overly verbose, and, in case of raw-text editing, makes it less readable and harder not to make mistakes. On the other hand, their system benefits sparse tables - missing values may simply be omitted, whereas in my proposal one would have to use "null". Also, for some use cases like graphs, parsing would require an extra step - converting arrays into objects on the client (could be done with auto-generated code for higher efficiency).

"rows": [
  {"column1":"foo", "column2":42, "optColumn3": true},   // rows as objects
  {"column1":"bar", "column2":13}
]

"columns": ["column1", "column2", "optColumn3"],
"rows": [
  ["foo", 42, true],   // rows as arrays
  ["bar", 13, null]
]
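The extra client-side step mentioned above (turning row arrays into row objects) is small. A minimal Python sketch, assuming the "columns"/"rows" layout from the second snippet; `rows_to_objects` is a hypothetical name, and dropping null cells is one possible choice for sparse rows, not something the proposal specifies.

```python
def rows_to_objects(columns, rows):
    """Pair each row array with the column names, omitting null cells."""
    objects = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} cells, got {len(row)}")
        objects.append(
            {name: cell for name, cell in zip(columns, row) if cell is not None}
        )
    return objects


columns = ["column1", "column2", "optColumn3"]
rows = [["foo", 42, True], ["bar", 13, None]]
objects = rows_to_objects(columns, rows)
# objects == [{"column1": "foo", "column2": 42, "optColumn3": True},
#             {"column1": "bar", "column2": 13}]
```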
Target audience

Other formats, especially W3C, seem to only target machine-generation/consumption, and thus become extremely verbose and complex. They only need very few developers to fully understand and implement them. While I hope that we will quickly introduce good data editing tools, I think our format needs to be widely understood on its own by the same community that uses wiki markup, so that it can be easily consumed by developers of Lua modules and Graphs.

Data types, constraints, ...

I do not have very strong feelings either way about how to declare data types. I foresee the basic data types like number, string, bool, plus some more exotic ones like multi-lingual string (one string per language). Hard-to-define types like Wikidata ID, datetime, and URL could be stored as a string until we can reuse Wikidata's type system. Another needed feature is to reuse metadata from another page, thus allowing multiple pages to have the same structure.
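To illustrate what checking cells against such basic types could look like, here is a hedged Python sketch. The type names ("number", "string", "boolean", "localized") follow the ones used in this thread, but their exact spelling, the rule that null means "missing", and the helper names are all assumptions of this example.

```python
def check_cell(value, type_name):
    """Return True if value fits the declared column type (None = missing)."""
    if value is None:
        return True
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "localized":
        # A non-empty mapping of language code to string.
        return (
            isinstance(value, dict)
            and len(value) > 0
            and all(isinstance(k, str) and isinstance(v, str)
                    for k, v in value.items())
        )
    raise ValueError(f"unknown type: {type_name}")


def validate_rows(types, rows):
    """Return (row_index, column_index) pairs for cells that fail the check."""
    errors = []
    for r, row in enumerate(rows):
        for c, (value, type_name) in enumerate(zip(row, types)):
            if not check_cell(value, type_name):
                errors.append((r, c))
    return errors


types = ["string", "number", "boolean", "localized"]
rows = [
    ["peaches", 100, True, {"en": "in english"}],
    ["plums", "32", False, None],  # "32" is a string, not a number
]
errors = validate_rows(types, rows)
```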

RobLa-WMF mentioned this in Unknown Object (Event). Jun 6 2016, 6:46 AM

Another needed feature is to reuse metadata from another page, thus allowing multiple pages to have the same structure.

We should brainstorm this external schema idea, set up those weekly meetings I mentioned :)

@Milimetric, let's do it online - most interested parties are all over the globe, so it's hard to pick a time.

works for me, brainstorm "meetings" could be new tasks in Commons-Datasets?

Hi, I'm one of the authors of JSON Table Schema and of the Data Package family of specifications (I was also an author on the W3C spec which was originally based on Data Packages but diverged in complexity substantially ...).

I've responded to @Yurik's question that he posted on the Data Protocols issue tracker:

https://github.com/dataprotocols/dataprotocols/issues/265

To echo what I said there ...


Great to hear about this. I would encourage taking a good look at:

JSON Table Schema: http://dataprotocols.org/json-table-schema/

It's very simple and already does exactly what you want :-)

I'd definitely recommend adopting that model for describing headers and types rather than rolling your own if you could.

In addition for the overall structure you could adopt the "resource" part of a (Tabular) Data Package model

What this would look like

Here's your example from your proposal redone using this approach.

As you can see it's just as simple and if anything a bit more expressive -- plus you can leverage all the work that has already been done developing these (and the tooling)!

Note: the data could be a simple array of arrays rather than an array of objects if that were preferred (it is more concise but is a bit less "json-ic"). I've done a 2nd example showing that ...

{
    "title": "Some good fruits for you",
    "title@es": "Algunas buenas frutas para ti",
    "schema": {
       "fields": [
          {
            "name": "label",
            "type": "string"
          },
          {
            "name": "value",
            "type":  "number"
          },
          {
            "name": "stored",
            "type": "boolean"
          },
          {
            "name": "localName",
            // would suggest changing this type to "object"
            "type": "localized"
          }
      ]
    },
    "data": [
      [
        {
          "label": "peaches",
          "value": 100,
          "stored": true,
          "localName": {
            "en": "in english",
            "es": "esto puede estar en español",
            "fr": "this could be in french"
          }
        }
      ],
      [
        {
          "label": "plums",
          "value": 32,
          "stored": false,
          "localName": {
            "en": "in english",
            "es": "esto también está en español",
            "fr": "this is also in french",
            "gr": "this could be in greek"
          }
        }
      ],
        ...
    ]
}

Example with row data as arrays rather than objects

{
    "title": "Some good fruits for you",
    "title@es": "Algunas buenas frutas para ti",
    "schema": {
       "fields": [
          {
            "name": "label",
            "type": "string"
          },
          {
            "name": "value",
            "type":  "number"
          },
          {
            "name": "stored",
            "type": "boolean"
          },
          {
            "name": "localName",
            "type": "localized"
          }
      ]
    },
    "data": [
      [
        [
            "peaches",
            100,
            true,
            {
                "en": "in english",
                "es": "esto puede estar en español",
                "fr": "this could be in french"
            }
        ],
        [
            "plums",
            32,
            false,
            {
                "en": "in english",
                "es": "esto también está en español",
                "fr": "this is also in french",
                "gr": "this could be in greek"
            }
        ],
        ...
      ]
    ]
}
  • I keep wondering if we can use Wikidata/Wikibase more for this. Wikidata IDs for the license might be much more convenient for computer consumption, even though "CC BY 2.5" is by far more readable than Q18810333, but it allows us not to maintain a separate database of licenses, or deal with translations.
  • Wikibase may be a good future metadata store, once it gets deployed on Commons. It will simplify localized storage of title/description, plus other arbitrary metadata/tags/categories/...
  • Until Wikibase is ready, we should support the absolute minimum - title (localized), description (localized), source (string). Source will be a freetext notes area for now, because we are not yet sure what's needed, and it will have to be reworked later. We could even name it notes so as to highlight this fact.
  • Do we want to support multiple datasets, possibly with different schemas, per page? I can certainly think of use cases, but would this overcomplicate things? I guess we could say that if a page contains a "resources" key, it's a data package, and otherwise it's a simple one-table-per-page page. Resource support won't be implemented at first.
  • Schema - calling them name and type is fine, but we should also have a localized label:
"schema": {
  "fields": [ {
      "name": "fieldId",
      "type": "string",
      "label": { "en": "Field in English", "es": "Field in Spanish", ... }
    }, ...
]}

For external schema, we can reference a different page. .tabschema format is TBD (@rufuspollock ?)

"schema": "Data:Shared Schema.tabschema"

@rufuspollock thanks for your feedback! I did not realize your schema supported the array of arrays format. Btw, did you make a typo in "data" - being array of array of array instead of array of array? (both examples). What do you think about referencing Wikidata? I am not sure I like the "title@es" format. It implies that there is one "main" title, and other titles that are less so. Considering that this data could be just for some small language wiki, having a mandatory "title" loses its meaning. Having a "localized" type with a dictionary allows any languages with an automatic fallback into whatever the reader may understand.
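The fallback behaviour of a "localized" dictionary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `pick_localized` is a hypothetical helper, and the order used here (reader's preferred languages, then a default language, then any available language) is a simplification, not MediaWiki's real language fallback chain.

```python
def pick_localized(value, preferred, default_lang="en"):
    """Pick a string from a {lang: text} dict using a simple fallback chain."""
    for lang in list(preferred) + [default_lang]:
        if lang in value:
            return value[lang]
    # Last resort: any available language, chosen deterministically.
    return value[min(value)] if value else None


local_name = {
    "en": "in english",
    "es": "esto puede estar en español",
    "fr": "this could be in french",
}
for_french_reader = pick_localized(local_name, ["de", "fr"])
for_german_reader = pick_localized(local_name, ["de"])
```

With this shape no language is the "main" one: a page that only carries, say, an "es" entry still yields a usable string for every reader.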

CC @Lydia_Pintscher

Hi,

I'm also working on the JSON Table Schema specification (and supporting tooling). It seems quite well suited here, and very happy to see how we can use it in this context.

I'm a little lost here. Is the idea that only data that can be structured as rows and columns ("fit" into a table) will be supported? Will nested key/value pairs be supported as the contents of an individual table cell?

I'm not sure a .tab file extension is needed. I actually thought T120452: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...) was about storing XML, CSV, TSV, JSON, etc. in wiki pages, but looking at http://data.wmflabs.org/w/index.php?title=Data:Sample.tab&action=edit I'm a lot less sure now.

The discussion about data types and constraints in this task makes me worry that we're slowly inventing yet another database engine when we already have options such as SQLite.

I'm a big advocate for using wiki pages for many things. I think ContentHandler is a big step forward. But using wiki pages as mini databases still seems more novel and cute than practical and sustainable. A few years ago, Wiktionary was quick to adopt Scribunto/Lua in order to create mini-databases of large dictionaries/arrays in wiki pages.

What are the storage considerations/implications for Wikimedia wikis here? Every time an edit is made to a table cell, we'd then be saving a full copy of the page? Will users download and manipulate up to 2 MB of text in a textarea, or even heavier, an enhanced textarea featuring syntax highlighting?

There are very valid and important reasons that we have pagination, offsets, and limits with data sets. Will these three features be supported with wiki pages?

There's also a real concern that we'll be immediately setting ourselves up for medium-term future problems (e.g., storing more than 2 MB) as we scale up and expand this type of wiki page-based data storage implementation.

I'm a little lost here. Is the idea that only data that can be structured as rows and columns ("fit" into a table) will be supported? Will nested key/value pairs be supported as the contents of an individual table cell?

@MZMcBride, while a generic JSON content handler could support nested data structures, the whole idea behind the tabular content handler is to provide a simple tabular format with each cell being a single value, with the exception of multi-lingual string values, which store key (lang code) => string objects. This should cover the vast majority of the use cases -- in-article tables, lists, data for graphs.

I'm not sure a .tab file extension is needed. I actually thought T120452: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...) was about storing XML, CSV, TSV, JSON, etc. in wiki pages, but looking at http://data.wmflabs.org/w/index.php?title=Data:Sample.tab&action=edit I'm a lot less sure now.

While we don't have to use .tab, it would help because we won't need to create a new namespace for each new data type, reusing the Data namespace instead. Namespace proliferation has been a constant complaint by many users. Also, I do not want to support multiple storage types, especially the notoriously bad CSV/TSV. Better to provide easy import/export functionality for them. Another type that has already been implemented and is undergoing some discussion is storing .geojson - map overlays. We could eventually introduce .json, but we have to be very clear what use cases it will solve.

The discussion about data types and constraints in this task makes me worry that we're slowly inventing yet another database engine when we already have options such as SQLite.

While it would be awesome to provide a large custom database support, the current proposal is limited to small tables, such as replacing the lists and tables we already have in many articles with a cross-wiki sharable, structured, Lua and graph accessible system. Which means it will not have any SQL-like functionality such as sorting/filtering via API, but rather allow Lua modules or Graph extension to read the table as a whole and process it as needed.
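In other words, filtering and sorting happen in the consumer after the whole table has been loaded. A minimal Python sketch of that model (a Lua module would do the equivalent), assuming the "columns"/"rows" layout from earlier in this task; `select_rows` is a hypothetical helper.

```python
def select_rows(table, predicate, sort_by=None, reverse=False):
    """Filter rows in memory, then optionally sort them by one column."""
    columns = table["columns"]
    index = {name: i for i, name in enumerate(columns)}
    # The predicate receives each row as a {column: value} dict.
    rows = [row for row in table["rows"] if predicate(dict(zip(columns, row)))]
    if sort_by is not None:
        rows.sort(key=lambda row: row[index[sort_by]], reverse=reverse)
    return rows


table = {
    "columns": ["label", "value", "stored"],
    "rows": [["peaches", 100, True], ["plums", 32, False]],
}
by_value = select_rows(table, lambda r: True, sort_by="value")
only_stored = select_rows(table, lambda r: r["stored"])
```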

What are the storage considerations/implications for Wikimedia wikis here? Every time an edit is made to a table cell, we'd then be saving a full copy of the page? Will users download and manipulate up to 2 MB of text in a textarea, or even heavier, an enhanced textarea featuring syntax highlighting?

Currently, when users update a table in an article (even one cell), a full save is made. For now, this feature will follow the same model of multiple edits + one save action. Much further in the future we may introduce a more powerful backend (sqlite/...) that would handle per-cell edits, but clearly this would be far more involved. I do not think there will be such a massive increase in data pages as compared to the regular wiki articles, especially since data will be usable by multiple wikis.

There are very valid and important reasons that we have pagination, offsets, and limits with data sets. Will these three features be supported with wiki pages?

No. Small datasets (< 2 MB) only.
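A size guard of this kind could be sketched as follows. The assumptions are loud ones: "2 MB" is read as 2 * 1024 * 1024 bytes, the size is measured on a compact UTF-8 JSON serialization, and the helper names are hypothetical; the actual limit check in MediaWiki may measure something different.

```python
import json

MAX_BYTES = 2 * 1024 * 1024  # assumed meaning of "2 MB"


def serialized_size(table):
    """Size in bytes of the compact UTF-8 JSON form of the table."""
    text = json.dumps(table, ensure_ascii=False, separators=(",", ":"))
    return len(text.encode("utf-8"))


def fits_in_page(table, limit=MAX_BYTES):
    """Return True if the serialized table is within the size limit."""
    return serialized_size(table) <= limit


small = {"columns": ["label"], "rows": [["español"]]}
too_big = {"columns": ["c"], "rows": [["x" * MAX_BYTES]]}
```

Counting bytes rather than characters matters for localized strings, since non-ASCII text takes more than one byte per character in UTF-8.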

There's also a real concern that we'll be immediately setting ourselves up for medium-term future problems (e.g., storing more than 2 MB) as we scale up and expand this type of wiki page-based data storage implementation.

Sure, but larger data sets are a very different problem to have. So far, all lists and tables have been stored as wiki pages, for which 2 MB was enough. Some day I hope we can support arbitrary external data, where users would set up community-curated external URLs, and we would automatically create a data mirror and expose it to the world.

Getting ready for deployment: here's tabular data example on beta cluster, that also supports localization, shared data, and can be used directly from the graphs or from Lua scripts on any wiki. Note that the graph itself is in English wiki (labs), but data comes from Commons. Feel free to add translations. English, Russian

Related: Localizable maps data (GeoJSON), stored on Commons, and usable from multiple wikis - https://en.wikipedia.beta.wmflabs.org/wiki/Maplink-page