
Add row/cell annotations to tabular data
Open, Needs Triage, Public

Description

Tabular data on Commons supports marking the source of the data, but only as a single source for the entire table. This does not match how data tends to be sourced in real life: often every row, column or cell of a table has different references. (Examples: per-row, per-column, per-cell.) Although much rarer, other types of annotations (plain textual notes rather than sources, e.g. comments on methodology) are also sometimes used, and these too can apply to a row, column or cell.

If tabular data is ever to serve as the primary way of storing data (as opposed to an intermediary for data taken from external sources or parsed from a wiki table), it needs to be able to support cell-level references and notes. Ideally column/row level too, although one could hack around that by providing an extra column or row where the cells contain nothing but sources.

Event Timeline

At this point additional annotations can only be done as extra columns. This would work for many cases, but probably not all. Could you give some examples of where columns won't be enough and a dedicated annotation system would be required?

The task description does have examples. But also, how would the extra column be rendered by a dedicated client (either an internal tabular data editing UI or something external that tries to display or convert it)? There is no "source" datatype, so it would have to be defined as plaintext but actually contain wikitext, or something ugly like that. Sources should be their own type, with their own data structure.

Tgr: Agreed on the importance of this. There should be a canonical space for annotations, which could hold a source or anything else (imagine the entire output of your favorite {{cite}} or {{footnote}} template) -- just as though every cell had a pre-generated footnote for all such details. There is already a common norm of providing an extra source column (or, more rarely, row) -- which would still be useful where a single source covers the whole row or column and does not need to be repeated for each cell.

Then the composite annotation for a cell could be (row note + column note + cell note). Cc @Thadguidry who had related thoughts on the pros and cons of storing this data as extra columns.
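The composition described above could be sketched roughly as follows. This is only an illustration: the `AnnotatedTable` class, its field names, and the sample notes are hypothetical and do not correspond to any existing Commons data structure.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedTable:
    """Hypothetical container for notes at row, column, and cell level."""

    # Notes keyed by row index, column name, or (row index, column name).
    row_notes: dict = field(default_factory=dict)
    column_notes: dict = field(default_factory=dict)
    cell_notes: dict = field(default_factory=dict)

    def composite(self, row: int, column: str) -> list:
        """Return row notes, then column notes, then cell notes for one cell."""
        return (
            self.row_notes.get(row, [])
            + self.column_notes.get(column, [])
            + self.cell_notes.get((row, column), [])
        )


# Made-up sample data, for illustration only.
table = AnnotatedTable(
    row_notes={0: ["Row source: example census"]},
    column_notes={"population": ["Rounded to the nearest thousand"]},
    cell_notes={(0, "population"): ["Revised estimate"]},
)

print(table.composite(0, "population"))
```

A cell with no notes at any level simply gets an empty list, so tables that only use a source column or row need no cell-level entries.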

@Tgr I strongly oppose storing wiki markup inside columns because it makes the system far less portable and less stable. Wiki markup only works in the context of a specific wiki, and elsewhere would either render differently or simply break -- templates, localization settings, and modules are wiki-specific.

I think the best course of action is to introduce a new column type, with the same structure as what Wikidata has for references. This way the designer of each table can decide 1) whether such a source column is actually needed, and 2) whether there should be just one or multiple source columns. What's more, I think the top-level source field should have the same format for consistency -- this lets a source be given either 1) per table, 2) per row, or 3) per column (just for the relevant columns).
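As a rough sketch of what such a column type might look like: the `"reference"` type name, the `make_reference` helper, and the shape of the reference value are all assumptions made for illustration. Only the property IDs P854 (reference URL) and P813 (retrieved) are taken from Wikidata, and the table layout is a simplified stand-in rather than the exact Commons format.

```python
import json


def make_reference(url: str, retrieved: str) -> dict:
    """Hypothetical reference value keyed by Wikidata property IDs."""
    return {"P854": url, "P813": retrieved}  # reference URL, retrieved date


table = {
    "schema": {
        "fields": [
            {"name": "country", "type": "string"},
            {"name": "population", "type": "number"},
            # "reference" is NOT an existing column type; it is the proposal.
            {"name": "population_source", "type": "reference"},
        ]
    },
    "data": [
        ["Exampleland", 1200000,
         make_reference("https://example.org/census", "2017-01-01")],
        ["Samplestan", 3400000, None],  # no source recorded for this row
    ],
}


def reference_columns(t: dict) -> list:
    """Names of all columns declared with the hypothetical reference type."""
    return [f["name"] for f in t["schema"]["fields"] if f["type"] == "reference"]


print(reference_columns(table))
print(json.dumps(table["data"][0][2]))
```

Because the references are structured data rather than wikitext, a client can find and render them without depending on any wiki's templates.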

Hi all - My personal opinion and those of a few other experts would be to embrace DRY (Don't Repeat Yourself - or others) and simply allow introduction of W3C standards for Tabular Data:

  1. CSV on the Web is such a standard and discusses in its sections 1.1 through 2.4 exactly how you would store the information in files that can be uploaded to Commons. (2.4 in particular talks about cell metadata) It has metadata support at Table, Column, and Cell levels with any vocabulary including using Wikidata Entity URIs (which actually was a use case during the design of CSV on the Web, along with Schema.org vocab and others).
  2. Tools for editing, validating, and displaying can then surround this standard that is fully embraced by the Open Data Institute and other organizations already. This means we don't break many existing workflows around portable Tabular Data and allow sharing in a popular format while also being able to understand the data easily, including its support for different languages!
  3. As for internal storage (an import or ingress of CSV files and metadata files to support editing, validating, and display), I think that should be left up to Wikidata engineers, since different internal storage mechanisms involve performance tradeoffs once you introduce querying of tabular data. Storage should support storing files in the CSV on the Web standard: 1 data file, 1 metadata file.
  4. Non-Goals:
    • Transforming CSV and metadata files into other formats. I think other tools support this easily enough for most users' purposes.
    • Querying data. From what I have seen so far, I don't think Commons has querying support itself, and that's fine if so. Lots of other tools (local and online) allow sourcing data from files at a URL to serve all users' querying needs in their various domains.
  5. Future-Goals:
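To make the "1 data file, 1 metadata file" idea concrete, here is a minimal sketch that builds a CSV on the Web style metadata document in Python. The file name, column names, and URLs are invented. Using `dc:source` on the table and on a column, and an RFC 7111 style `#cell=row,column` fragment as a note target, is one plausible reading of the standard rather than a verified recipe, and the exact shape of the `notes` entries is an assumption.

```python
import json

# Hypothetical metadata for a data file named population.csv.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "population.csv",
    # Table-level source (assumed common property).
    "dc:source": "https://example.org/statistics-office",
    "tableSchema": {
        "columns": [
            {"name": "country", "titles": "Country", "datatype": "string"},
            {
                "name": "population",
                "titles": "Population",
                "datatype": "integer",
                # Column-level source (assumed common property).
                "dc:source": "https://example.org/census",
            },
        ]
    },
    # Cell-level note; the target/body shape is an assumption.
    "notes": [
        {"target": "population.csv#cell=2,2", "body": "Revised estimate"}
    ],
}

metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

The CSV file itself stays a plain CSV; all sourcing and notes live in the companion metadata document.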

Do you need me to comment on UI display approaches for CSV on the Web metadata?

  • I think for cell metadata, a simple (i) info circle button in the top right of any cell that has metadata can work very well visually. This would open a popup box, much as reference citations are handled now, but would show all fields of the metadata annotated for that cell, such as any Web Annotations. But if you want, you certainly could display the metadata fields as extra columns (with a different style or colored heading, etc.) next to the target cell column. Either approach works visually, and the 2 cases that often come up with metadata for any system are 1. a casual browsing need (display as info boxes or popup dialogs), or 2. a more analytical need of the metadata (display as optional extra columns).
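The second display mode (metadata as optional extra columns) could be sketched like this. The function name `with_metadata_column`, the `(note)` heading suffix, and the representation of cell metadata as a dict keyed by (row index, column name) are all hypothetical choices for illustration.

```python
def with_metadata_column(header, rows, cell_meta, column):
    """Return a copy of the table with a note column inserted after `column`.

    cell_meta maps (row index, column name) to a note string; cells
    without metadata get an empty string in the new column.
    """
    idx = header.index(column)
    new_header = header[: idx + 1] + [f"{column} (note)"] + header[idx + 1:]
    new_rows = []
    for r, row in enumerate(rows):
        note = cell_meta.get((r, column), "")
        new_rows.append(row[: idx + 1] + [note] + row[idx + 1:])
    return new_header, new_rows


# Made-up sample data.
header = ["country", "population", "area"]
rows = [["Exampleland", 1200000, 500], ["Samplestan", 3400000, 900]]
cell_meta = {(0, "population"): "Revised estimate"}

new_header, new_rows = with_metadata_column(header, rows, cell_meta, "population")
print(new_header)
print(new_rows)
```

The original table is left untouched, so the same data can back both the popup view and the expanded analytical view.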

Here are some UI tools that were developed during the course of CSV on the Web development: https://github.com/ODInfoBiz

CSV on the Web and related standards are actually really good and well thought out. The standard is not the problem; the problem is the lack of implementations that use the standard and of awareness of the richness it affords. It's actually a lightweight standard, just as CSV itself is, which was the intent: to not get too much in the way of data publishers.