
Design a file format to represent Wikibase edits
Open, Needs Triage, Public

Description

In short: I would like to bring tool writers together around a table to design a file format for representing candidate Wikibase edits. QuickStatements has been the de facto standard for this for some time, but it has a few problems:

  • due to its tabular nature, it is not easy to extend with new features. For instance, there is interest in defining various strategies for merging added statements with existing ones, or for setting the ranks of statements.
  • again due to its tabular nature, it is not easy to parse or validate
  • the format is defined by the maintainers of the QuickStatements tool, whereas such a format could be used as a common language by many different tools. So it deserves coordination between the various stakeholders, a precise specification and tooling around it.

This format would be based on a richer encoding than CSV/TSV (such as JSON), and so it would not be designed to be easily produced by users with spreadsheet software or a text editor, but rather meant to be a communication medium between tools. Down the line, perhaps even Wikibase itself could accept this format to ingest edits?

Context: in the OpenRefine project we are planning to add support for editing structured metadata on Wikimedia Commons. As part of that we are considering representing the upload of a collection of media files and their structured metadata as one big archive (which contains both the media files and the metadata). This big archive could then be sent to a Toolforge tool which would perform the upload to Commons (just like QuickStatements but with support for uploading new media files and setting the corresponding wikitext for each file as well).

Event Timeline

I thought there was already a standard around some "diff" format, like the one DoubleCheck uses between MediaWiki revision table rev_ids? I recall using WikiLoop DoubleCheck, which has an interesting interface that exposes a portion of an edit for judgement and rollback.
It probably makes sense to pull someone from their team, or others, into this conversation as well, to explore ideas on how to display merge conflict resolution and which formats lend themselves well to that.

Representing the difference between two entity states is related, but not quite the same thing. In a file format that represents candidate edits to be performed, you should not need to know about the current state of the items you are about to edit: you only provide the data you want to add / remove and some parameters to control how this data will be merged with existing data on the item.
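To make that distinction concrete, here is a minimal sketch of what one candidate edit could look like. The field names ("entity", "add", "remove", "merge_strategy", "rank") and the strategy value are assumptions made up for illustration, not part of any agreed specification. The point is that the record carries only the data to add or remove plus merge parameters, and nothing about the item's current state.

```python
import json

# Hypothetical candidate edit. All field names and the "skip_if_present"
# strategy are illustrative assumptions, not an existing format.
candidate_edit = {
    "entity": "Q4115189",  # the Wikidata sandbox item
    "add": [
        {
            "property": "P31",
            "value": "Q5",
            "merge_strategy": "skip_if_present",  # how to reconcile with existing data
            "rank": "normal",
        }
    ],
    "remove": [
        {"property": "P21", "value": "Q6581097"}
    ],
}

# The record is plain JSON, so it survives a serialization round trip
# without any knowledge of the entity's current revision.
serialized = json.dumps(candidate_edit, sort_keys=True)
restored = json.loads(serialized)
```

The tool that applies the edit would be the only component that needs to fetch the current entity and decide, based on the merge parameters, what to actually write.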

For inspiration about 'what has been done before': @Husky once developed QSML https://hay.toolforge.org/qsml/ - not fully what we want to do here, but it goes in that direction.

When talking with tool writers, it may be really good to at least inform / consult with @Magnus and @LucasWerkmeister as QuickStatements contributors, and with @Yarl because of similarities with the Pattypan workflow.

I can help with producing and describing use cases. Perhaps @GiFontenelle knows of good use cases as well (e.g. existing or upcoming StructuredDataOnCommons GLAM uploads)?

When you say Wikibase, do you mean Wikidata, Structured Data on Commons, or Wikibase in general? Judging from the context it's SDC, so I would scope it to that. For the monuments database we define source fields, how they map to the destination, and what kind of conversion to apply. You could do something like that.

I'd keep the scope to Wikibase in general because I feel the same need for Wikidata too, actually.

Another existing format that I suddenly thought about, for inspiration: ResourceSync

http://www.openarchives.org/rs/toc

I do think this is more of a protocol to keep information in distributed databases synchronized, rather than a format for piping data back and forth. I'm not really aware of any projects that actively use this - I'd be interested to hear about them.

If I remember correctly, @Abbe98 may have looked at this before, in the context of the metadata roundtripping research?

ResourceSync is indeed a protocol for informing other databases about one's contents or changes (similar in scope to OAI-PMH) and, as far as I know, it hasn't had many implementations.

An advantage of the QS format that should not be dismissed is that because it's tabular it's compatible with the output formats of most SPARQL endpoints. Having played with the idea of supporting it in Pattypan I find it rather easy to parse and because it's compatible with the output formats of SPARQL endpoints it opens up some very interesting workflows.
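As a rough illustration of that workflow, the sketch below turns CSV results from a SPARQL endpoint into tab-separated QuickStatements-style lines (item, property, value). The column names "item" and "value" and the helper names are assumptions for the example; a real query would define its own variables, and non-entity values such as strings or dates would need their own QuickStatements syntax.

```python
import csv
import io

ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def local_id(uri):
    """Strip the entity URI prefix, leaving e.g. 'Q42'; other values pass through."""
    if uri.startswith(ENTITY_PREFIX):
        return uri[len(ENTITY_PREFIX):]
    return uri

def sparql_csv_to_qs(csv_text, prop):
    """Convert SPARQL CSV rows with hypothetical 'item' and 'value' columns
    into tab-separated lines of the form: item, property, value."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append("\t".join([local_id(row["item"]), prop, local_id(row["value"])]))
    return lines

sample = (
    "item,value\r\n"
    "http://www.wikidata.org/entity/Q42,http://www.wikidata.org/entity/Q5\r\n"
)
qs_lines = sparql_csv_to_qs(sample, "P31")
```

A richer JSON-based format would need an equally short path from query results to edits to keep this kind of workflow attractive.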

Thanks for participating in the Wikimedia Hackathon 2021! We hope you had a great time.

  • If this task was being worked on and resolved at the Hackathon: Please change the task status to "resolved" via the Add Action...Change Status dropdown.
  • If this task is still valid and should stay open: Please add another active project tag to this task, so others can find this task (as likely nobody in the future will look back at Wikimedia-Hackathon-2021 tasks when trying to find something they are interested in).
  • In case there is nothing else to do for this task, or nobody plans to work on this task anymore: Please set the task status to "declined".

Thank you,
your Hackathon venue housekeeping service

For WikibaseStatementUpdater I (ab)used the QuickStatements format to mean "update the existing statement if one (and only one) exists, otherwise add a new one".
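That rule can be sketched roughly as follows. This is not the actual WikibaseStatementUpdater code: the function name and the simplified statement dictionaries are assumptions for illustration, and the sketch takes the rule literally, so an item with two or more matching statements gets a new statement added.

```python
def update_or_add(statements, prop, new_value):
    """Apply the 'update if exactly one exists, otherwise add' rule.

    `statements` is a simplified, hypothetical list of dicts such as
    {"property": "P31", "value": "Q5"}; it is modified in place.
    Returns "updated" or "added" to report which branch was taken.
    """
    matches = [s for s in statements if s["property"] == prop]
    if len(matches) == 1:
        # Exactly one existing statement for this property: overwrite its value.
        matches[0]["value"] = new_value
        return "updated"
    # Zero or several existing statements: add a new one instead of guessing.
    statements.append({"property": prop, "value": new_value})
    return "added"

statements = [{"property": "P1", "value": "a"}]
first = update_or_add(statements, "P1", "b")   # one match: updated in place
second = update_or_add(statements, "P2", "c")  # no match: added
```

In a dedicated format, this behaviour could be named explicitly as a merge strategy on each statement rather than being implied by the tool.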

I think you can be a bit bold here and just propose something. I can volunteer to add support for that format in my tool and give feedback based on that.

JSON sounds nice, as you can have a schema to validate against. I recommend using JSONL (JSON Lines) though, to enable per-record parsing.
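A minimal sketch of per-record parsing, using only the standard library: each line is parsed and checked on its own, so one broken record does not invalidate the rest of the file. The required keys are a hypothetical stand-in for a real schema, and the function name is made up for the example.

```python
import io
import json

# Hypothetical minimal "schema": every record must have these keys.
REQUIRED_KEYS = {"entity", "property", "value"}

def read_edits(stream):
    """Yield (line_number, record, error) for each non-empty line of a JSONL stream.

    Exactly one of `record` and `error` is None for every yielded tuple.
    """
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            yield line_number, None, "invalid JSON: " + err.msg
            continue
        if not isinstance(record, dict):
            yield line_number, None, "record is not a JSON object"
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            yield line_number, None, "missing keys: " + ", ".join(sorted(missing))
        else:
            yield line_number, record, None

sample = io.StringIO(
    '{"entity": "Q4115189", "property": "P31", "value": "Q5"}\n'
    '{"entity": "Q4115189", "property": "P31"\n'
    '{"entity": "Q4115189", "value": "Q5"}\n'
)
results = list(read_edits(sample))
```

Because records are handled one at a time, a tool can start applying valid edits while reporting the line numbers of the invalid ones, and it never has to hold the whole batch in memory.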

For more inspiration: