Design a file format to represent Wikibase edits
Open, Stalled, Needs Triage, Public

Assigned To

None

Authored By

	Pintoch
	May 13 2021, 4:55 PM

Description

In short: I would like to bring tool writers together around a table to design a file format to represent candidate Wikibase edits. QuickStatements has been the de facto standard for this for some time, but there are a few problems with it:

- Due to its tabular nature, it is not easy to extend with new features. For instance, there is interest in defining various strategies for merging added statements with existing ones, or in setting the ranks of statements.
- Again due to its tabular nature, it is not easy to parse or validate.
- The format is defined by the maintainers of the QuickStatements tool, whereas such a format could be used as a common language by many different tools. So it deserves coordination between the various stakeholders, a precise specification, and tooling around it.

This format would be based on a richer encoding than CSV/TSV (such as JSON), and so it would not be designed to be easily produced by users with spreadsheet software or a text editor, but rather meant to be a communication medium between tools. Down the line, perhaps even Wikibase itself could accept this format to ingest edits?

Context: in the OpenRefine project we are planning to add support for editing structured metadata on Wikimedia Commons. As part of that we are considering representing the upload of a collection of media files and their structured metadata as one big archive (which contains both the media files and the metadata). This big archive could then be sent to a Toolforge tool which would perform the upload to Commons (just like QuickStatements but with support for uploading new media files and setting the corresponding wikitext for each file as well).

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Spinster	T289971 [epic] Add Structured data on Wikimedia Commons support to OpenRefine
		Stalled		None	T282796 Design a file format to represent Wikibase edits

Event Timeline

Pintoch created this task. May 13 2021, 4:55 PM

Restricted Application added a subscriber: Aklapper. May 13 2021, 4:55 PM

Pintoch moved this task from Backlog to Projects on the Wikimedia-Hackathon-2021 board. May 13 2021, 4:55 PM

I thought there was already a standard around some "diff" format, like the one DoubleCheck uses between rev_ids in the MediaWiki revision table? I recall using WikiLoop DoubleCheck, which has an interesting interface that exposes a portion of an edit for judgement and rollback.
It probably makes sense to pull someone from their team, or others, into this conversation as well, to explore ideas on how to display merge conflict resolution and which formats lend themselves well to that.

Antoine2711 subscribed. May 19 2021, 8:25 PM

Representing the difference between two entity states is related, but not quite the same thing. In a file format that represents candidate edits to be performed, you should not need to know about the current state of the items you are about to edit: you only provide the data you want to add / remove and some parameters to control how this data will be merged with existing data on the item.
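To make the distinction concrete, here is a minimal sketch of what a state-independent candidate edit could look like. Every field name (`entity`, `add`, `remove`, `merge`) and the `skip_if_present` strategy are assumptions made up for illustration, not part of any agreed format. The point is only that the same edit record can be applied to different current states.

```python
import json

# Hypothetical candidate edit; all field names are illustrative assumptions.
candidate_edit = {
    "entity": "Q42",
    "add": [{"property": "P31", "value": "Q5", "merge": "skip_if_present"}],
    "remove": [{"property": "P21", "value": "Q6581097"}],
}


def apply_edit(current_statements, edit):
    """Apply a candidate edit to a list of (property, value) pairs.

    The edit itself carries no knowledge of the current state; the merge
    parameter decides what happens when the data is already there.
    """
    result = list(current_statements)
    for removal in edit.get("remove", []):
        pair = (removal["property"], removal["value"])
        result = [s for s in result if s != pair]
    for addition in edit.get("add", []):
        pair = (addition["property"], addition["value"])
        if addition.get("merge") == "skip_if_present" and pair in result:
            continue  # an identical statement already exists, do not duplicate
        result.append(pair)
    return result


# The record serializes to JSON without loss.
serialized = json.dumps(candidate_edit)
```

Applying `candidate_edit` to an empty item and to an item that already has both statements gives the same final state, which a diff between two revisions could not express without knowing the starting point.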

For inspiration about 'what has been done before': @Husky once developed QSML (https://hay.toolforge.org/qsml/). It is not exactly what we want to do here, but it goes in that direction.

Another piece of inspiration / stuff that others have already done: https://commons.wikimedia.org/wiki/Category:Data_ingestion_templates via @Multichill

When talking with tool writers, it may be really good to at least inform / consult with @Magnus and @LucasWerkmeister as QuickStatements contributors, and with @Yarl because of similarities with the Pattypan workflow.

I can help with producing and describing use cases. Perhaps @GiFontenelle knows of good use cases as well (e.g. existing or upcoming StructuredDataOnCommons GLAM uploads)?

GFontenelle_WMF subscribed. May 20 2021, 6:31 PM

Edgars2007 subscribed. May 20 2021, 9:03 PM

When you say Wikibase, do you mean Wikidata, Structured Data on Commons, or Wikibase in general? Judging from the context it's SDC, so I would scope it to that. For the monuments database we define source fields, how they map to the destination, and what kind of conversion to apply. You could do something like that.

I'd keep the scope to Wikibase in general because I feel the same need for Wikidata too, actually.

Another existing format that I suddenly thought about, for inspiration: ResourceSync

http://www.openarchives.org/rs/toc

I do think this is more of a protocol to keep information in distributed databases synchronized, rather than a format for piping data back and forth. I'm not really aware of any projects that actively use this - I'd be interested to hear about them.

If I remember correctly, @Abbe98 may have looked at this before, in the context of the metadata roundtripping research?

ResourceSync is indeed a protocol for informing other databases about one's contents or changes (similar in scope to OAI-PMH), and as far as I know it hasn't had many implementations.

An advantage of the QS format that should not be dismissed is that, because it's tabular, it's compatible with the output formats of most SPARQL endpoints. Having played with the idea of supporting it in Pattypan, I find it rather easy to parse, and its compatibility with SPARQL output opens up some very interesting workflows.
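As a rough illustration of that workflow, the sketch below turns CSV output from a SPARQL query into tab-separated lines in the style of QuickStatements V1 (item, property, value). The column names `item` and `value`, the helper names, and the choice of a single fixed property are assumptions for this example; a real query would need its own mapping.

```python
import csv
import io

# Full entity URIs as typically returned by the Wikidata query service.
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def strip_prefix(uri):
    """Reduce an entity URI to its bare ID; leave other values untouched."""
    return uri[len(ENTITY_PREFIX):] if uri.startswith(ENTITY_PREFIX) else uri


def csv_to_quickstatements(csv_text, prop):
    """Convert CSV rows with 'item' and 'value' columns (assumed names)
    into tab-separated item/property/value lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        item = strip_prefix(row["item"])
        value = strip_prefix(row["value"])
        lines.append("\t".join([item, prop, value]))
    return "\n".join(lines)
```

A richer format would probably want to keep a similarly cheap path from query results to edits, for instance through a small converter like this one.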

Thanks for participating in the Wikimedia Hackathon 2021! We hope you had a great time.

- If this task was being worked on and resolved at the Hackathon: Please change the task status to "resolved" via the Add Action... → Change Status dropdown.
- If this task is still valid and should stay open: Please add another active project tag to this task, so others can find this task (as likely nobody in the future will look back at Wikimedia-Hackathon-2021 tasks when trying to find something they are interested in).
- In case there is nothing else to do for this task, or nobody plans to work on this task anymore: Please set the task status to "declined".

Thank you,
your Hackathon venue housekeeping service

Pintoch added a project: OpenRefine. May 24 2021, 12:14 PM

Pintoch removed a project: Wikimedia-Hackathon-2021. May 24 2021, 12:20 PM

Nikerabbit subscribed. May 25 2021, 12:59 PM

For WikibaseStatementUpdater I (ab)used the QuickStatements format to mean "update the existing statement if one (and only one) exists, otherwise add it as a new one".
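That strategy can be sketched roughly as follows. This is not WikibaseStatementUpdater's actual code: the function name, the dict-based statement shape, and the decision to match statements by property alone are all assumptions for illustration.

```python
def update_or_add(statements, prop, new_value):
    """Update the single existing statement for prop, or append a new one.

    statements is a list of dicts with 'property' and 'value' keys
    (a hypothetical shape). The list is modified in place.
    """
    matches = [s for s in statements if s["property"] == prop]
    if len(matches) == 1:
        # Exactly one candidate: the update is unambiguous.
        matches[0]["value"] = new_value
        return "updated"
    # Zero matches, or several (ambiguous): add a new statement instead.
    statements.append({"property": prop, "value": new_value})
    return "added"
```

A dedicated format could make this choice an explicit per-statement parameter rather than an implicit reinterpretation of existing syntax.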

I think you can be a bit bold here and just propose something. I can volunteer to add support for that format in my tool and give feedback based on that.

JSON sounds nice, as you can have a schema to validate against. I recommend using JSONL (JSON Lines), though, to enable per-record parsing.
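A minimal sketch of what per-record parsing buys you, assuming one edit per line: a malformed record can be reported with its line number while the rest of the file is still processed. The function name and the shape of the yielded tuples are illustrative choices, and schema validation of each record is left out.

```python
import json


def parse_edits(lines):
    """Yield (line_number, record, error) for each non-blank line.

    Exactly one of record and error is None, so a bad line does not
    prevent the following lines from being parsed.
    """
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            yield number, json.loads(line), None
        except json.JSONDecodeError as exc:
            yield number, None, str(exc)
```

Because the function accepts any iterable of lines, it can be fed an open file object directly, so a large batch never has to be loaded into memory as a whole.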

For more inspiration:

- wikibase-edit already uses a JS/JSON representation of edits. It should be feature-complete (supporting ranks, snaktypes, and all).
- The QuickStatements format can be converted to the wikibase-edit format via the lib quickstatements-to-wikibase-edit.
- There is also an ongoing experiment to add a reconciliation syntax to the wikibase-edit format.

This will be useful for T289971: [epic] Add Structured data on Wikimedia Commons support to OpenRefine, especially for the editing and upload functionalities that will be developed in that context.

Alicia_Fagerving_WMSE subscribed. Aug 31 2021, 10:58 AM

Spinster moved this task from Backlog to SDC-support Doing on the OpenRefine board. Nov 16 2021, 7:02 PM

@Pintoch is going to work on this in the near future!

As a first step, we are collecting input from anyone who has ideas on what such a more expressive format should be able to do. We're asking for feedback from Wikibase users in the broadest sense, and it can be added in this document:

https://docs.google.com/document/d/1z-UNIyd7EHedlPRlQsleHU402ZHSA7Xi789BNXdfbD4/edit

Pinging @Nikerabbit and @Maxlath specifically because you already provided some pointers and interest earlier, and of course anyone is welcome to chime in.

Preferably we want to collect this input ASAP so that we can get this train running.

RShigapov subscribed. Nov 18 2021, 1:58 PM

I came here after reading the google doc at https://docs.google.com/document/d/1z-UNIyd7EHedlPRlQsleHU402ZHSA7Xi789BNXdfbD4/edit?pli=1#

I'm going to put this here as slightly related https://github.com/wmde/WikibaseReconcileEdit

This was a prototype API built around some of the topics that are identified in the document, such as "Order of action for related items".

We haven't had action on this topic for a while, so changing status. I think it's still relevant?

Spinster moved this task from SDC-support Doing to 🥘 Backburner on the OpenRefine board. Jan 24 2023, 1:32 PM

Just for posterity, I'd like to mention my own Wikibase diff engine in Rust: https://gitlab.com/tobias47n9e/wikibase_rs/-/blob/master/src/entity_diff.rs

Alicia_Fagerving_WMSE unsubscribed. Jan 27 2023, 1:17 PM

Addshore unsubscribed. Jun 27 2023, 11:47 AM
