
Expose structured diffs in Wikibase API
Open, Needs Triage · Public

Description

As an API user, I would like to be able to access diffs between revisions of items (or properties, lexemes). Currently, the API only exposes the HTML rendition of these diffs, via the compare action:
https://www.wikidata.org/w/api.php?action=help&modules=compare

This returns HTML like this:

<tr><td colspan="2" class="diff-lineno">aliases / en / 0</td><td colspan="2" class="diff-lineno">aliases / en / 0</td></tr><tr><td colspan="2">&nbsp;</td><td class="diff-marker">+</td><td class="diff-addedline"><div><ins class="diffchange diffchange-inline">PF3D7_0720400</ins></div></td></tr><tr><td colspan="2" class="diff-lineno">aliases / en / 1</td><td colspan="2" class="diff-lineno">aliases / en / 1</td></tr><tr><td class="diff-marker">-</td><td class="diff-deletedline"><div><del class="diffchange diffchange-inline">PF07_0085</del></div></td></tr>
<!-- diff cache key wikidatawiki:diff:wikidiff2:1.12:old-888082520:rev-888696170:1.7.3:25:lang-en -->
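For reference, a request for such a diff can be built roughly as follows. This is a minimal sketch: `compare_url` is a hypothetical helper, and treating the two revision IDs from the cache key above as the `fromrev` and `torev` parameters of the compare action is an assumption.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def compare_url(from_rev, to_rev):
    """Build a compare-action URL for two revision IDs (hypothetical helper)."""
    params = {
        "action": "compare",
        "fromrev": from_rev,
        "torev": to_rev,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


# Revision IDs taken from the diff cache key above (assumed to be old/new).
print(compare_url(888082520, 888696170))
```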

It would be great to have a JSON representation of these diffs as well. Currently, I can parse the HTML to extract the information I need, but this is brittle as any change in diff rendering could potentially break my consumer.
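To illustrate the kind of HTML parsing this currently requires, here is a minimal sketch using only the Python standard library. It is not the code used in EditGroups; the class and function names are made up for this example, and it relies on the CSS class names seen in the sample above (`diff-lineno`, `diff-addedline`, `diff-deletedline`), which is exactly the brittle part.

```python
from html.parser import HTMLParser


class DiffTableParser(HTMLParser):
    """Collects (path, change, value) triples from compare-action diff rows."""

    def __init__(self):
        super().__init__()
        self.changes = []   # list of (path, "added" or "removed", text)
        self._path = None   # text of the most recent diff-lineno cell
        self._mode = None   # "lineno", "added", "removed" or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "diff-lineno" in classes:
            self._mode = "lineno"
        elif "diff-addedline" in classes:
            self._mode = "added"
        elif "diff-deletedline" in classes:
            self._mode = "removed"
        else:
            self._mode = None
        self._buffer = []

    def handle_data(self, data):
        # Only keep text inside cells we care about.
        if self._mode:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self._mode:
            return
        text = "".join(self._buffer).strip()
        if self._mode == "lineno":
            self._path = text
        else:
            self.changes.append((self._path, self._mode, text))
        self._mode = None


def parse_diff_html(html):
    """Return the list of changes found in a diff table fragment."""
    parser = DiffTableParser()
    parser.feed(html)
    parser.close()
    return parser.changes
```

Fed the sample rows above, this should yield one added alias under `aliases / en / 0` and one removed alias under `aliases / en / 1`. Any change to the class names or table layout would silently break it.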

There is already a task T56328 about this in MediaWiki, but given that there is an even stronger case for this in Wikidata, I am creating a specific ticket for it. This could potentially be implemented by a dedicated API action given the different data format.

I do not have a particular JSON format to propose: anything that is easy to produce by the underlying PHP implementation would be great to have.

Event Timeline

Pintoch created this task. Mar 20 2019, 1:31 PM
Restricted Application added a project: Wikidata. Mar 20 2019, 1:31 PM
Restricted Application added a subscriber: Aklapper.
Pintoch updated the task description. Mar 20 2019, 1:34 PM
Pintoch renamed this task from Exposed structured diffs in Wikibase API to Expose structured diffs in Wikibase API. Mar 20 2019, 1:37 PM

Can you write a short bit about what you'd do with the API/the use case?

@Lydia_Pintscher Sure! I have just deployed a demonstration of this use case on EditGroups.

The goal is to index all Wikidata edit groups by the properties that they change (as statements or qualifiers, added or removed by the edit group). This can be useful to get an overview of the data imports that happened in a particular domain. For instance, the list of all edit groups using P50 (author) gives you an overview of the ongoing author disambiguation effort in Wikicite.

Some editing actions (such as wbeditentity-create or wbeditentity-update) allow users to make large changes to items, so their edit summaries do not list the properties involved. To make sure these edits are also indexed, we need to analyze their diffs. Hence the need for this API.

One alternative would be to retrieve the JSON representation of the entities for all revisions involved and compute the diff manually, client-side. This is potentially more scalable as we could retrieve 25 diffs in one API call (25*2 revisions). But implementing the diff logic is probably a bit more involved.
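A rough sketch of what that client-side comparison could look like for the use case above (which properties an edit touches). It assumes the two revisions have already been fetched as entity JSON, with statements under `claims`, each carrying an `id`, a `mainsnak` with a `property`, and optional `qualifiers` keyed by property. The function names are hypothetical and this is not the logic in diffinspector.py.

```python
def statements_by_id(entity):
    """Map statement ID -> statement for every claim in an entity JSON dict."""
    return {
        statement["id"]: statement
        for statements in entity.get("claims", {}).values()
        for statement in statements
    }


def properties_of(statement):
    """Properties used by a statement: its main property plus qualifier properties."""
    props = {statement["mainsnak"]["property"]}
    props.update(statement.get("qualifiers", {}))
    return props


def changed_properties(old_entity, new_entity):
    """Return the properties added, removed or changed between two revisions."""
    old = statements_by_id(old_entity)
    new = statements_by_id(new_entity)
    result = {"added": set(), "removed": set(), "changed": set()}
    for guid in new.keys() - old.keys():
        result["added"] |= properties_of(new[guid])
    for guid in old.keys() - new.keys():
        result["removed"] |= properties_of(old[guid])
    for guid in old.keys() & new.keys():
        # Same statement ID but different content: report it as changed.
        if old[guid] != new[guid]:
            result["changed"] |= properties_of(old[guid]) | properties_of(new[guid])
    return result
```

This only covers statements and qualifiers. References, labels, descriptions, aliases and sitelinks would each need their own handling, which is where the extra effort comes from.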

Relevant code:
https://github.com/Wikidata/editgroups/blob/master/tagging/diffinspector.py

Thanks a lot! That makes it clear.
@Addshore Care to chime in?

Phaebz added a subscriber: Phaebz. Sat, May 18, 8:54 AM
Tobias1984 added a comment (edited). Mon, May 20, 6:19 PM

I just stumbled over a use case that would perhaps also benefit from structured diffs.

I made this small tool to view the Wikidata edit stream:

https://tobias47n9e.gitlab.io/wikidata-stream/

From the edit event, I now fetch the current and previous revisions of the item and compare them in a structured form. This is quick and easy enough for my purposes.

I see issues with structured diffs, especially for deeply nested statements. The data structure would also need to handle additions, modifications and deletions, which would probably make the JSON large. This would further increase the size of the edit event JSON object.
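To make the size concern concrete, here is one possible shape for such a structure: a flat list of add/remove/change operations, each with a path into the entity JSON. This is only an illustrative sketch, not a proposed format and not how Wikibase computes its diffs; `diff_json` is a made-up name and lists are compared naively by index.

```python
def diff_json(old, new, path=()):
    """Yield add/remove/change operations describing how `old` became `new`."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            if key not in old:
                yield {"op": "add", "path": list(path) + [key], "new": new[key]}
            elif key not in new:
                yield {"op": "remove", "path": list(path) + [key], "old": old[key]}
            else:
                yield from diff_json(old[key], new[key], path + (key,))
    elif isinstance(old, list) and isinstance(new, list):
        for index in range(max(len(old), len(new))):
            if index >= len(old):
                yield {"op": "add", "path": list(path) + [index], "new": new[index]}
            elif index >= len(new):
                yield {"op": "remove", "path": list(path) + [index], "old": old[index]}
            else:
                yield from diff_json(old[index], new[index], path + (index,))
    elif old != new:
        # Leaf values (or mismatched types) that differ.
        yield {"op": "change", "path": list(path), "old": old, "new": new}


old = {"aliases": {"en": ["PF07_0085"]}}
new = {"aliases": {"en": ["PF3D7_0720400"]}, "labels": {"en": "example"}}
for operation in diff_json(old, new):
    print(operation)
```

Each operation carries the old and/or new value in full, so a removed or added statement with qualifiers and references would be embedded whole, which is where the payload growth would come from.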

At the moment I can't decide whether this is needed, or whether most use cases need to fetch both revisions and handle the structured diffs client-side anyway.