Provide intraline diff format in API action=compare
Open, Normal, Public

Description

Currently it is possible to request the HTML source of a diff. It would be far more useful if the diff could be retrieved in a machine-readable form, for example one that JSON parsers can understand.

The HTML diff formatter already does intraline processing but is currently hardcoded in the PHP C-extension to produce HTML directly. We can re-implement this as a DiffFormatter subclass and expose it in the API as one of the diff formats.

Details

Reference
bz54328
bzimport raised the priority of this task from to Normal.
bzimport set Reference to bz54328.
Petrb created this task. Sep 19 2013, 2:59 PM

The data structure would have to be rather more complicated than that. At first guess, something along the lines of (in JSON):

"diff": [
    { "line": 1, "type": "context", "content": "Line" },
    { "line": 2, "type": "removed", "old": "Line" },
    { "line": 2, "type": "added", "new": "Line" },
    { "line": 3, "type": "context", "content": "Line" },
    { "line": 47, "type": "context", "content": "Line" },
    { "line": 48, "type": "changed", "old": "Line", "new": "Line" },
    { "line": 49, "type": "context", "content": "Line" }
]
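
As a rough illustration of how a per-line structure like this could be produced, here is a minimal Python sketch built on the standard difflib module. It is not MediaWiki's diff engine. The function name line_diff is hypothetical, and the choice to number "removed" lines on the old side and all other records on the new side is an assumption, since the sketch above does not say which side "line" refers to.

```python
import difflib


def line_diff(old_text, new_text):
    """Return per-line records with types context/removed/added/changed."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    records = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(j2 - j1):
                records.append({"line": j1 + k + 1, "type": "context",
                                "content": new_lines[j1 + k]})
            continue
        # For a replaced run, pair old and new lines up as "changed" records.
        # Pure deletions and insertions have nothing to pair (paired == 0).
        paired = min(i2 - i1, j2 - j1)
        for k in range(paired):
            records.append({"line": j1 + k + 1, "type": "changed",
                            "old": old_lines[i1 + k], "new": new_lines[j1 + k]})
        # Leftover old lines were removed (assumption: old-side numbering).
        for k in range(paired, i2 - i1):
            records.append({"line": i1 + k + 1, "type": "removed",
                            "old": old_lines[i1 + k]})
        # Leftover new lines were added (new-side numbering).
        for k in range(paired, j2 - j1):
            records.append({"line": j1 + k + 1, "type": "added",
                            "new": new_lines[j1 + k]})
    return records
```

Unlike the sketch above, this version emits every context line rather than only a few lines around each change; trimming the context would be a separate step.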

If you want indication in the line of what changed for "changed" types, that's another complication. Instead of just "Line" it would have to be an array of fragments. One simple way might be that even array indexes are unchanged and odd are changed:

"old": [
    "foo bar ",
    "",
    "quux ",
    "poop"
],
"new": [
    "foo bar ",
    "baz ",
    "quux ",
    "etc."
]

That might indicate that "baz" was inserted into the list and "poop" at the end was replaced with "etc.". Or maybe it would be better to combine "old" and "new" into one data structure somehow.
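
The even/odd fragment convention can be sketched in Python with the standard difflib module. This is only an illustration of the encoding, not the logic MediaWiki's HTML formatter uses; the function name intraline_fragments and the whitespace-based tokenization are assumptions.

```python
import difflib
import re


def intraline_fragments(old_line, new_line):
    """Split two lines into parallel fragment lists.

    Even indexes hold unchanged text, odd indexes hold changed text.
    """
    # Tokenize into words with their trailing whitespace so that joining
    # the fragments reproduces each line exactly.
    old_tokens = re.findall(r"\S+\s*|\s+", old_line)
    new_tokens = re.findall(r"\S+\s*|\s+", new_line)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    old_frags, new_frags = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        changed = tag != "equal"
        next_slot_is_changed = len(old_frags) % 2 == 1
        if next_slot_is_changed != changed:
            # Pad with an empty fragment so the parity convention holds.
            old_frags.append("")
            new_frags.append("")
        old_frags.append("".join(old_tokens[i1:i2]))
        new_frags.append("".join(new_tokens[j1:j2]))
    return old_frags, new_frags
```

Calling intraline_fragments("foo bar quux poop", "foo bar baz quux etc.") returns the "old" and "new" arrays shown above.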

Also, keep in mind that lots of little objects can use a surprising amount of memory (see bug 53663).

Petrb added a comment. Oct 17 2013, 1:53 PM

I think that, to begin with, splitting the new text from the old text would be enough. Right now it's hard to find out what was added by a user and what was there before they edited the page.

  1. Character vs. line offset

I'd much rather represent diffs based on a character offset. I'm afraid of representing position with something like lineno, since linebreaks are defined differently between systems. Character offsets would also allow us to make changes to our diff detection strategy without changing the output.

  2. Machine readable vs. human readable diffs

Machine readable diff opcode formats tend to represent the full set of operations used to recreate a revision -- not just the context. A common format that I'm familiar with would be something like this:

a = "These are wrd."
b = "These are words."
{
  diff: [
    {
      op: "equal",
      a_start: 0,
      a_end: 10,
      b_start: 0,
      b_end: 10
    },
    {
      op: "remove",
      a_start: 10,
      a_end: 13,
      b_start: 10,
      b_end: 10,
      content: "wrd"
    },
    {
      op: "insert",
      a_start: 13,
      a_end: 13,
      b_start: 10,
      b_end: 15,
      content: "words"
    },
    {
      op: "equal",
      a_start: 13,
      a_end: 14,
      b_start: 15,
      b_end: 16
    }
  ]
}
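
To make the "full set of operations used to recreate a revision" point concrete, here is a small Python sketch that rebuilds b from a using only the operation list above. The function name apply_ops is hypothetical; the field names are the ones from the example.

```python
A = "These are wrd."
B = "These are words."

OPS = [
    {"op": "equal", "a_start": 0, "a_end": 10, "b_start": 0, "b_end": 10},
    {"op": "remove", "a_start": 10, "a_end": 13, "b_start": 10, "b_end": 10,
     "content": "wrd"},
    {"op": "insert", "a_start": 13, "a_end": 13, "b_start": 10, "b_end": 15,
     "content": "words"},
    {"op": "equal", "a_start": 13, "a_end": 14, "b_start": 15, "b_end": 16},
]


def apply_ops(a, ops):
    """Rebuild the new revision from the old text and a complete op list."""
    pieces = []
    for op in ops:
        if op["op"] == "equal":
            # Unchanged text is copied from the old revision by offset.
            pieces.append(a[op["a_start"]:op["a_end"]])
        elif op["op"] == "insert":
            # Inserted text exists only in the operation itself.
            pieces.append(op["content"])
        elif op["op"] == "remove":
            # Removed text contributes nothing; verify it matches the source.
            if a[op["a_start"]:op["a_end"]] != op["content"]:
                raise ValueError("remove op does not match the old revision")
        else:
            raise ValueError("unknown op: %r" % op["op"])
    return "".join(pieces)
```

With these inputs, apply_ops(A, OPS) yields the text of b, which is not possible with a format that only carries the changed lines and some context.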

  3. Compressed format

I don't see the value in compressing the format given that the API doesn't really let you query for more than one diff at a time and diffs tend to be represented in few operations. However, we could simply represent each operation as a tuple with agreed upon field order:

{
  op: "insert",
  a_start: 13,
  a_end: 13,
  b_start: 15,
  b_end: 18,
  content: "foo"
}

could be

[
  "insert",
  13,
  13,
  15,
  18,
  "foo"
]

or, if we really want a tight format (since the rest of the fields are derivable from the sequence of operations):

[
  "insert",
  15,
  "foo"
]
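
The tuple form with an agreed-upon field order is easy to convert to and from the verbose form. The following Python sketch shows one way, assuming the field order used in the example above; the names FIELDS, to_tuple and from_tuple are hypothetical. It does not cover the tightest three-element form, which needs the whole sequence of operations to recover the omitted offsets.

```python
FIELDS = ("op", "a_start", "a_end", "b_start", "b_end", "content")


def to_tuple(op):
    """Flatten an operation dict into a list in the agreed field order."""
    # "equal" operations carry no content, so that slot becomes None.
    return [op.get(name) for name in FIELDS]


def from_tuple(row):
    """Expand a flattened operation back into a dict, dropping empty slots."""
    return {name: value for name, value in zip(FIELDS, row) if value is not None}
```

For the insert operation above, to_tuple gives ["insert", 13, 13, 15, 18, "foo"], and from_tuple restores the original object.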

(In reply to Aaron Halfaker from comment #3)

> 1. Character vs. line offset
> I'd much rather represent diffs based on a character offset. I'm afraid of representing position with something like lineno, since linebreaks are defined differently between systems.

Isn't that an argument for line-based rather than character-based offsets?

> Character offsets would also allow us to make changes to our diff detection strategy without changing the output.
>
> 2. Machine readable vs. human readable diffs
> Machine readable diff opcode formats tend to represent the full set of operations used to recreate a revision -- not just the context.

OTOH, what is the usual use of querying the diffs? I suspect it's more often that the client wants to display a human-readable diff to the end user than that the client wants to do the equivalent of the 'patch' utility on an already-downloaded local copy of the article.

> and diffs tend to be represented in few operations.

On talk pages, maybe. But someone heavily copyediting an article is likely to generate a huge number of operations. With the way the diff algorithm works, even some simple edits will generate many operations as it tries to match up individual letters in the old vs new paragraphs.

Petrb added a comment. Feb 12 2015, 1:55 PM

I talked with Tom Morris recently, and I believe this feature would also make it easier to provide meaningful diffs on Wikidata, which doesn't use the classic content model.


Exposing the array or unified format in the API would be useful. The array seems more useful, but the unified string format has the advantage of being a de facto standard that lots of applications will know how to deal with.

We currently already have a DiffFormatter interface, implemented as TableDiffFormatter, ArrayDiffFormatter, and UnifiedDiffFormatter.
Considering these already exist and take the same Diff object, it looks like it wouldn't be very complicated at all to expose these.

Krinkle renamed this task from "Make it possible for edit diff to be provided as a raw text" to "API should provide revision diffs in machine readable format". Jun 12 2015, 1:22 AM
Krinkle added a subscriber: GWicke. Jun 12 2015, 1:33 AM

Following a conversation with @Halfak at the Lyon Hackathon 2015, we agreed the unified format isn't all that useful. We can consider exposing it if there's actual interest expressed by users, but based on the needs of ClueBot, Huggle, ORES, and other tools, an array format would avoid re-inventing diff logic everywhere.

But the existing array format isn't very useful, and actually has some logic errors in it (probably because it was never tested). Ideally we'd come up with an array format that exposes intraline diffs (word-level diffs).

MediaWiki already has logic for this in the HTML formatter, but that's not currently written in a re-usable way. The format of https://phabricator.wikimedia.org/T56328#571222 seems reasonable. @GWicke also has ideas for this.

Once implemented, we can expose it through ApiQueryRevisions for starters. See T100082 for more information on how we can make those diffs available for high-volume consumers.

Krinkle renamed this task from "API should provide revision diffs in machine readable format" to "Provide intraline diff format in revision query API".
Krinkle updated the task description. Nov 2 2015, 10:47 PM

I would find some kind of inline diff output from the API very useful. I've seen excellent response from new users to the Special:MobileDiff version of diffs, compared to traditional diffs: they use the same design concept as 'track changes' in word processor documents, so users new to wikitext don't get as hung up on all the layout and code, especially if what they are looking at involves primarily changes of textual content.

I want to build a diff viewer (for dashboard.wikiedu.org), and I was sad to discover that inline diff html is not available right now except by scraping Special:MobileDiff... which may be the approach I have to take for now.


Related task: T157177

Task description: "There does not seem to be any way to get diff information in any other format than rendered HTML (which, beyond being ugly as hell, is unsuitable for analysis, conflict resolution and any number of other things one might want diff data for). There should be a prop=revisions flag that makes it output diffs in JSON or some other machine-readable format."

"Example of such an API call:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=revisions&titles=Apple&rvprop=content%7Cids%7Cuser%7Ccomment&rvlimit=max&rvdiffto=prev"

"Parsing HTML certainly does lead to a plethora of errors."

"diff": [
    { "line": 1, "type": "context", "content": "Line" },
    { "line": 2, "type": "removed", "old": "Line" },
    { "line": 2, "type": "added", "new": "Line" },
    { "line": 3, "type": "context", "content": "Line" },
    { "line": 47, "type": "context", "content": "Line" },
    { "line": 48, "type": "changed", "old": "Line", "new": "Line" },
    { "line": 49, "type": "context", "content": "Line" }
]

Something like @Anomie's initial sketch seems ideal.

Something to consider: On Wikidata, diffs for items don't have lines.


@Yair_rand How might we use Wikidata's diff structure in the API for Wikipedia? It's often easier to list changes on Wikidata because changes are made to specific properties/fields, rather than to bodies of text.

Anomie renamed this task from "Provide intraline diff format in revision query API" to "Provide intraline diff format in API action=compare". May 4 2017, 7:32 PM
Anomie updated the task description.

Per T164106: Deprecate parsing and diff options in ApiQueryRevisionsBase, diffing in prop=revisions is deprecated. This could still be done in action=compare.
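
For reference, a request to action=compare can be built as in the following Python sketch. The action, fromrev and torev parameters are part of the existing API; the function name compare_url is hypothetical and the revision IDs passed to it are placeholders. No parameter for selecting a machine-readable diff format is shown, because this task does not define one.

```python
from urllib.parse import urlencode

# English Wikipedia's API endpoint, used here only as an example target.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def compare_url(from_rev, to_rev):
    """Build an action=compare request URL for two revision IDs."""
    params = {
        "action": "compare",
        "format": "json",
        "fromrev": from_rev,
        "torev": to_rev,
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

As described in this task, the JSON response to such a request wraps the diff as rendered HTML, which is the part an intraline array format would replace.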