Provide intraline diff format in API action=compare
Open, Medium, Public

Assigned To

None

Authored By

	Petrb
	Sep 19 2013, 2:59 PM

Description

Currently it is possible to request the HTML source of a diff. It would be far more useful if the diff could be retrieved in a format that, for example, JSON parsers can understand.

The HTML diff formatter already does intraline processing but is currently hardcoded in the PHP C-extension to produce HTML directly. We can re-implement this as a DiffFormatter subclass and expose it in the API as one of the diff formats.

Details

Reference: bz54328

Related Objects

Status	Subtype	Assigned	Task
Resolved		Halfak	T252280 Improve Wikilabels UI
Open		None	T105518 Make Wiki Labels mobile compatible
Open		None	T104072 Implement single column diff option (from mobile app)
Open		None	T94370 Implement a mechanism that recognizes syntax errors of wikicode as a Huggle extension
Resolved		Petrb	T57793 Scoring should perhaps take into account added text rather than all text
Open		None	T100082 Provide useful diffs to high-volume consumers of recent changes
Open		None	T56328 Provide intraline diff format in API action=compare
Resolved		None	T117279 [EPIC] Core should provide inline diffs as well as side by side (Move InlineDifferenceEngine into core / remove MobileDiff)
Resolved		ovasileva	T101796 MobileDiff appears strangely empty if previous revision is hidden
Resolved		Jdlrobson	T222562 Special:MobileDiff doesn't correctly handle diff=0 if the diff would cover multiple revisions
Resolved		Ammarpad	T224430 Improve appearance of Special:MobileDiff when looking for invalid revision (Maybe because it's deleted)
Open		None	T194830 Refactor DifferenceEngine
Resolved		Jdlrobson	T238331 Option for hiding changes should be added to AMC diff display
Resolved		Jdlrobson	T240608 Standardise a control for switching diff types from side by side to inline/visual
Resolved		phuedx	T240622 [Technical debt payoff] Remove InlineDiffFormatter and InlineDifferenceEngine from MobileFrontend
Resolved		Jdlrobson	T240624 Style desktop Minerva diff page to look like Special:MobileDiff
Resolved		ovasileva	T242310 Regression: issues with MobileDiff
Resolved		• marcella	T242351 HTML for VisualDiff mode button appears in mobile diff pages (disabled by hack)
Resolved		matmarex	T244673 VisualDiff feature on mobile has an inconsistent style/layout
Resolved	BUG REPORT	Esanders	T245172 VisualDiff toggle buttons appear in mobile when the relevant page doesn't have wikitext content model
Resolved		ovasileva	T243783 MobileDiff drops whitespaces from edits
Resolved		ovasileva	T243235 Regression: Desktop diff styles for moved paragraphs load alongside mobile
Resolved		Jdlrobson	T171726 MobileDiff with inconsistent linebreaks within words
Invalid		None	T199307 Determine treatment for empty lines in diffs
Resolved		• jkroll	T197729 wikidiff2 creating ins and del elements with single empty character element
Open	Feature	None	T298174 Diffs should show change tags at lower resolutions
Open		None	T218428 Mobile diff is not clear about move direction
Resolved		Jdlrobson	T191706 It's not possible to undo/rollback edits from diff on Mobile
Resolved		Jdlrobson	T347779 Show editor groups and edit count on diff pages
Resolved		Jdlrobson	T347780 Include previous and next links outside the diff columns
Declined		None	T348179 Show bytes added on diff pages
Resolved		None	T350181 Enable desktop diff page on mobile site
Resolved	BUG REPORT	SBisson	T353404 Wikistory showing on mobile diff page
Resolved		Jdlrobson	T353407 Diff footer is overlaying the text when visual diff is clicked
Resolved	BUG REPORT	Jdlrobson	T353478 Diff tables collapse incorrectly on tablet breakpoint
Resolved		JSengupta-WMF	T353479 Diff legend does not display in mobile HTML
Open	BUG REPORT	None	T361176 visualdiff legend colors are mismatched
Open	BUG REPORT	None	T353062 Core diff on mobile buggy codex thank button
Resolved		Jdlrobson	T357079 Wrong digit on bnwiki mobile diff
Duplicate		None	T218779 Expose structured diffs in Wikibase API

Event Timeline

• bzimport raised the priority of this task from to Medium. Nov 22 2014, 2:01 AM

• bzimport added a project: MediaWiki-Action-API.

• bzimport set Reference to bz54328.

• bzimport added a subscriber: Unknown Object (MLST).

Petrb created this task. Sep 19 2013, 2:59 PM

The data structure would have to be rather more complicated than that. At first guess, something along the lines of (in JSON):

"diff": [

{ "line": 1, "type": "context", "content": "Line" },
{ "line": 2, "type": "removed", "old": "Line" },
{ "line": 2, "type": "added", "new": "Line" },
{ "line": 3, "type": "context", "content": "Line" },
{ "line": 47, "type": "context", "content": "Line" },
{ "line": 48, "type": "changed", "old": "Line", "new": "Line" },
{ "line": 49, "type": "context", "content": "Line" }

]
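For illustration only, here is a minimal Python sketch that produces entries shaped like the structure above. It is not MediaWiki code: it uses Python's standard difflib, the function name line_diff is hypothetical, and it assumes "removed" entries carry the old line number while "added" and "context" entries carry the new one. It emits every context line rather than eliding distant ones, and it reports a replaced line as a removed/added pair instead of a "changed" entry.

```python
import difflib


def line_diff(old_text, new_text):
    """Return line-level diff entries (context / removed / added)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    entries = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged lines are reported with their new line number.
            for j in range(j1, j2):
                entries.append(
                    {"line": j + 1, "type": "context", "content": new_lines[j]}
                )
            continue
        # "delete", "insert" and "replace" all reduce to removals + additions.
        for i in range(i1, i2):
            entries.append({"line": i + 1, "type": "removed", "old": old_lines[i]})
        for j in range(j1, j2):
            entries.append({"line": j + 1, "type": "added", "new": new_lines[j]})
    return entries
```

Calling line_diff("a\nb\nc", "a\nB\nc") yields a context entry, a removed entry, an added entry, and a final context entry, in that order.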

If you want indication in the line of what changed for "changed" types, that's another complication. Instead of just "Line" it would have to be an array of fragments. One simple way might be that even array indexes are unchanged and odd are changed:

"old": [
    "foo bar ",
    "",
    "quux ",
    "poop"
],
"new": [
    "foo bar ",
    "baz ",
    "quux ",
    "etc."
]

That might indicate that "baz" was inserted into the line and "poop" at the end was replaced with "etc.". Or maybe it would be better to combine "old" and "new" into one data structure somehow.
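A consumer could read the even/odd convention roughly as follows. This is a hedged Python sketch, not an existing API: the helper names are hypothetical, and it assumes fragments at even indexes are unchanged and identical on both sides, as described above.

```python
def changed_fragments(fragments):
    """Return the non-empty fragments at odd indexes (the changed parts)."""
    return [frag for i, frag in enumerate(fragments) if i % 2 == 1 and frag]


def unchanged_fragments(fragments):
    """Return the fragments at even indexes (the unchanged parts)."""
    return fragments[0::2]


def join_line(fragments):
    """Rebuild the full line text from its fragments."""
    return "".join(fragments)


old = ["foo bar ", "", "quux ", "poop"]
new = ["foo bar ", "baz ", "quux ", "etc."]

# The unchanged fragments must agree between the two sides.
assert unchanged_fragments(old) == unchanged_fragments(new)

removed = changed_fragments(old)  # text present only in the old line
added = changed_fragments(new)    # text present only in the new line
```

An empty string at an odd index marks a pure insertion or deletion on the other side, which is why changed_fragments skips empty fragments.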

Also, keep in mind that lots of little objects can use a surprising amount of memory (see bug 53663).

I think that, for a start, splitting new text from old text would be enough. Right now it's hard to find out what was added by the user and what was there before they edited the page.

Character vs. line offset

I'd much rather represent diffs based on a character offset. I'm afraid of representing position with something like lineno, since line breaks are defined differently between systems. Character offsets would also allow us to make changes to our diff detection strategy without changing the output.

Machine readable vs. human readable diffs

Machine readable diff opcode formats tend to represent the full set of operations used to recreate a revision -- not just the context. A common format that I'm familiar with would be something like this:

a = "These are wrd."
b = "These are words."
{

diff: [
  {
    op: "equal",
    a_start: 0,
    a_end: 10,
    b_start: 0,
    b_end: 10
  },
  {
    op: "remove",
    a_start: 10,
    a_end: 13,
    b_start: 10,
    b_end: 10,
    content: "wrd",
  },
  {
    op: "insert",
    a_start: 13,
    a_end: 13,
    b_start: 10,
    b_end: 15,
    content: "words",
  },
  {
    op: "equal",
    a_start: 13,
    a_end: 14,
    b_start: 15,
    b_end: 16
  }
]

}
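As a rough illustration of that opcode style, here is a Python sketch built on the standard difflib module. It is an assumption-laden stand-in rather than MediaWiki's or wikidiff2's algorithm: the function names are hypothetical, offsets are Python string indexes (code points, not bytes), and difflib may split the change into different operations than the hand-written example above.

```python
import difflib


def char_opcodes(a, b):
    """Return equal / remove / insert operations with character offsets."""
    ops = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            ops.append({"op": "equal", "a_start": a_start, "a_end": a_end,
                        "b_start": b_start, "b_end": b_end})
            continue
        if tag in ("delete", "replace"):
            # A removal consumes text from a and produces nothing in b.
            ops.append({"op": "remove", "a_start": a_start, "a_end": a_end,
                        "b_start": b_start, "b_end": b_start,
                        "content": a[a_start:a_end]})
        if tag in ("insert", "replace"):
            # An insertion consumes nothing from a and produces text in b.
            ops.append({"op": "insert", "a_start": a_end, "a_end": a_end,
                        "b_start": b_start, "b_end": b_end,
                        "content": b[b_start:b_end]})
    return ops


def apply_opcodes(a, ops):
    """Recreate revision b from revision a plus the full operation list."""
    out = []
    for op in ops:
        if op["op"] == "equal":
            out.append(a[op["a_start"]:op["a_end"]])
        elif op["op"] == "insert":
            out.append(op["content"])
        # "remove" operations contribute nothing to the new revision.
    return "".join(out)
```

Because the list covers the whole text and not just the changed regions, apply_opcodes("These are wrd.", char_opcodes("These are wrd.", "These are words.")) rebuilds the second string.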

compressed format:

I don't see the value in compressing the format given that the API doesn't really let you query for more than one diff at a time and diffs tend to be represented in few operations. However, we could simply represent each operation as a tuple with agreed upon field order:

{
  op: "insert",
  a_start: 13,
  a_end: 13,
  b_start: 15,
  b_end: 18,
  content: "foo"
}

could be

[
  "insert",
  13,
  13,
  15,
  18,
  "foo"
]

or, if we really want a tight format (since the rest of the fields are derivable in a sequence of operations):

[
  "insert",
  15,
  "foo"
]
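The six-element tuple form with an agreed-upon field order can be sketched in Python as below. The names FIELD_ORDER, pack and unpack are hypothetical, and the handling of operations without content (stored as None, which serializes to null in JSON) is an assumption. The three-element tight form is not covered, since its rules for "equal" operations are not spelled out here.

```python
# Agreed-upon field order shared by producer and consumer.
FIELD_ORDER = ("op", "a_start", "a_end", "b_start", "b_end", "content")


def pack(op):
    """Convert an operation object into a positional list."""
    # Fields missing from the object (e.g. "content" on "equal") become None.
    return [op.get(name) for name in FIELD_ORDER]


def unpack(row):
    """Convert a positional list back into an operation object."""
    return {name: value for name, value in zip(FIELD_ORDER, row)
            if value is not None}
```

For example, pack({"op": "insert", "a_start": 13, "a_end": 13, "b_start": 15, "b_end": 18, "content": "foo"}) gives ["insert", 13, 13, 15, 18, "foo"], and unpack reverses it.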

(In reply to Aaron Halfaker from comment #3)

Character vs. line offset

I'd much rather represent diffs based on a character offset I'm afraid of
representing position with something like lineno since linebreaks are
differently defined between systems.

Isn't that an argument for line-based rather than character-based offsets?

Character offsets would also allow us
to make changes to our diff detection strategy without changing the output.

Machine readable vs. human readable diffs

Machine readable diff opcode formats tend to represent the full set of
operations used to recreate a revision -- not just the context.

OTOH, what is the usual use of querying the diffs? I suspect it's more often that the client is wanting to display a human-readable diff to the end user than because the client is wanting to do the equivalent of the 'patch' utility on an already-downloaded local copy of the article.

and diffs tend to be represented in few operations.

On talk pages, maybe. But someone heavily copyediting an article is likely to generate a huge number of operations. With the way the diff algorithm works, even some simple edits will generate many operations as it tries to match up individual letters in the old vs new paragraphs.

I talked with Tom Morris recently, and I believe this feature would also make it easier to provide meaningful diffs on Wikidata, which doesn't use the classic content model.

tommorris subscribed. Feb 12 2015, 1:55 PM

Anomie moved this task from Unsorted to Needs details or plan on the MediaWiki-Action-API board. Feb 19 2015, 6:54 PM

Jaeol awarded a token. Apr 15 2015, 6:40 AM

Jaeol subscribed.

Exposing the array or unified format in the API would be useful. The array seems more useful, but the unified string format has the advantage of being a de facto standard that lots of applications will know how to deal with.

We currently already have a DiffFormatter interface, implemented as TableDiffFormatter, ArrayDiffFormatter, and UnifiedDiffFormatter.
Considering these already exist and take the same Diff object, it looks like it wouldn't be very complicated at all to expose these.

Krinkle renamed this task from "Make it possible for edit diff to be provided as a raw text" to "API should provide revision diffs in machine readable format". Jun 12 2015, 1:22 AM

Krinkle set Security to None.

Krinkle removed a subscriber: • wikibugs-l-list.

Following a conversation with @Halfak at the Lyon Hackathon 2015, we agreed the unified format isn't all that useful. We can consider exposing it if there's actual interest expressed by users, but based on the needs of ClueBot, Huggle, ORES, and other tools, an array format would avoid re-inventing diff logic everywhere.

But the existing array format isn't very useful, and actually has some logic errors in it (probably because it was never tested). Ideally we'd come up with an array format that exposes intraline diffs (word-level diffs).

MediaWiki already has logic for this in the HTML formatter, but that's not currently written in a re-usable way. The format of https://phabricator.wikimedia.org/T56328#571222 seems reasonable. @GWicke also has ideas for this.

Once implemented, we can expose it through ApiRevisionQuery for starters. See T100082 for more information on how we can make those diffs available for high-volume consumers.

Krinkle mentioned this in T100082: Provide useful diffs to high-volume consumers of recent changes. Jun 12 2015, 1:49 AM

Krenair subscribed. Jun 12 2015, 1:52 AM

Krinkle added a parent task: T100082: Provide useful diffs to high-volume consumers of recent changes. Jun 12 2015, 2:01 AM

jayvdb subscribed. Jun 17 2015, 2:42 AM

Krinkle renamed this task from "API should provide revision diffs in machine readable format" to "Provide intraline diff format in revision query API". Nov 2 2015, 10:44 PM

Krinkle mentioned this in T117279: [EPIC] Core should provide inline diffs as well as side by side (Move InlineDifferenceEngine into core / remove MobileDiff).

Restricted Application added a subscriber: Aklapper. Nov 2 2015, 10:45 PM

Krinkle updated the task description. Nov 2 2015, 10:47 PM

Catrope unsubscribed. Nov 13 2015, 7:43 PM

Yair_rand subscribed. Apr 4 2016, 9:23 PM

Krinkle added a subtask: T117279: [EPIC] Core should provide inline diffs as well as side by side (Move InlineDifferenceEngine into core / remove MobileDiff). Apr 21 2016, 11:49 PM

WMDE-leszek subscribed. Jun 1 2016, 7:36 AM

I would find some kind of inline diff output from the API very useful. I've seen excellent response from new users to the Special:MobileDiff version of diffs, compared to traditional diffs: they use the same design concept as 'track changes' in word processor documents, so users new to wikitext don't get as hung up on all the layout and code... especially if what they are looking at involves primarily changes of textual content.

I want to build a diff viewer (for dashboard.wikiedu.org), and I was sad to discover that inline diff html is not available right now except by scraping Special:MobileDiff... which may be the approach I have to take for now.

Halfak added a parent task: T104072: Implement single column diff option (from mobile app). Nov 3 2016, 2:27 PM

Related task: T157177

Task description: "There does not seem to be any way to get diff information in any other format than rendered HTML (which, beyond being ugly as hell, is unsuitable for analysis, conflict resolution and any number of other things one might want diff data for). There should be a prop=revisions flag that makes it output diffs in JSON or some other machine-readable format."

"Example of such an API call:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=revisions&titles=Apple&rvprop=content%7Cids%7Cuser%7Ccomment&rvlimit=max&rvdiffto=prev"

"Parsing HTML certainly does lead to a plethora of errors."

In T56328#571210, @Anomie wrote:

"diff": [

{ "line": 1, "type": "context", "content": "Line" },
{ "line": 2, "type": "removed", "old": "Line" },
{ "line": 2, "type": "added", "new": "Line" },
{ "line": 3, "type": "context", "content": "Line" },
{ "line": 47, "type": "context", "content": "Line" },
{ "line": 48, "type": "changed", "old": "Line", "new": "Line" },
{ "line": 49, "type": "context", "content": "Line" }

]

Something like @Anomie's initial sketch seems ideal.

zhuyifei1999 subscribed. Feb 4 2017, 6:09 PM

Anomie merged a task: T157177: MediaWiki API should return diffs in a sane format. Feb 6 2017, 12:40 AM

Anomie added subscribers: TerraCodes, Tgr.

Something to consider: On Wikidata, diffs for items don't have lines.

Jaeol unsubscribed. Feb 6 2017, 10:24 AM

@Yair_rand How might we use Wikidata's diff structure in the API for Wikipedia? It's often easier to list changes on Wikidata because changes are made to specific properties/fields, rather than to bodies of text.

Anomie mentioned this in T164106: Deprecate parsing and diff options in ApiQueryRevisionsBase. Apr 28 2017, 7:12 PM