Page MenuHomePhabricator

Deprecate comparing first revision to prev, last revision to next in API
Open, MediumPublic

Description

Currently, relative comparison at first and last revision in the API has behaviour that is defined but conceptually inconsistent. @Anomie has proposed that we change the behaviour to be more consistent with the conceptual model of a series of revisions, as well as the behaviour of the Web UI.

One mechanism for making this change would be an interim step where a deprecation warning is given in the "warnings" property of the results. After this, the result would be a baddiff error as in https://phabricator.wikimedia.org/T203433.


Reasons for doing this include:

  • Consistency with all other cases where a specified revision does not exist or is inaccessible, such as
    • Specification of a page that does not exist.
    • Specification of a revision ID that does not exist.
    • Specification of a deleted revision, when the user lacks the necessary rights to view it.
    • Specification of a revision with hidden or suppressed content (RevDel), when the user lacks the necessary rights to view it.
    • Attempting to use torelative=prev or torelative=next relative to a deleted revision, where it's unclear whether that should be the previous/next deleted revision, the previous/next live revision by timestamp, or the previous/next revision by timestamp regardless of whether it's live or deleted.
    • Attempting to use torelative=cur with a revision from a deleted page.
  • Removing an inconsistency in non-error behavior with the web UI, which displays "no differences" rather than a diff from an empty revision for the previous-to-first case.
  • Avoiding the fact that an empty revision for the previous-to-first case is not correct when a page has default content, as for example is the case for messages in the MediaWiki namespace.
    • For example, the first real edit to enwiki's MediaWiki:Protectedpagetext was this diff. Absent that "MediaWiki default" revision, we'd have no way of knowing that that was the default content of the page at the time the edit was made.

Clients currently requesting such comparisons would have a few options, including

  • Use prior knowledge that the revision in question is the first/last, and avoid making the call to action=compare.
  • Check whether the from-revision has a previous or next revision, e.g. with action=query&prop=revisions&rvstartid=X&rvlimit=2, and avoid making the call to action=compare.
  • Make the request to action=compare and handle the error, analogous to pseudocode code like
try {
    $result = $api->compareToPrevious( $rev );
} catch ( NoPreviousRevisionException $ex ) {
    /* Populate $result in some other way */
}

Also note that, at least for the case of a wikitext page, you can safely get a diff from an empty revision by using the fromslots and fromtext-main parameters, as for example https://en.wikipedia.org/w/api.php?action=compare&torev=1906920&fromslots=main&fromtext-main=.

Event Timeline

After this, the result would be a baddiff error as in https://phabricator.wikimedia.org/T203433.

More likely the result would have a distinct code, probably "compare-no-prev" or "compare-no-next".

Change 481190 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] ApiComparePages: Deprecate special behavior with no prev/next revision

https://gerrit.wikimedia.org/r/481190

So, I think we have a conceptual issue here that's probably worth discussing.

In versioning systems you have two choices of what you can store. You can either store each state that the document has been in (state A, B, C, D, ...), or you can store the differences between state (delta(null, A), delta(A, B), delta(B, C), delta(C, D), ...).

In MediaWiki we store the full state, and we provide this API function to compare two states.

But some users and developers think about the history in terms of deltas. We talk about edits, contributions, or changes.

Maybe we should support that metaphor for version history with a different but very similar API action, like asdiff ("show me this revision as a diff").

It would return the value of a revision as a difference from the previous state (where the previous state is defined as null for the first revision). The only arguments would be the revision ID and output formatting options. It would work exactly like compare?torelative=prev, with the exception that for the first revision there would be a meaningful value.

I think the rough pseudocode would be something like:

function asdiff($revid, $options) {

if (isFirst($revid)) {
  return diff of $revid from empty page;
} else {
  return compare($revid, previousRevision($revid), $options);

}
}

This might be helpful for client developers who are looking at revisions as diffs, and would let some conceptual pressure off of compare, which does seem to be more about having the difference between two states.

This might be helpful for client developers who are looking at revisions as diffs, and would let some conceptual pressure off of compare, which does seem to be more about having the difference between two states.

Exactly. What you've described is basically git show which is how we use compare on the InteractionTimeline. Or from their documentation:

For commits it shows the log message and textual diff.

when I run it it always uses the parent commit to diff against, or it will use null if it's the first commit.

I think what our compare endpoint does is basically what git diff does, so having another endpoint may help reduce the confusion.

I suppose another way to say it would be...
I (as the client) want the revision as a delta, I don't care how it is stored.

Maybe we should support that metaphor for version history with a different but very similar API action, like asdiff ("show me this revision as a diff").

Note we deprecated the ability to get diffs from prop=revisions a while back as it was already severely limited for performance reasons. See T164106: Deprecate parsing and diff options in ApiQueryRevisionsBase.

It would return the value of a revision as a difference from the previous state (where the previous state is defined as null for the first revision).

I note that, with selective undeletions, imports, history merges, and so on, the identification of "previous state" is a bit nebulous to be a first-class concept.

I note that, with selective undeletions, imports, history merges, and so on, the identification of "previous state" is a bit nebulous to be a first-class concept.

I would expect it to diff against null if rev_parent_id is 0. If that database field is not accurate, then that is a different issue that should probably be resolved.

I would expect it to diff against null if rev_parent_id is 0. If that database field is not accurate, then that is a different issue that should probably be resolved.

See also T185167 and T193690, which are both unfortunately extremely overcomplicated.

See also T185167 and T193690, which are both unfortunately extremely overcomplicated.

I think those are edge cases that ought to be resolved independently of this issue.

@Anomie There shouldn't be a significant performance difference between revision-as-diff and compare, except for the first revision.

I understand that complicated version histories could make "the previous revision" tricky. We can document that. But we already deal with a lot of this with compare, right? Whatever "prev" means for compare, it means here, except for the first revision.

@Anomie There shouldn't be a significant performance difference between revision-as-diff and compare, except for the first revision.

The performance difference is when one request is calculating 500 diffs instead of just 1.

I understand that complicated version histories could make "the previous revision" tricky. We can document that. But we already deal with a lot of this with compare, right? Whatever "prev" means for compare, it means here, except for the first revision.

With compare a diff isn't trying to be a first-class concept, though. A compare can fail for all sorts of reasons and it's not a big deal, while it may be more problematic for an interface that claims to have the diff as the thing for every revision.

For our use case, it's fine to only be able to retrieve one diff per request.

My 2¢:

  • next/prev should not use rev_parent_id, at least for now. They should be consistent with the listing in the page history, which is simply sorted by timstamp (and, if two timestamps are the same, by revision id), ignoring any deleted revisions. This is the canonical sequence of the page history, and should thus be the basis for prev/next in diffs.
  • A diff conceptually represents the data added and removed by a single edit.
    • this notion is problematic when the "previous" revision in the sequence is not actually the revision the edit was based on, because that revision has since been deleted/suppressed. This situation can be detected by comparing rev_parent_id to the current previsous revision. If they are different, this should be communicated somehow to the client, so a warning can be shown to the user.
  • diffing the latest revision against "next" is thus an error, or at least "nothing": there is no edit after the latest revision, so there is no diff. An empty diff could be used, but I find this semantically dubious.
  • diffing the first revision against "prev" is not an error: the edit that created the page did add things, and this can and should be represented as a diff. A diff against empty content. ContentHandler in fact has a method for instantiating an empty content object primarily for this use case.
  • this notion is problematic when the "previous" revision in the sequence is not actually the revision the edit was based on, because that revision has since been deleted/suppressed. This situation can be detected by comparing rev_parent_id to the current previsous revision. If they are different, this should be communicated somehow to the client, so a warning can be shown to the user.

As detailed elsewhere, it's not clear whether rev_parent_id was originally intended to point to the current previous revision linked-list style or was intended to identify the previous revision at the time of creation without regard for later changes. Historically the size-change indicator in page history was assumed to be consistent with the former, most operations worked consistent with the latter, and undeletion didn't work consistent with either definition (although in practice most uses would have been consistent with the former) but is now consistent with the latter.

Also, if "deleted/suppressed" includes revision-deletion, the parent revision is still there in the page history and by either definition would still be the previous revision.

  • diffing the first revision against "prev" is not an error: the edit that created the page did add things, and this can and should be represented as a diff. A diff against empty content. ContentHandler in fact has a method for instantiating an empty content object primarily for this use case.

But as noted earlier, the empty content is not necessarily what the edit that created the page was actually based on, for example for messages in the MediaWiki namespace. And as noted earlier, if someone really wants a diff against empty content they can still achieve that by supplying empty content explicitly with a request like https://en.wikipedia.org/w/api.php?action=compare&torev=1906920&fromslots=main&fromtext-main=.

I also note that the web UI has different behavior in this situation: it displays an empty diff as in the previous bullet.

eprodromou subscribed.

Per product discussion, I'm going to work on this proposal and see what we can do to go forward with it.

Change 481190 abandoned by Umherirrender:

[mediawiki/core@master] ApiComparePages: Deprecate special behavior with no prev/next revision

Reason:

Old outdated patch set

https://gerrit.wikimedia.org/r/481190