Page MenuHomePhabricator

Implement GET Revision Comparison
Closed, ResolvedPublic3 Estimated Story Points

Description

These task depends on T231586

Acceptance Criteria:

  • Define route for /revision/{id}/compare/{other}
  • Ensure wiring is available to wikidiff2 instance described in T231586
  • Define route handler for compare that handles the following:
  • Request:
    • Must support HTTP GET/HEAD only
  • Response:
    • Response must return JSON
    • Responses JSON must have structure:
{
    "from": {
      "id": 889268680
    }
    "to": {
       "id": 848883822
    }
    "diff": /* output from wikidiff2 */
}
  • diff: A diff giving the changes associated with the specified slot role. The diff format may depend on the content handler but is expected to be a JSON diff as generated by wikidiff2_inline_json_diff(). Such a JSON diff is an object with a single top-level key "diff".
  • Response headers must contain:
  • Response Statuses
    • 200 - OK
    • 400 - from and to IDs are from different pages or content type is not wikitext
    • 404 - One or both version IDs don't exist

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@eprodromou We need to define Diffs here, is there an example structure?

WDoranWMF renamed this task from Implement revision comparison to Implement GET Revision Comparison.Aug 29 2019, 6:21 PM

@WDoranWMF I think we've got an output format from @Tsevener ... ? I'll see if I can track it down.

WDoranWMF set the point value for this task to 3.

Namespacing, and the details of how it will impact this endpoint, are being discussed in task T232485.

@eprodromou , in code review of T232355, it came up that the path

/version/{id}/compare/{other}

may be more of a "revision" thing than a "version" one. Is there a reason we chose "version", or should this route be renamed?

@BPirkle that was my mistake, that should have been updated to revision, changing description now.

To repeat my comment from T231347: The diff format is needed, and JSON is not one of the options. I think the options are "table" (wikidiff2_do_diff), "inline" (wikidiff2_inline_diff) and "structured" (wikidiff2_inline_json_diff, T232231). So we would have e.g. /revision/{id}/compare/{other}/table , which would have HTML in JSON. For "structured", hopefully it is not necessary to double-encode the JSON.

Why is it diffs plural? A diff is what you get when you compare two revisions.

Since there was no response to my questions here, I'll just update the task description to address these problems.

I asked Evan to reply to this but in the interim, this structure seems to come directly from the iOS team and is, I think, reflected in their updates to wikidiff2 but it would be good to have that confirmed.

I made the following changes. Please revert if any are unacceptable:

  • Added a format parameter to the path, here /structured, so that we may also add /table and /inline.
  • Changed method specification from GET only to GET or HEAD, to conform with the HTTP specification and existing routing code (T226043)
  • Removed the requirement that the request body be empty. RFC 2616 prohibits such a request body, whereas RFC 7231 says that such a request body has "no defined semantics", which might be interpreted as a suggestion to ignore the body, which is the normal handling of such requests. In any case, the endpoint seems like the wrong place to implement such a requirement. I don't think any endpoint should interpret a GET request body: RFC 7231 says that "sending a payload body on a GET request might cause some existing implementations to reject the request", which seems like a good enough reason to not do that. Proxies may fail to forward it. The best way to enforce this handling would be to have RequestFromGlobals::getBody() always return an empty stream for a GET request.
  • Specified that "diffs" contains a JSON object mapping MCR slot role name to the result of wikidiff2_inline_json_diff(), rather than "TBD".
  • Renamed "title" to "from_title" and added "to_title" since there was no requirement that the revision IDs refer to the same page.
  • For clarity given the above change, renamed "from" and "to" to "from_revision" and "to_revision" respectively.
  • It said that a 304 should be generated if there is "no change between revisions". I'm not sure if this is just unclear wording, but RFC 7232 defines the 304 status code as something which can only be used in response to IMS/INM, it is not possible to send it if the from and to revisions merely contain the same content. If the from and to revisions merely contain the same content, the diff array will be empty, thus allowing the client to receive the title and other metadata.

The tricky part of this is the interaction with MCR and ContentHandler, and the question of what we should do if a ContentHandler cannot provide a diff in the requested format, say due to wikidiff2 not being installed. Is it correct to give a 500 error? What if the main slot is representable as requested but a secondary slot is not?

Is it better to define this endpoint as requesting diffs only on the main slot? If so, then we could use content negotiation instead of a format parameter, and if wikidiff2 is not installed we could just return a 406 Not Acceptable status code.

If only the main slot is to be returned, how should other slots be requested? Should the slot role name be in the path, i.e. /main initially with later extension to allow diffs for any slot?

Content negotiation version of this, for consideration:

  • Path: /revision/{id}/compare/{other}/{role}
    • {id} is the "from" rev_id
    • {other} is the "to" rev_id
    • {role} is the slot role, typically "main"
  • Request headers:
  • Response headers:
    • Content-Type: one of
      • application/json+structured
      • application/json+htmltable
      • application/json+htmlinline
  • The server's preferred content type, to be delivered for e.g. "Accept: */*", shall be application/json+htmltable, since a pure PHP implementation is available, and ContentHandler implementations currently support only this type. The iOS app could accept only application/json+structured, or could specify a fallback sequence.
  • If content negotiation fails then a 406 status shall be returned.

Having written it out, it's kind of neat. Maybe we can pull in a library to do the Accept header parsing.

In the content negotiation version you would also have a directory index style endpoint /revision/{id}/compare/{other}. This would just list the role name of each modified slot.

Some additional considerations which apply to either interface style:

  • If one of the revisions is revdel'd or otherwise inaccessible with the FOR_PUBLIC audience, we would return a 403. It's important to check accessibility even for INM/IMS requests.
  • This endpoint is not purgeable. It could have an Expires header with a very short expiry time, but eventually the response has to be revalidated to recheck accessibility.
  • 403s could also have an ETag and a short Expires, it's not necessary to make them completely uncacheable.
  • 404s could have a short Expires, although we have to be careful with replication lag when accessing new revisions.
  • The ETag only needs to encode the parameters and accessibility. The content does not vary otherwise. In the content negotiation case, the ETag has to include the negotiated response type. It could be say sha1("$from:$to:$exists:$accessible:$format")
  • There needs to be object caching in MW since diffs are very expensive to generate (potentially tens of seconds), and the CDN cannot be relied upon to cache them.
  • We can have Cache-Control headers that allow ATS/Varnish to keep the object in cache for a long time but with regular (every few seconds) INM revalidation. In the content negotiation interface style, this requires "Vary: Accept".
  • For whitelist read wikis, CDN caching needs to be suppressed.

It's not possible to do an IMS/INM revalidation of a 403 or 404 response, since a 304 status code is defined as implying that the status code would have been 200. So the best we can do for those responses is a short Expires.

If a page is renamed, the cache is invalidated, even though the diff does not change, because the response as proposed includes the page title. This makes it very difficult to support IMS requests, since we would have to know when the page was last moved. There is an index on log_page but it is marked as a candidate for removal in tables.sql on account of it not being used for anything in production. But even if we used it, the logging table is not usually used for cache invalidation so interpretation would be difficult, and it may be incomplete, not containing all moves.

If it wasn't for the inclusion of the page title in the response, the Last-Modified date could just be the more recent of the two revision timestamps.

INM can be trivially and efficiently handled by including the page title in the hash, and is sufficient for CDN caching. So my question for @EvanProdromou is whether the proposed client really needs IMS to be efficiently implemented, or whether INM is sufficient. If IMS is really a hard requirement, then can the page title be dropped from the response body?

Is the title meant to be the DB key, or the text form, or the display title (HTML formatted), or a plain text representation of the display title? If it's the display title then should it be in the user's variant?

Change 538461 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] [WIP] REST compare endpoint (MCR bundle variant)

https://gerrit.wikimedia.org/r/538461

In the WIP patch, instead of "from_revision" and "from_title" I had a nested structure:

{
  "response": {
    "from": {
      "id": 1522,
      "title": "Main_Page"
    },
    "to": {
      "id": 1526,
      "title": "Main_Page"
    },

I'm changing the task description to match. Again, feel free to revert. It was just convenient to handle the two revisions in a structured way.

@tstarling instead of "/structured", can we just use "/json", as with other endpoints?

I closed T232488 with a comment summarising the discussion between me and @eprodromou regarding /json suffixes. Everything is JSON. I'm not sure if we resolved on the exact name of the /structured suffix but I think it would be confusing to use /json when all the formats are a kind of JSON, hence the need to think of another name for it.

@eprodromou was happy to drop the titles from the response in order to allow Last-Modified and If-Modified-Since to be efficiently implemented. That reopens the question of the structure of the "from" and "to" response keys. If we leave it the same, i.e. {"from": {"id": 1522}, "to": {"id": 1526}} that will leave a good place to extend the endpoint to add more revision metadata in the future, so I think that's what I prefer.

We discussed MCR bundling versus a single-slot endpoint, and I believe @eprodromou preferred a single-slot endpoint with the slot role name in the path. We didn't discuss content negotiation.

I updated the task description according to the above comment. I also added "slot_role" to the "from" and "to" objects, giving the slot role name.

  1. No content negotiation, please.
  2. I don't want the slot name in the URL. For now, we are comparing the main slot. I've updated the URL template back to the original.
  3. I've taken the "response" indirection out of the response; I'd prefer if we didn't use that idiom.
  4. I've added the requirement from the user story that the from and to must be revisions of the same page. I realize that other interfaces allow it; I'd like to keep that restriction for this endpoint and see if we need the expanded version in the future. As you point out, it saves us a lot of page metadata in the output. I've added an error code for when that's the case.
  5. I've left your bundling of properties of the revisions as an object. I think that's a good pattern and something we should document in design principles.
  6. If the from/to are revisions of the same page, I think we don't need a title.
  7. For now, I'd prefer if we just throw an error if the content type is not wikitext. We can add in support for other text content types or even binary content types later.

In the future, if we need to compare revisions from different pages, or compare different slots, or content types, let's create REST API endpoints for those, or we'll revise this one.

Until then, let's use the YAGNI principle to keep the cognitive load on developers as low as possible, and to lower the effort to get an implementation done.

  1. I don't want the slot name in the URL. For now, we are comparing the main slot. I've updated the URL template back to the original.

This makes this API unusable for commons. It also means that it will have to change when we improve content modularity, e.g. my making the page summary directly editable.

  1. I've added the requirement from the user story that the from and to must be revisions of the same page.

Yes, please. Diffs between revisions of different pages makes things quite painful.

  1. If the from/to are revisions of the same page, I think we don't need a title.

Not having the title in the URL means we can't optimize based on the title or page. Such optimizations may become relevant on all levels, from the network edge to the database level. I'd recommend to have it in the URL *especially* if bnoth revisions always belong to the same time.

  1. For now, I'd prefer if we just throw an error if the content type is not wikitext. We can add in support for other text content types or even binary content types later.

As a baseline, I see no problem with that. But keep in mind that this means the API is not usable to review edits to critical pages such as common.js.

In the future, if we need to compare revisions from different pages, or compare different slots, or content types, let's create REST API endpoints for those, or we'll revise this one.
Until then, let's use the YAGNI principle to keep the cognitive load on developers as low as possible, and to lower the effort to get an implementation done.

I think we should be considerate of the long-term cognitive load, not just the short-term load. Introducing more similar-but-not-quite-the-same APIs down the road is going to be confusing for client developers, and maintenance burden for core developers. One-off solutions that need to be maintained forever but do not cover all use cases we need, leads to more one-off solutions. This is one of the reasons we have so much madly maintained code.

I agree that focusing on the use case at hand is the right thing to do. But completely disregarding future use cases which are already on the horizon doesn't seem wise.

This makes this API unusable for commons.

So, I think we should discuss with iOS team how they would visually represent changes to different slots.

It seems like the Web interface for diffs when they have the mediainfo slot filled is pretty straightforward.

But, yes, agreed, this seems pretty important for commons and should be addressed soon.

  1. If the from/to are revisions of the same page, I think we don't need a title.

Not having the title in the URL means we can't optimize based on the title or page.

We were discussing title in the output! I'm not crazy about requiring a title in the URL, since it's requiring the client developer to provide data we don't actually need to get a unique diff. (We can do that with just the two unique revision IDs.)

  1. For now, I'd prefer if we just throw an error if the content type is not wikitext.

As a baseline, I see no problem with that. But keep in mind that this means the API is not usable to review edits to critical pages such as common.js.

Agreed. I think that many of our "exotic" content types, like JS and CSS, will be perfectly fine for doing line-by-line text diffs, as with wikitext. For structured data, we may want to have a different format, and if we get into binary content types (which I think is possible but not yet used...?) we'd need to figure that out.

I agree that focusing on the use case at hand is the right thing to do. But completely disregarding future use cases which are already on the horizon doesn't seem wise.

Agreed. I see a few ways we could support having more slots and different content types with backwards compatibility.

We were discussing title in the output! I'm not crazy about requiring a title in the URL, since it's requiring the client developer to provide data we don't actually need to get a unique diff. (We can do that with just the two unique revision IDs.)

Oh, right. I personally don't care whether the title is in the output or not.

I do care about it being in the input. We *do* need the Title internally, for permission checks. We end up loading the title based on the revision ID.

To me, the fact that we can map from revision ID to title is an implementation detail, which we may not want to guarantee forever.

Every public API where we allow revision IDs without titles as input forces us to maintain infrastructure for making this lookup possible, and prevents us from optimizing based on title. I don't see a use case that would justify imposing this limitation on ourselves. I can't think of a situation where a client would have revision IDs without having knowledge of the page title they belong to.

If you put the title in the URL then you are back to the same place with inability to support IMS revalidation. We need to preserve the ability to map from the revision ID to the page in order to support Last-Modified headers. To recap: if the title appears in the output (or the path) then the Last-Modified header depends on the move history, which we don't store in a structured, usable form.

If you put the title in the URL then you are back to the same place with inability to support IMS revalidation.

You mean, when a page is renamed, cached copies of the response in the CDN would be become orphaned? Right, I didn't think of that. But it doesn't seem terrible to me. If you put the title into the response bot not into the path, that would cause problems with stale cache entries. If you put it in the path, the problem goes away, because you'd just have a cache miss, right?

Also, page moves are relatively rare, and we currently don't cache diffs in CDN at all, afaik. The internal (object cache) key can continue to be based on just the revision IDs, so that could continue to work, unless we put the title into the cached blobs.

If we have two explicit revision IDs and a title in the path, we should be able to do IMS without an internal lookup, since we already have the title (we can check later if the revisions actually belong to the correct page ID). In my understanding, this should make things simpler for IMS, not harder. Am I missing something?

Now, if we have relative revision ids, such as prev, next, or current/0, then IMS depends on the edit history. But we should still be able to check IMS based on the page's touch timestamp.

For the record, I don't have terribly strong feelings about this, and I don't want to block progress. It just seems like we are passing on an opportunity for optimization, and I'd like to understand the tradeoff involved. I'm probably just missing something obvious.

Every public API where we allow revision IDs without titles as input forces us to maintain infrastructure for making this lookup possible, and prevents us from optimizing based on title. I don't see a use case that would justify imposing this limitation on ourselves. I can't think of a situation where a client would have revision IDs without having knowledge of the page title they belong to.

From the REST API design principles:

One unique identifier should be enough to get the data you want. If the client developer has a unique identifier available, they shouldn’t have to put together a lot of other data to form a URL. So, if they have a revision ID 12345, they should be able to GET /revision/12345, instead of digging around somewhere for a page title to compose /page/France/revision/12345.

Now, if we have relative revision ids, such as prev, next, or current/0, then IMS depends on the edit history. But we should still be able to check IMS based on the page's touch timestamp.

I should note that this endpoint is for explicit revision IDs, and not named relative ones like "prev", "cur", "next", "first" or "last".

I should note that this endpoint is for explicit revision IDs, and not named relative ones like "prev", "cur", "next", "first" or "last".

Sure. The question is whether it is wise to build it in a way that would require us to introduce a new endpoint if we end up needing them afterall. I'm not saying we should build features we don't need, I'm saying we should architect for flexibility, if the cost for doing so isn't high. The cost for havign the title in the URL is, as far as I can tell, zero.

From the REST API design principles:

Which REST API design principles?

EDIT: found it, and commented there. I'm basically arguing for "leave space for change".

One unique identifier should be enough to get the data you want. If the client developer has a unique identifier available, they shouldn’t have to put together a lot of other data to form a URL. So, if they have a revision ID 12345, they should be able to GET /revision/12345, instead of digging around somewhere for a page title to compose /page/France/revision/12345.

I agree that it would be bad if additional requests would have to be made to discover a title. I just can't think of a real life use case for MediaWiki where a client would have a revision ID without knowing the title it belongs to. Do you have an example of how that could happen? If we have such use cases, then never mind what I said. My argument is entirely based on the assumption that if there are any such use cases, they are very rare.

@tstarling per our discussion I'm wiping the cache header requirements from all these endpoints. I'll add separate tasks for providing them in the future.

@WDoranWMF @BPirkle @daniel Do we need to do anything to move this task along? Is it possible to merge it when Tim isn't available? Do we need a reviewer?

Just speaking for myself, I understand what this task is about conceptually, but I don't know anything about its status or the details of the code. I'm happy to take a closer look if time allows, but I assume my priority is to complete the tasks that I'm the primary coder for.

If there is someone already familiar with this code that could review it, that would be very helpful.

Change 542212 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Conditional request support for REST compare endpoint

https://gerrit.wikimedia.org/r/542212

Change 538461 merged by jenkins-bot:
[mediawiki/core@master] REST compare endpoint

https://gerrit.wikimedia.org/r/538461

The handler for this route doesn't seem to check read permissions for the pages it is comparing. That means it could be used to bypass per-page read permissions. The DifferenceEngine class responsible for showing diffs in the UI does apply these checks, though I note that ApiComparePages seems to be lacking them as well.

This state of incomplete support for fine grained read restrictions seems dangerous. We should be clear on either supporting it, or dropping support for it. Enforcing the restrictions only in some places is misleading. My suggestion would be to add the appropriate permission checks wherever we find them missing.

The handler for this route doesn't seem to check read permissions for the pages it is comparing.

I started T235528 for this bug.

In documenting this endpoint, I've run into a few possible improvements to the user experience.

The diff produced by wikidiff2 is an object that contains a diff array and a sectionTitles array. (This info comes from Toni, who implemented this in wikidiff2.) Condensed, it looks like:

{
  "diff": [
    {
      "type": 0,
      "...": "..."
    }
  ],
  "sectionTitles": []
}

The response we're currently getting from the compare revisions endpoint includes a couple of extra wrappers that makes this a bit more confusing. Again, condensed:

{
  "response": {
    "from": {
      "id": 388864,
      "slot_role": "main"
    },
    "to": {
      "id": 388863,
      "slot_role": "main"
    },
    "diff": {
      "diff": [
        {
          "type": 0,
          "...": "..."
        }
      ],
     "sectionTitles": [],
    }
  }
}

If it's possible to remove the response wrapper and the extra diff wrapper, that would help make this response easier to understand and manage.
(Note: sectionTitles isn't currently included in the responses from the beta cluster, so I'm making an assumption about where it will eventually show up in the response.)

@apaskulin Awesome. I've created T235866 and T235864 to capture those two issues.

Because I created the two other issues for the endpoint, I'm marking this done, but I can't mark the user story done until the output is correct.

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks

Change 542212 abandoned by Ppchelko:
Conditional request support for REST compare endpoint

Reason:
Not relevant anymore

https://gerrit.wikimedia.org/r/542212