Page MenuHomePhabricator

Implement GET Page Revision
Closed, ResolvedPublic2 Estimated Story Points

Description

Acceptance Criteria:

  • Define route for /revision/{id}/bare
  • Define route handler for revision that handles the following:
  • Request:
    • Must support HTTP GET only
    • Request body must be empty
  • Response:
    • Response must return JSON
    • Responses JSON must have structure:
{
    "id": 889268680,
    "page": { 
       "id": 83784732,
       "title": "Some really great title"
    },
    "user": {
       "name": "The Blade of the Northern Lights",
       "id": 1234567
    },
    "comment": "This was a good edit",
    "timestamp": "2019-01-10T14:23:21Z",
    "size": 2000,
    "delta": -25
}
  • size must contain the size of the revision in bogobytes
  • delta must contain the difference in size from the previous revision

Event Timeline

WDoranWMF set the point value for this task to 2.Sep 4 2019, 12:19 PM

@eprodromou , just to confirm: when I was working on the Parsoid REST API endpoints, I was hitting urls like:

<stuff>/rest.php/localhost/v3/transform/wikitext/to/lint

But for the endpoint I'm implementing in this task it should simply be:

<stuff>/rest.php/revision/{id}

No version or other namespacing or decoration. Is that correct? I think so, because I was able to successfully hit:

<stuff>/rest.php/user/admin/hello

But I just want to make sure before I get too far on this next set of endpoints.

(<stuff> is my local wiki's host/port/etc and isn't really relevant to the question)

Namespacing is now covered in task T232485, and the details of how it will impact this endpoint are being discussed.

@eprodromou is there documentation for what should be returned in various error conditions? The most obvious and likely in this case would be if the revision could not be found, but I'm curious if we have any general conventions about error return values for the REST API.

I notice that ParsoidHandler.php tends to do things like:

return $this->getResponseFactory()->createHttpError( 400, [ 'message' => $errorMessage ] );

But some decisions in the Parsoid routes are for compatibility reasons, so not sure how applicable that is here.

But some decisions in the Parsoid routes are for compatibility reasons, so not sure how applicable that is here.

Regarding Parsoid errors, RESTBase is the only service talking to Parsoid directly currently, and it sends it's own errors to the clients. It doesn't depend on the details of the Parsoid error response, only the HTTP code. Thus I think we can change error body format to whatever we want. A good standard to follow in my opinion could be https://tools.ietf.org/html/rfc7807

Change 537221 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Core REST API handler for GET Page Revision

https://gerrit.wikimedia.org/r/537221

@BPirkle I think for errors we'd like to lift a page from the RESTBase handbook and use RFC 7807 error format.

I know that @tstarling did some work on error message internationalization in T226598. Tim, anything to watch out for here? Should/does the REST Router handle the error, or is it up to the handler?

Thus I think we can change error body format to whatever we want. A good standard to follow in my opinion could be https://tools.ietf.org/html/rfc7807

D'oh! @Pchelolo beat me to it.

In general, the response format doesn't seem to care about the fact we have slots now. I think any new API we're designing should take the existence of revision slots into consideration. Otherwise we're creating more tech debt.

html must contain the HTML output equivalent to /index.php?title=<title>&action=render

The result of action=render can be modified by a bunch of hooks. Do we want this to be modifiable by the same hooks?

Expires: far future (should be immutable)

  1. Revision deletions/suppressions make this not immutable. Parsing can be undeterministic (include a template that outputs a current date) - do we care about that?
  2. Why Expires? We use Cache-Control elsewhere.
  3. We don't really have a way to purge the core rest API routes from frontend caches, nor have we discussed the options on how to do that. This means we can't have actively purged content yet. Which means maximum we can do it have a couple of minutes max-age..

ETag: etag for this result set

  1. What would serve as an ETag? In RESTBase we've used to have "rev_id/render_id". We need to standardize on what we're using as Etag in core rest API.
  2. Are we using weak Etags or strong Etags?

html must contain the HTML output equivalent to /index.php?title=<title>&action=render

The result of action=render can be modified by a bunch of hooks. Do we want this to be modifiable by the same hooks?

I think the "equivalent to" means "without any skin or chrome or stuff like that". So, no.

Expires: far future (should be immutable)

[...]

Tim and I discussed dropping all the cache-related headers as too complex for a first revision. I'm going to capture them in another ticket. I deleted them from here.

In general, the response format doesn't seem to care about the fact we have slots now. I think any new API we're designing should take the existence of revision slots into consideration. Otherwise we're creating more tech debt.

I'm cool with adding slots, but I'd prefer that we delay that until after this is released.

Would it be possible to do something like this?

{
    "id": 889268680, /* ... */
    "html": "<div>...A lot of content...</div>",
   "slots": {
     "mediainfo": {
        "content": "....",
        "slot_specific_property": 23
       }
   }
}

So, all slots except "main" go into a slots property that maps the slot name to an object that describes the slot? @daniel any thoughts?

So, all slots except "main" go into a slots property that maps the slot name to an object that describes the slot? @daniel any thoughts?

@daniel , @eprodromou , does the task description still describe what we want for the initial rollout, or do we need to consider slot changes?

Also, Petr asked this in IRC:

I don't understand how it fits in the picture. We're doing it for the mobile app but we're returning desktop HTML?

Any thoughts or concerns there?

@eprodromou

html must contain the HTML for the page without any skin or UI, like action=render

Action=render gives you desktop version of the page. AFAIK, currently we're making it for the iOS app, and AFAIK that app doesn't use the desktop version, it uses MobileFrontend extension. If we want the output similar to what iOS already uses - we can't really put the route in core as is, it would depend on an extension.

Define route for /revision/{id}

I do not think this is a good route name as it's very not future-proof. There's so much stuff that could be returned for a revision. You might want HTML in various different format, you might want summary, images from a page, ORES score etc. If we define a route like this, adding endpoints will be extremely hard as we've already occupied a revision-id route. We need to incorporate a requested format into the route. Having a format in a query (possibly supporting lists of multiple formats) will be quite bad for frontend caching.

So, all slots except "main" go into a slots property that maps the slot name to an object that describes the slot? @daniel any thoughts?

We are using a similar approach for XML dumps, for backwards compatibility.

Treating the main slot different means that code that process as slots has to implement a special case for the main slot, increasing complexity. On the other hand, code that only cares about the main slot would be simpler, it can just ignore the concept of slots.

Which of the two makes sense depends on how we expect MediaWiki to develop in the future. With the frontend architecture work, I personally expect a move to more granular content models, so we would see things like page summaries, infoboxes, categories and such move into separate slots. This makes it more likely the clients are interested in more than the main slot.

However, all this may be besides the point of this use case. If we are talking about HTML for display, the "default" HTML for display for a revision is not the content of the main slot. It's the combined content of all slots. The content of just the main slot (and any other slot) is different from that, and most clients will probably not be interested in it.

Content (HTML or wikitext) for editing will however always be per slot. If we have the "default" content be the main slot's content in some cases, but the combined content in other cases, things become really confusing. So I'd recommend to make the main slot explicit when retrieving content for editing.

One way to do this would be to expose the slots as path components, instead of content structures. The /revision/{id} endpoint would return the combined HTML for display, and a list of available slots (role names), but no further info about them. To access the content and meta-data of a slot, the /revision/{id}/slot/{role} endpoint would be used.

This way, the causal consumer of revisions doesn't have to care about slots or content models, but we still don't have to treat the main slot in any special way.

Define route for /revision/{id}

I do not think this is a good route name as it's very not future-proof. There's so much stuff that could be returned for a revision. You might want HTML in various different format, you might want summary, images from a page, ORES score etc.

I agree. That endpoint seems ok for the revision meta-data, but there are too many "flavors" of HTML one could want (or not want) to make one definite and always serve it. How about /revision/{id}/html/display or /revision/{id}/html/iosapp or something like that?

I see discussion around the HTML we provide via the APIs and want to make sure this doesn't get lost…

a few weeks ago with @tstarling and @EvanProdromou we talked about the HTML GET page API provide. Since we are moving to a single parser that will be Parsoid, the HTML in the end point should be Parsoid content, not the PHP parser content. We don't want to have to version this API later after the PHP parser is deprecated.

Since I think the HTML here should be the same as we provide in the API, this should be the same for consistency it seems and should also be Parsoid output.

Based on discussions with Corey and Tim, I split up the interface for getting a page into the three interfaces: T229663 for page source, T234375 for offline use (HTML embedded in JSON), T234377 for online use (separates JSON and streaming HTML to pass to a processor that can show streaming HTML).

This endpoint would be like T234375, HTML embedded in JSON, so I've given it a similar URL template.

We don't want to have to version this API later after the PHP parser is deprecated.

I don't think we need to have a major version of the API when we go from the legacy PHP parser to Parsoid. There aren't features of the legacy output that aren't available with Parsoid, so I think it's backwards-compatible from the client point of view. (Going the other direction is, I think, a downgrade in features.)

For now, I'd like us to return the HTML from the default parser, and when Parsoid becomes the default parser, we return that instead.

@daniel , @eprodromou , does the task description still describe what we want for the initial rollout, or do we need to consider slot changes?

No slots for this initial rollout. We're just talking about slots to verify that we can add them later and that this design doesn't block using them.

I don't understand how it fits in the picture. We're doing it for the mobile app but we're returning desktop HTML?

Any thoughts or concerns there?

Please just return the HTML from the default parser.

Any thoughts or concerns there?

Please just return the HTML from the default parser.

The combined revision HTML, or the main slot HTML? The combined revision HTML is what we have in the parser cache. We currently don't have the main slot's HTML cached in isolation anywhere.

And, should we take the user preferences into account if there's a logged in user, or is it the HTML that anonymous user would see?

Please just return the HTML from the default parser.

The combined revision HTML, or the main slot HTML? The combined revision HTML is what we have in the parser cache. We currently don't have the main slot's HTML cached in isolation anywhere.

Great. That one.

And, should we take the user preferences into account if there's a logged in user, or is it the HTML that anonymous user would see?

anonymous user

eprodromou updated the task description. (Show Details)

I updated this with the requirements from the user story T231345.

Change 537221 merged by jenkins-bot:
[mediawiki/core@master] Core REST API handler for GET Page Revision

https://gerrit.wikimedia.org/r/537221

A quick heads up on permission checks when serving content, because I didn't see any permission checks in the patch:

The audience checks in RevisionStore and RevisionRecord are ONLY for "suppressed" old revisions. They do NOT check the regular 'read' permission, and never apply for the current revision. Any API that provides access to page content needs to check these permissions. This is also true when not exposing the content, but just the history.

The "BasicAuthorizer" in the REST stack isn't sufficient either. It will prevent access for non-authorized users to "private" wikis, serving as a first line of defense. But it does not perform per-page access checks. I would expect the REST module to perform the same check that ApiQueryRevision uses: $this->getPermissionManager()->userCan( 'read', $user, $title ).

On the patch @BPirkle asked:

Thank you Daniel. Was that comment informational, or did you see something we need to change?

I have not checked, but it seems to me this API now provides access to revision meta-data that the action API would not, in situations where access per-page access control has been applied, e.g. using the Lockdown or SimpleSecurity extension.

If that is the case, this should be fixed by applying the same permission check used by the equivalent module in the action API. This is a bit of an edge case - MediaWiki *implements* fine grained read access controls, but doesn't really *support* them. They are widely used on 3rd party wikis via extensions though, and suddenly leaking private data by intruducing an unchecked entry point seems like a bad idea.

This aside, it seems the handler would also allow reads on a completely private wiki; but as far as I can see, that would be blocked higher up, by BasicAuthorizer. But that should definitely be verified, otherwise we'd cause a *lot* of wikis to leak, including WMF's office wiki.

This only being about meta data, it's not yet highly critical. But that meta data can be sensitive as well, especially edit summaries.

Change 542450 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] REST: Improve access checks for history and revision handlers.

https://gerrit.wikimedia.org/r/542450

Change 542450 merged by jenkins-bot:
[mediawiki/core@master] REST History and compare endpoints followups.

https://gerrit.wikimedia.org/r/542450

I don't see size or delta in the output for this endpoint. Am I missing something?

I don't see size or delta in the output for this endpoint. Am I missing something?

Indeed. Seems like it got lost in the task description updates. Gimme a minute.

@Pchelolo does it make sense to open a new ticket or just leave this ticket open?

Change 545651 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] REST: add size and delta to revision metadata endpoint response.

https://gerrit.wikimedia.org/r/545651

@Pchelolo does it make sense to open a new ticket or just leave this ticket open?

See patch above.

Change 545651 merged by jenkins-bot:
[mediawiki/core@master] REST: add size and delta to revision metadata endpoint response.

https://gerrit.wikimedia.org/r/545651