
Setup RESTBase HTML extract endpoint
Closed, Resolved · Public · 3 Estimated Story Points

Description

We currently use /api/rest_v1/page/summary/ which returns a plain text extract for display in page previews.

We want a version of the endpoint that returns HTML. For scalability reasons, this needs to be in RESTBase. The new version, v2.0.0, will wrap the existing MediaWiki TextExtracts HTML endpoint.

We expect this new endpoint to resolve many existing issues, e.g. bold highlighting.

AC

  • New endpoint marked as experimental (so other devs do not start relying on it). This is already the case with the RESTBase summary endpoint.
  • The response is the same shape as the page summary API.
    • This is so that existing clients don't have to change much to handle it.
  • The extract property is present and contains valid HTML.
  • The extract can contain li, ol, dl, b, and "mathematical expressions" (.mwe-math-element)
  • When a client requests a summary with the v1.1.2 MIME type (or below), we downgrade a v2.0.0 response by stripping the HTML from it
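The downgrade step in the last point could look roughly like this. This is a minimal sketch with hypothetical helper names; a production version would use a real HTML parser rather than a regex:

```javascript
// Hypothetical sketch: downgrade a v2.0.0 summary to the v1.x shape by
// stripping HTML tags from the extract property.
function stripHtml( html ) {
  return html.replace( /<[^>]*>/g, '' );
}

function downgradeSummary( response ) {
  return Object.assign( {}, response, {
    extract: stripHtml( response.extract )
  } );
}

const v2 = { title: 'Monuments (band)', extract: '<p><b>Monuments</b> are a British band.</p>' };
console.log( downgradeSummary( v2 ).extract ); // 'Monuments are a British band.'
```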

Info

Document which textextracts request parameters to change for v2.0.0.

Drop explaintext=true from the current request parameters, i.e.

const parameters = {
  prop: 'extracts',
  exsentences: 5,
  exintro: true
};

e.g. https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=Monuments+(band)&exsentences=5&exintro=1&formatversion=2
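For illustration, the same request could be built programmatically. This is a sketch using the endpoint URL and parameter values from the example above:

```javascript
// Sketch: build the TextExtracts query for v2.0.0. Note the absence of
// explaintext, so the extract comes back as HTML.
function buildExtractUrl( title ) {
  const params = new URLSearchParams( {
    action: 'query',
    format: 'json',
    formatversion: '2',
    prop: 'extracts',
    exsentences: '5',
    exintro: '1',
    titles: title
  } );
  return 'https://en.wikipedia.org/w/api.php?' + params.toString();
}
```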

Specify/discuss any other major upcoming changes that can be done as part of the content type bump.

The intention is to whitelist HTML elements in the extract by tag name, CSS class, or ID so that the AC of T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles is satisfied. Since this only affects the extract property of the response, any change to the whitelist corresponds to only a minor version bump.

@GWicke: I'm not entirely sure how RESTBase migrates from one version of the response to another, but it seems as if we'll need to store both the untransformed and transformed textextract responses so that we can migrate between versions easily, i.e.

function migrate( response, _mostRecentVersion, targetVersion ) {
  const extract = response.rawExtract; // Some property that isn't sent to the client as part of the response.
  let whitelist;

  switch ( targetVersion ) {
    case 'v2.2.0':
      whitelist = [ '.foo', '.bar', '#baz' ];
      break;

    case 'v2.1.0':
      whitelist = [ '.foo', '.bar', '#baz', '.qux.quux' ]; // More permissive than v2.2.0, for example.
      break;

    default:
      whitelist = [ '.foo', '.bar' ];
  }

  response.extract = transformExtract( extract, whitelist );

  return response;
}

Related Objects

Event Timeline

@ovasileva several of us had a chat today and we've proposed this and T165018 for the next sprint as a way of beginning this work. Contact me directly (irc/mail) if you have any questions!

@Jhernandez points out that this should be done by sending the RESTBase Page Summary endpoint a new Accept header referencing the new profile, e.g. /wiki/Specs/Summary/2.0.0.

For clarification, are you planning to migrate the summary end point to return HTML summaries, or are you proposing to add a separate plain-text summary end point?

I could see a general migration to HTML work well, as converting HTML to plain text is relatively straightforward in clients. If backwards compatibility is a concern, then we could use the versioning mechanism to continue to provide plain-text summary responses to old clients.

For clarification, are you planning to migrate the summary end point to return HTML summaries, or are you proposing to add a separate plain-text summary end point?

The former, ideally. IIRC Web and Apps folk both want to consume an endpoint that returns a string of well-formed but strictly limited HTML. What we'd need here is guidance/examples from Services about how we'd switch on the content type header etc.

What we'd need here is guidance/examples from Services about how we'd switch on the content type header etc.

The basic summary of the content format versioning strategy is at https://www.mediawiki.org/wiki/API_versioning#Content_format_stability_and_-negotiation.

This particular change would potentially break old clients expecting plain-text summary content (v1.1.2 or earlier), so a new major version ending in /wiki/Specs/Summary/2.0.0 (as you propose) makes sense. Assuming current clients are indeed sending Accept headers asking for 1.x, we can continue to return conforming responses by stripping the HTML from the 2.0 responses in a RB migration handler. Stored 1.x content will be updated / re-rendered as 2.x. Clients that do not send Accept headers will receive the latest (2.x) spec.
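A sketch of how the profile-based negotiation described above might work. The function names are hypothetical and the real RESTBase handler is more involved:

```javascript
// Hypothetical sketch: derive the requested major spec version from the
// profile parameter of the Accept header; no profile means latest (2.x).
function requestedMajorVersion( acceptHeader ) {
  const match = /Specs\/Summary\/(\d+)\./.exec( acceptHeader || '' );
  return match ? parseInt( match[ 1 ], 10 ) : 2;
}

// Clients pinned to 1.x get a response with the HTML stripped out.
function needsHtmlStripping( acceptHeader ) {
  return requestedMajorVersion( acceptHeader ) < 2;
}
```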

Does this make sense to you, or is there anything I am missing?

Does this make sense to you, or is there anything I am missing?

Thanks! This makes sense. I think this is the (technically) correct approach, and it comes with the cost of having to write the migration handler. So how do we get started?

So how do we get started?

We don't have a per-entrypoint migration handler / filter to use as an example yet. The closest we have is the generic content-type filter, which repeats requests with no-cache set when storage returns content from an older spec. The new migration filters will have a similar structure, but will do some end point specific massaging instead.
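The re-render-on-stale pattern described here could be sketched as follows. The names and response shapes are hypothetical, not the actual RESTBase API:

```javascript
// Hypothetical sketch: if stored content carries an older content type,
// repeat the request with no-cache set so it is re-rendered under the
// current spec; otherwise serve the stored response as-is.
async function getWithMigration( fetchStored, rerender, currentSpec ) {
  const stored = await fetchStored();
  if ( stored.contentType.indexOf( currentSpec ) === -1 ) {
    return rerender( { 'cache-control': 'no-cache' } );
  }
  return stored;
}
```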

I think it makes sense for us (services) to take on the task of writing the migration filter. This is an opportunity for us to create a nice template for this type of task, which will become fairly common as end points get older.

On your end, it might make sense to consider whether you could bundle any other major upcoming changes into the content type bump, so that we don't need to bump the major version too often. We'll also need a bit of documentation on which textextracts request parameters to change for 2.0.

@Jdlrobson: Let's discuss this during kickoff ^

For the reading team to do here:

  • Specify/discuss any other major upcoming changes to bundle into the content type bump
  • Document which textextracts request parameters to change for 2.0

We talked about this in the sprint, but we were not all convinced it was a 3-pointer, as we still need to speak to Services. We'll leave this in "needs analysis" until that has happened.

@Mholloway, @Fjalapeno: In order to do this – introduce a new version of the summary endpoint while maintaining compatibility with existing clients – we need to make sure that the apps include profile="https://www.mediawiki.org/wiki/Specs/Summary/1.1.2" in the Accept header of their requests to the endpoint. This is to make sure that you don't receive newer content, which will include HTML in the extract.
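Concretely, the request header would carry the profile. The exact media type parameters here are a sketch based on the profile URL above:

```javascript
// Sketch: an Accept header pinning a client to the 1.1.2 response shape.
const headers = {
  Accept: 'application/json; charset=utf-8; ' +
    'profile="https://www.mediawiki.org/wiki/Specs/Summary/1.1.2"'
};
```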

Change 354136 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[apps/android/wikipedia@master] Set an accept: header on RESTBase summary endpoint requests

https://gerrit.wikimedia.org/r/354136

Change 354136 merged by jenkins-bot:
[apps/android/wikipedia@master] Set an accept: header on RESTBase summary endpoint requests

https://gerrit.wikimedia.org/r/354136

@GWicke: When will Services be able to pick this up? Is it blocked on the iOS team checking whether the app is sending the correct Accept header?

Can we create a list of summary usages and see what happens to them if we start returning the HTML version by default?

Eventually we want to return the latest version by default and store the latest version. It would be interesting to see what happens with older iOS/Android apps that don't use the Accept header if they start seeing HTML in the summary. cc @Fjalapeno @bearND

In case returning the latest version by default breaks older clients, we should treat this change as backwards-incompatible: return the older version when the Accept header is absent, and switch only once the amount of traffic without the header is very small. In that case we may want to fold T164291 into version 2 to kill two birds with one major version bump.

@GWicke: When will Services be able to pick this up? Is it blocked on the iOS team checking whether the app is sending the correct Accept header?

There is a question one comment up regarding what to return by default. We need an answer to that before deploying anything. Regarding the migration code, I've created a subtask, T166163, to track my part of the work. ETA: later this week.

In case returning the latest version by default breaks older clients, we should treat this change as backwards-incompatible: return the older version when the Accept header is absent, and switch only once the amount of traffic without the header is very small. In that case we may want to fold T164291 into version 2 to kill two birds with one major version bump.

To me, this seems like the best route to introducing this service sooner rather than later while maintaining compatibility with older clients that aren't behaving well.

Another option for keeping backwards compat while avoiding violating the expectations set up in the API versioning document might be to temporarily look at the user agent string. When the request looks like those sent by older Android app versions (based on the user agent), we would rewrite the request in Varnish to ask for the old content type explicitly.

Any updates on the new RESTBase endpoint?

@pmiazga We're having a meeting tomorrow to discuss migration strategy. I've sent you an invite if you're interested.

Great, thanks for the invitation; of course I'll join.

Change 356429 had a related patch set uploaded (by BearND; owner: BearND):
[apps/android/wikipedia@master] Update accept header for summary endpoint

https://gerrit.wikimedia.org/r/356429

Change 356429 merged by jenkins-bot:
[apps/android/wikipedia@master] Update accept header for summary endpoint

https://gerrit.wikimedia.org/r/356429

A new extract_html property is added by this PR.

To be deployed on Monday June 05. We can't force-purge Varnish, so for 2 weeks some of the responses might still not have the new property. Please plan accordingly.

Mentioned in SAL (#wikimedia-operations) [2017-06-02T14:25:46Z] <mobrovac@tin> Started deploy [restbase/deploy@4b14527]: Add the extract_html property to the summary end point for T165017

Mentioned in SAL (#wikimedia-operations) [2017-06-02T14:32:29Z] <mobrovac@tin> Finished deploy [restbase/deploy@4b14527]: Add the extract_html property to the summary end point for T165017 (duration: 06m 43s)

mobrovac assigned this task to Pchelolo.
mobrovac edited projects, added Services (done); removed Services (watching).

The summary end point now contains the extract_html property. Note that for page summaries that are in cache and haven't been modified, the change will become visible only 2 weeks from now.
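During the two-week transition, clients reading the response can guard on the new property. A minimal sketch with a hypothetical helper name:

```javascript
// Sketch: prefer the new extract_html property, falling back to the plain
// extract for cached responses that don't carry it yet.
function getDisplayExtract( summary ) {
  return summary.extract_html || summary.extract;
}
```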

@Pchelolo @mobrovac thanks for getting this implemented so quickly 💪