
Setup RESTBase HTML extract endpoint
Closed, Resolved · Public · 3 Estimated Story Points

Description

We currently use /api/rest_v1/page/summary/ which returns a plain text extract for display in page previews.

We want a version of the endpoint that returns HTML. For scalability reasons, this needs to be in RESTBase. The new version, v2.0.0, will wrap the existing MediaWiki TextExtracts HTML endpoint.

We expect this new endpoint to resolve many existing issues, e.g. bold highlighting.

AC

  • New endpoint marked as experimental (so other devs do not start relying on it). This is already the case with the RESTBase summary endpoint.
  • The response is the same shape as the page summary API.
    • This is so that existing clients don't have to change much to handle it.
  • The extract property is present and contains valid HTML.
  • The extract can contain li, ol, dl, b, and "mathematical expressions" (.mwe-math-element)
  • When a client requests a summary with the v1.1.2 MIME type (or below), we downgrade a v2.0.0 response by stripping the HTML from it
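The downgrade step in the last point could look roughly like this. This is a minimal sketch with hypothetical helper names; a production version would use a real HTML parser rather than a regex:

```javascript
// Hypothetical sketch: downgrade a v2.0.0 summary to the v1.x shape by
// stripping HTML tags from the extract property.
function stripHtml( html ) {
  return html.replace( /<[^>]*>/g, '' );
}

function downgradeSummary( response ) {
  return Object.assign( {}, response, {
    extract: stripHtml( response.extract )
  } );
}

const v2 = { title: 'Monuments (band)', extract: '<p><b>Monuments</b> are a British band.</p>' };
console.log( downgradeSummary( v2 ).extract ); // 'Monuments are a British band.'
```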

Info

Document which textextracts request parameters to change for v2.0.0.

Drop explaintext=true from the current request parameters, i.e.

const parameters = {
  prop: 'extracts',
  exsentences: 5,
  exintro: true
};

e.g. https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=Monuments+(band)&exsentences=5&exintro=1&formatversion=2
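For illustration, the same request could be built programmatically. This is a sketch using the endpoint URL and parameter values from the example above:

```javascript
// Sketch: build the TextExtracts query for v2.0.0. Note the absence of
// explaintext, so the extract comes back as HTML.
function buildExtractUrl( title ) {
  const params = new URLSearchParams( {
    action: 'query',
    format: 'json',
    formatversion: '2',
    prop: 'extracts',
    exsentences: '5',
    exintro: '1',
    titles: title
  } );
  return 'https://en.wikipedia.org/w/api.php?' + params.toString();
}
```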

Specify/discuss any other major upcoming changes that can be done as part of the content type bump.

The intention is to whitelist HTML elements in the extract by tag name, CSS class, or ID so that the AC of T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles is satisfied. Since this only affects the extract property of the response, any change to the whitelist corresponds to only a minor version bump.

@GWicke: I'm not entirely sure how RESTBase migrates from one version of the response to another, but it seems as if we'll need to store both the untransformed and transformed textextract responses so that we can migrate between versions easily, i.e.

function migrate( response, _mostRecentVersion, targetVersion ) {
  const extract = response.rawExtract; // Some property that isn't sent to the client as part of the response.
  let whitelist;

  switch ( targetVersion ) {
    case 'v2.2.0':
      whitelist = [ '.foo', '.bar', '#baz' ];
      break;

    case 'v2.1.0':
      whitelist = [ '.foo', '.bar', '#baz', '.qux.quux' ]; // More permissive than v2.2.0, for example.
      break;

    default:
      whitelist = [ '.foo', '.bar' ];
  }

  response.extract = transformExtract( extract, whitelist );

  return response;
}

Related Objects

Event Timeline

@ovasileva several of us had a chat today and we've proposed this and T165018 for the next sprint as a way of beginning this work. Contact me directly (irc/mail) if you have any questions!

@Jhernandez points out that this should be done by sending the RESTBase Page Summary endpoint a new Accept header referencing the new profile, e.g. /wiki/Specs/Summary/2.0.0.

For clarification, are you planning to migrate the summary end point to return HTML summaries, or are you proposing to add a separate plain-text summary end point?

I could see a general migration to HTML work well, as converting HTML to plain text is relatively straightforward in clients. If backwards compatibility is a concern, then we could use the versioning mechanism to continue to provide plain-text summary responses to old clients.

For clarification, are you planning to migrate the summary end point to return HTML summaries, or are you proposing to add a separate plain-text summary end point?

The former, ideally. IIRC Web and Apps folk both want to consume an endpoint that returns a string of well-formed but strictly limited HTML. What we'd need here is guidance/examples from Services about how we'd switch on the content type header etc.

What we'd need here is guidance/examples from Services about how we'd switch on the content type header etc.

The basic summary of the content format versioning strategy is at https://www.mediawiki.org/wiki/API_versioning#Content_format_stability_and_-negotiation.

This particular change would potentially break old clients expecting plain-text summary content (v1.1.2 or earlier), so a new major version ending in /wiki/Specs/Summary/2.0.0 (as you propose) makes sense. Assuming current clients are indeed sending Accept headers asking for 1.x, we can continue to return conforming responses by stripping the HTML from the 2.0 responses in a RB migration handler. Stored 1.x content will be updated / re-rendered as 2.x. Clients that do not send Accept headers will receive the latest (2.x) spec.
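A sketch of how the profile-based negotiation described above might work. The function names are hypothetical and the real RESTBase handler is more involved:

```javascript
// Hypothetical sketch: derive the requested major spec version from the
// profile parameter of the Accept header; no profile means latest (2.x).
function requestedMajorVersion( acceptHeader ) {
  const match = /Specs\/Summary\/(\d+)\./.exec( acceptHeader || '' );
  return match ? parseInt( match[ 1 ], 10 ) : 2;
}

// Clients pinned to 1.x get a response with the HTML stripped out.
function needsHtmlStripping( acceptHeader ) {
  return requestedMajorVersion( acceptHeader ) < 2;
}
```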

Does this make sense to you, or is there anything I am missing?

Does this make sense to you, or is there anything I am missing?

Thanks! This makes sense. I think this is the (technically) correct approach, and it comes with the cost of having to write the migration handler. So how do we get started?

So how do we get started?

We don't have a per-entrypoint migration handler / filter to use as an example yet. The closest we have is the generic content-type filter, which repeats requests with no-cache set when storage returns content from an older spec. The new migration filters will have a similar structure, but will do some end point specific massaging instead.
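The re-render-on-stale pattern described here could be sketched as follows. The names and response shapes are hypothetical, not the actual RESTBase API:

```javascript
// Hypothetical sketch: if stored content carries an older content type,
// repeat the request with no-cache set so it is re-rendered under the
// current spec; otherwise serve the stored response as-is.
async function getWithMigration( fetchStored, rerender, currentSpec ) {
  const stored = await fetchStored();
  if ( stored.contentType.indexOf( currentSpec ) === -1 ) {
    return rerender( { 'cache-control': 'no-cache' } );
  }
  return stored;
}
```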

I think it makes sense for us (services) to take on the task of writing the migration filter. This is an opportunity for us to create a nice template for this type of task, which will become fairly common as end points get older.

On your end, it might make sense to consider whether you could bundle any other major upcoming changes into the content type bump, so that we don't need to bump the major version too often. We'll also need a bit of documentation on which textextracts request parameters to change for 2.0.

@Jdlrobson: Let's discuss this during kickoff ^

For the reading team to do here:

  • Specify/discuss any other major upcoming changes to bundle into the content type bump
  • Document which textextracts request parameters to change for 2.0

We talked about this in the sprint, but we were not all convinced it was a 3-pointer, as we still need to speak to Services. We'll leave this in "needs analysis" until that has happened.

@Mholloway, @Fjalapeno: In order to do this – introduce a new version of the summary endpoint while maintaining compatibility with existing clients – we need to make sure that the apps include profile="https://www.mediawiki.org/wiki/Specs/Summary/1.1.2" in the Accept header of their requests to the endpoint. This is to make sure that you don't receive newer content, which will include HTML in the extract.
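Concretely, the request header would carry the profile. The exact media type parameters here are a sketch based on the profile URL above:

```javascript
// Sketch: an Accept header pinning a client to the 1.1.2 response shape.
const headers = {
  Accept: 'application/json; charset=utf-8; ' +
    'profile="https://www.mediawiki.org/wiki/Specs/Summary/1.1.2"'
};
```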

Change 354136 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[apps/android/wikipedia@master] Set an accept: header on RESTBase summary endpoint requests

https://gerrit.wikimedia.org/r/354136

Change 354136 merged by jenkins-bot:
[apps/android/wikipedia@master] Set an accept: header on RESTBase summary endpoint requests

https://gerrit.wikimedia.org/r/354136

@GWicke: When will Services be able to pick this up? Is it blocked on the iOS team checking whether the app is sending the correct Accept header?

Can we create a list of summary usages and see what happens to them if we start returning the HTML version by default?

Eventually we want to return the latest version by default and store the latest version. It would be interesting to see what happens with older iOS/Android apps that don't use the Accept header if they start seeing HTML in the summary. cc @Fjalapeno @bearND

In case returning the latest version by default breaks older clients, we should treat this change as backwards-incompatible: return the older version when the Accept header is absent, and switch only once the amount of traffic without the header is very small. In that case we may want to fold T164291 into version 2 to kill two birds with one major version bump.

@GWicke: When will Services be able to pick this up? Is it blocked on the iOS team checking whether the app is sending the correct Accept header?

There is a question one comment up regarding what to return by default. We need an answer to that before deploying anything. Regarding the migration code, I've created a subtask, T166163, to track my part of the work. ETA: later this week.

In case returning the latest version by default breaks older clients, we should treat this change as backwards-incompatible: return the older version when the Accept header is absent, and switch only once the amount of traffic without the header is very small. In that case we may want to fold T164291 into version 2 to kill two birds with one major version bump.

To me, this seems like the best route to introducing this service sooner rather than later while maintaining compatibility with older clients that aren't behaving well.

Another option for keeping backwards compat while avoiding violating the expectations set up in the API versioning document might be to temporarily look at the user agent string. When the request looks like those sent by older Android app versions (based on the user agent), we would rewrite the request in Varnish to ask for the old content type explicitly.

Any updates on the new RESTBase endpoint?

@pmiazga We're having a meeting tomorrow to discuss migration strategy. I've sent you an invite if you're interested.

Great, thanks for the invitation; of course I'll join.

Change 356429 had a related patch set uploaded (by BearND; owner: BearND):
[apps/android/wikipedia@master] Update accept header for summary endpoint

https://gerrit.wikimedia.org/r/356429

Change 356429 merged by jenkins-bot:
[apps/android/wikipedia@master] Update accept header for summary endpoint

https://gerrit.wikimedia.org/r/356429

A new extract_html property is added by this PR.

To be deployed on Monday June 05. We can't force-purge Varnish, so for 2 weeks some of the responses might still not have the new property. Please plan accordingly.

Mentioned in SAL (#wikimedia-operations) [2017-06-02T14:25:46Z] <mobrovac@tin> Started deploy [restbase/deploy@4b14527]: Add the extract_html property to the summary end point for T165017

Mentioned in SAL (#wikimedia-operations) [2017-06-02T14:32:29Z] <mobrovac@tin> Finished deploy [restbase/deploy@4b14527]: Add the extract_html property to the summary end point for T165017 (duration: 06m 43s)

mobrovac assigned this task to Pchelolo.
mobrovac edited projects, added Services (done); removed Services (watching).

The summary end point now contains the extract_html property. Note that for page summaries that are in cache and haven't been modified, the change will become visible only 2 weeks from now.
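During the two-week transition, clients reading the response can guard on the new property. A minimal sketch with a hypothetical helper name:

```javascript
// Sketch: prefer the new extract_html property, falling back to the plain
// extract for cached responses that don't carry it yet.
function getDisplayExtract( summary ) {
  return summary.extract_html || summary.extract;
}
```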

@Pchelolo @mobrovac thanks for getting this implemented so quickly 💪