Review Summary 2.0 Spec
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Fjalapeno
	Jul 5 2017, 3:48 PM

Description

Reading web is developing a spec for the new summary API.

@phuedx has documented the current spec here:
https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API

Reading Infrastructure should review the spec and ensure there are no major issues about the proposed changes. Apps Tech Leads should also take a quick pass in case anything is missing or breaking.

Note: Some there items will be added in regards to titles (see T164291) as well as structured information about disambiguation pages. Also the notion of "type" is still being worked on.

Related Objects
Search...

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	None	T177425 Develop General Layer of PCS
Resolved	• Jhernandez	T177426 Develop structured JSON APIs for general consumption
Resolved	• Mholloway	T177431 Develop a Summary JSON API
Resolved	Dereckson	T68374 Enable Hovercards on se.wikimedia.org (Swedish chapter wiki)
Resolved	Jdlrobson	T70860 [GOAL] Graduate Page Previews feature (Popups extension) out of Beta Feature
Resolved	ovasileva	T154635 [EPIC] Deploy page previews to English and German Wikipedia
Resolved	ovasileva	T192622 [EPIC] Page previews post-deploy cleanup
Resolved	Jdlrobson	T173952 Remove A/B testing instrumentation code
Duplicate	None	T167433 Switch all projects to the new (and yet to be built) summary-html endpoint for page previews
Duplicate	None	T167429 Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint
Resolved	ovasileva	T165018 Page previews can consume new summary-HTML endpoint
Declined	Jdlrobson	T111329 [GOAL] Page previews on mobileweb
Resolved	Jdlrobson	T164010 [EPIC] Strengthen the APIs we provide in reading web maintained extensions
Resolved	ovasileva	T113094 [EPIC] The Page Summary API needs to provide useful content for the majority of articles
Resolved	• Fjalapeno	T167022 Inventory requirements for summary 2.0 endpoint across Reading teams
Resolved	• bearND	T169761 Review Summary 2.0 Spec

Event Timeline

• Fjalapeno created this task.Jul 5 2017, 3:48 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 5 2017, 3:48 PM

• Fjalapeno added a parent task: T167022: Inventory requirements for summary 2.0 endpoint across Reading teams.Jul 5 2017, 3:48 PM

Jdlrobson moved this task from Incoming to Tracking on the Web-Team-Backlog board.Jul 5 2017, 3:49 PM

Jdlrobson edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.

@bearND can you take a quick look at this spec this week and note any issues you see with @phuedx's proposal?

I might be missing some context from earlier conversations here, though hopefully I'm mostly up to speed after last week's Reading API/Services sync. Is this new page summary API intended to supplement the existing API, or replace it? My recollection from last meeting is that supplementing the existing summary output and updating the existing extract_html logic was at least one viable option. The native apps generally have no use for HTML markup and would want to continue using separate plain text summaries.

(As an aside, I've been sitting out of the sync meetings the past couple of quarters to focus on Android work, but given the amount of potentially breaking work going on lately it seems advisable at this point for me to start attending regularly again.)

The new "type" property and the formal response/code definitions look good. I have some concerns around redirect treatment but would need to do some further investigation into what's happening on both server and client now.

@Mholloway Yeah sorry, thought you had some more context… But yeah I think you got the gist.

It is a replacement. So it needs to serve the needs of the current summary end point. If you have any new use cases that it should serve, that is good to talk about too.

As far as the html or plain text… the apps are only using plain text so you only have to think about the transforms that apply to plain text (like parenthetical stripping).

I THINK the redirect behavior that is listed is what already happens in RB but yeah, check for sure.

phuedx updated the task description. (Show Details)Jul 6 2017, 9:01 AM

I see. Here's my understanding of where we are today: in too many cases, TextExtracts-based summaries provide content that is broken or otherwise not sufficient to meet the product requirements of Page Previews on the mobile web. Rather than trying to overhaul TextExtracts, a new service is being purpose-built to support Page Previews. This will likely be a standalone service (or an addition to MCS) rather than being served directly from RESTBase because of the amount of internal processing (i.e., content-munging) needed.

As long as there's a plaintext intro property for the apps similar to what we have today (which I see noted here), all of this looks good to me.

pmiazga subscribed.Jul 6 2017, 10:05 PM

• bearND updated the task description. (Show Details)Jul 10 2017, 3:24 PM

Jdlrobson moved this task from Untriaged to Move to Backlog on the Web-Team-Backlog (Tracking) board.Jul 10 2017, 3:50 PM

Flattening inline elements

@phuedx I did a minor edit to simplify the example. Feel free to revert if this doesn't capture what you were thinking.

Responses

Name	Type	Description
title	String	The URL-encoded title of the page, e.g. File%3AIgorrr_(band).jpg
...
intro	String	The intro of the page represented as well-formed HTML5.
plaintext_intro	String	The intro of the page represented as plaintext.
...
wikidata_label	?String	The label of the Wikidata item.

The title property in most other places is not URL-encoded. I have not seen anything in T164291 to indicate that this should change. Example: https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACollage_of_Nine_Dogs.jpg should be File:Collage_of_Nine_Dogs.jpg but not File%3ACollage_of_Nine_Dogs.jpg. I think it probably doesn't hurt (need to check with the apps to be sure) but wanted to point out that this is a new thing.
plaintext_intro is a new property that the apps don't use yet and an alternative to the HTML version of intro the web is going to use. I'm not sure if this is needed if we don't have any actual users for this property.
Isn't wikidata_label usually the same as normalized_title? What is it going to be used for? When would it be different from normalized_title? If this is only set for Wikidata then that's fine. I just haven't seen anything that says that explicitly.
The Image type properties (thumbnail and original) are likely going to change in the future to use the new Thumbnail API once that is available (see T66214).

In T169761#3421519, @bearND wrote:

The title property in most other places is not URL-encoded. I have not seen anything in T164291 to indicate that this should change. Example: https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACollage_of_Nine_Dogs.jpg should be File:Collage_of_Nine_Dogs.jpg but not File%3ACollage_of_Nine_Dogs.jpg. I think it probably doesn't hurt (need to check with the apps to be sure) but wanted to point out that this is a new thing.

See the second comment in T164291#3253776. Having just re-read it though, I'm guess we don't want to URL-encode the namespace part. I'll retract the requirement for now until that's clarified.

Also, thanks for the review @bearND!

In T169761#3421556, @phuedx wrote:

See the second comment in T164291#3253776. Having just re-read it though, I'm guess we don't want to URL-encode the namespace part. I'll retract the requirement for now until that's clarified.

On second thought, the URL-encoding sounds good to me. Thank you for the pointer. We just need to make sure that the Android and the iOS app don't break.

@phuedx @bearND the url encoding discussion is interesting.

I think not having to do a transform on the client is a good idea BUT I don't like that it is ambiguous for clients about whether these properties are url encoded or not.

I have a proposal:

Add a new property: URL.
…and give the full URL to the resource. There is no ambiguity to whether it is URL encoded. It obviously is. This is also RESTful. And now the titles can just be titles.

Also this removes one more level of indirection… clients don't have to construct a URL, they already have it.

• Fjalapeno moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Jul 10 2017, 5:45 PM

In T169761#3421832, @Fjalapeno wrote:

@phuedx @bearND the url encoding discussion is interesting.
I have a proposal:

Add a new property: URL.
…and give the full URL to the resource. There is no ambiguity to whether it is URL encoded. It obviously is. This is also RESTful. And now the titles can just be titles.

As you note, this is in line with other RESTful or hypermedia APIs. Would there ever be a case where you wanted to return both the non-localized and localized URL for the resource, e.g. Main_Page is Portada on cawiki?

The advantage of keeping this separate is that clients can chose which endpoint they want to use: summary vs read-html vs mobile-sections ... , or build a talk page or other namespaced URL, or use the MW API for things that are not supported by the REST API (editing, etc.).

This last example actually leads me to tend towards not URL-encoding the title property by the API. I also checked the Android app code. Before it makes a call to request something about a title it URL-encodes it. In the case of C++ it would then have that string doubly encoded, which would fail. C++ -> C%2B%2B -> C%252B%252B. If we did URL-encode this in the API we would have to mitigate that for the Android app somehow.

The advantage of keeping this separate is that clients can chose which endpoint they want to use: summary vs read-html vs mobile-sections ... , or build a talk page or other namespaced URL, or use the MW API for things that are not supported by the REST API (editing, etc.).

That's a good point. I think the main thing here is that we make it ABSOLUTELY clear if something in URL encoded or not.

Even if we have url encoded title for constructing other urls, that would be good
I would argue a URL for the resource itself is useful, regardless. If you want to return other properties as URL encoded name, you could do something like:

title:"File:Collage_of_Nine_Dogs.jpg"
url_encoded_title:"File%3ACollage_of_Nine_Dogs.jpg"
normalized_title:"Collage of Nine Dogs.jpg"
display_title: "<strong>Collage of Nine Dogs.jpg</strong>", 
url:"https://commons.wikimedia.org/wiki/File%3ACollage_of_Nine_Dogs.jpg"

It also occurs to me that we are still making the clients do string manipulation and have "knowledge" of the root URLs for different APIs. To make this absolute, we could also take it a step further and make the response fully descriptive and let clients know exactly how to obtain different representations of this resource:

title:"File:Collage_of_Nine_Dogs.jpg"
url_encoded_title:"File%3ACollage_of_Nine_Dogs.jpg"
normalized_title:"Collage of Nine Dogs.jpg"
display_title: "<strong>Collage of Nine Dogs.jpg</strong>"
resource_url:"https://commons.wikimedia.org/wiki/File%3ACollage_of_Nine_Dogs.jpg"
talk_url:"https://commons.wikimedia.org/wiki/File_talk%3ACollage_of_Nine_Dogs.jpg"
summary_url:"https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACollage_of_Nine_Dogs.jpg"
read_html_url:"https://commons.wikimedia.org/api/rest_v1/page/read_html/File%3ACollage_of_Nine_Dogs.jpg"
edit_html_url:"https://commons.wikimedia.org/api/rest_v1/page/html/File%3ACollage_of_Nine_Dogs.jpg"

Also… I think its important to look at the talk URL for this one. Clients would need to know the special rule for file talk pages… it is "File_talk:" not "Talk:". So providing this information to the client removes the need for clients to implement all kinds of special logic in constructing urls.

Is this a lot of information to deliver? Yes. But that's because we have a complex system - we deal with many languages and don't have the benefit of consistent global ids (page ids just don't work). I'm also less concerned with size than I am with being able to reliably construct urls without bugs for both us and 3rd party clients (And gzip compression will take care of much of this due to the duplicative nature of the data).

A general +1 to T169761#3422369.

Further to the idea of including a "self link" in the response, I think this SO answer covers the reasoning behind it quite well.

To bikeshed just a little more, we might consider grouping these properties to make the response a little saner. There are a lot of properties with repeated suffixes, which to me implies a collection, e.g.

talk_url
summary_url
read_html_url
edit_html_url

Could become:

urls: {
  talk: "...",
  summary: "...",
  read_html: "...",
  edit_html: "..."
}

Similarly, *_title could be collected under the titles property.

In T169761#3421519, @bearND wrote:

plaintext_intro is a new property that the apps don't use yet and an alternative to the HTML version of intro the web is going to use. I'm not sure if this is needed if we don't have any actual users for this property.

Do the apps currently consume the RESTBase page summary endpoint? If so, they're expecting the extract property to be plaintext, right? Perhaps I should rename the property.

In T169761#3421519, @bearND wrote:

Isn't wikidata_label usually the same as normalized_title? What is it going to be used for? When would it be different from normalized_title? If this is only set for Wikidata then that's fine. I just haven't seen anything that says that explicitly.

For https://www.wikidata.org/wiki/Q1, the normalized_title would be "Q1" and the wikidata_label would be "universe". See https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&prop=info%7Cpageterms&list=&titles=Q1&inprop=&wbptterms=label%7Cdescription for a f'rinstance.

You're right that this property should only be set when the type property is "wikidata_preview". I'll update the spec.

Do the apps currently consume the RESTBase page summary endpoint? If so, they're expecting the extract property to be plaintext, right? Perhaps I should rename the property.

@phuedx yeah the apps currently the summary end point (and the feed endpoints use the summary end point internally).

The apps only use plaintext for now as was alluded to by @Mholloway. For the apps, the big thing about the plaintext version is that we get the same output as the HTML version of the intro but without styling and links (but with all the nice stripping of parenthetical content, etc…).

As far as property naming and grouping, that all sounds good to me… the big thing we will have to deal with is migration for the apps. We do know that the title changes are breaking so it might not matter if we do anymore breaking changes TBH… but we need to come up with the migration strategy first and then work backwards.

@JoeWalsh can you verify that the iOS is using the versioning header for the summary API calls?

FYI: here is a ticket for adding ORES scores to the summary:
T157132: Add ORES articlequality data to summaries?

Just to be on the radar

Thanks for explanation of the wikidata case. That makes sense now.

+1 to grouping the urls. In the long term I'm in favor of grouping the titles, too. That is backwards compatibility breaking; same for extract -> plaintext_intro.

@Fjalapeno looks like no - only some of the requests have the proper Accept header

Thanks @JoeWalsh!

@phuedx Joe created a ticket to add this here: T170291: Add `Accept` header to all article summary requests

• Fjalapeno mentioned this in T167022: Inventory requirements for summary 2.0 endpoint across Reading teams.Jul 12 2017, 3:10 PM

@phuedx noticed that the "description" property is missing from the spec (wikidata description)

@phuedx and also pageid

Some bikeshedding on the "type" (some we talked about in a meeting, some new):

The notional type of the intro. One of "generic", "disambiguation", "wikidata_preview", or "preview".

I'd probably remove the "preview" suffix from wikidata since they are all previews

Also I'd change the name of "preview" to smoothing like "standard"

Also, "generic" I feel like needs a good example still. Maybe you said you would remove that, I don't remember.

If it is just to say "there is no intro section, that seems unnecessary as the info section would be blank anyways. And clients could just decide how to render any preview that comes back with a blank intro section without expressing this as a type I think

In T169761#3431177, @Fjalapeno wrote:

@phuedx noticed that the "description" property is missing from the spec (wikidata description)

Quoting §6.1 of the spec:

The intro property of the response must be set to the item's description.

In T169761#3431178, @Fjalapeno wrote:

@phuedx and also pageid

Done.

Quoting §6.1 of the spec:

The intro property of the response must be set to the item's description.

@phuedx The apps use the wikidata description of ALL pages for display (depending on context).
So in your new response, when you have type="preview", the apps would still display the wikidata description separately.
For example, here you can see the wikidata description displayed AND an extract:

IMG_1704DF546657-1.jpeg (2×1 px, 600 KB)

Does this make sense, or am I misinterpreting your doc?

In T169761#3431233, @Fjalapeno wrote:

Some bikeshedding on the "type" (some we talked about in a meeting, some new):

I think the bikeshed's lookin' pretty darn fine…

The notional type of the intro. One of "generic", "disambiguation", "wikidata_preview", or "preview".

I'd probably remove the "preview" suffix from wikidata since they are all previews

Done.

Also I'd change the name of "preview" to smoothing like "standard"

Done.

Also, "generic" I feel like needs a good example still. Maybe you said you would remove that, I don't remember.

If it is just to say "there is no intro section, that seems unnecessary as the info section would be blank anyways. And clients could just decide how to render any preview that comes back with a blank intro section without expressing this as a type I think

That's a good point. I've removed it. Done!

• Fjalapeno mentioned this in T151241: Surface disambiguation page flag in response.Jul 13 2017, 2:30 PM

• Fjalapeno moved this task from Doing to Sign off on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.

phuedx mentioned this in T168400: [EPIC] Add notion of emptiness to page summary API.Jul 14 2017, 9:43 AM

phuedx mentioned this in T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles.Jul 14 2017, 10:07 AM

• Fjalapeno mentioned this in T170692: Return common URLs in summary API so clients do not have to perform bug prone string manipulation .Jul 14 2017, 4:29 PM

• Fjalapeno added a project: Reading Epics (Page Previews).

• Fjalapeno closed this task as Resolved.Oct 6 2017, 4:06 PM