Page MenuHomePhabricator

Review Summary 2.0 Spec
Closed, ResolvedPublic

Description

Reading web is developing a spec for the new summary API.

@phuedx has documented the current spec here:
https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API

Reading Infrastructure should review the spec and ensure there are no major issues about the proposed changes. Apps Tech Leads should also take a quick pass in case anything is missing or breaking.

Note: Some there items will be added in regards to titles (see T164291) as well as structured information about disambiguation pages. Also the notion of "type" is still being worked on.

Related Objects

Event Timeline

@bearND can you take a quick look at this spec this week and note any issues you see with @phuedx's proposal?

I might be missing some context from earlier conversations here, though hopefully I'm mostly up to speed after last week's Reading API/Services sync. Is this new page summary API intended to supplement the existing API, or replace it? My recollection from last meeting is that supplementing the existing summary output and updating the existing extract_html logic was at least one viable option. The native apps generally have no use for HTML markup and would want to continue using separate plain text summaries.

(As an aside, I've been sitting out of the sync meetings the past couple of quarters to focus on Android work, but given the amount of potentially breaking work going on lately it seems advisable at this point for me to start attending regularly again.)

The new "type" property and the formal response/code definitions look good. I have some concerns around redirect treatment but would need to do some further investigation into what's happening on both server and client now.

@Mholloway Yeah sorry, thought you had some more context… But yeah I think you got the gist.

It is a replacement. So it needs to serve the needs of the current summary end point. If you have any new use cases that it should serve, that is good to talk about too.

As far as the html or plain text… the apps are only using plain text so you only have to think about the transforms that apply to plain text (like parenthetical stripping).

I THINK the redirect behavior that is listed is what already happens in RB but yeah, check for sure.

I see. Here's my understanding of where we are today: in too many cases, TextExtracts-based summaries provide content that is broken or otherwise not sufficient to meet the product requirements of Page Previews on the mobile web. Rather than trying to overhaul TextExtracts, a new service is being purpose-built to support Page Previews. This will likely be a standalone service (or an addition to MCS) rather than being served directly from RESTBase because of the amount of internal processing (i.e., content-munging) needed.

As long as there's a plaintext intro property for the apps similar to what we have today (which I see noted here), all of this looks good to me.

Flattening inline elements

@phuedx I did a minor edit to simplify the example. Feel free to revert if this doesn't capture what you were thinking.

Responses

NameTypeDescription
titleStringThe URL-encoded title of the page, e.g. File%3AIgorrr_(band).jpg
...
introStringThe intro of the page represented as well-formed HTML5.
plaintext_introStringThe intro of the page represented as plaintext.
...
wikidata_label?StringThe label of the Wikidata item.
  • The title property in most other places is not URL-encoded. I have not seen anything in T164291 to indicate that this should change. Example: https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACollage_of_Nine_Dogs.jpg should be File:Collage_of_Nine_Dogs.jpg but not File%3ACollage_of_Nine_Dogs.jpg. I think it probably doesn't hurt (need to check with the apps to be sure) but wanted to point out that this is a new thing.
  • plaintext_intro is a new property that the apps don't use yet and an alternative to the HTML version of intro the web is going to use. I'm not sure if this is needed if we don't have any actual users for this property.
  • Isn't wikidata_label usually the same as normalized_title? What is it going to be used for? When would it be different from normalized_title? If this is only set for Wikidata then that's fine. I just haven't seen anything that says that explicitly.
  • The Image type properties (thumbnail and original) are likely going to change in the future to use the new Thumbnail API once that is available (see T66214).

See the second comment in T164291#3253776. Having just re-read it though, I'm guess we don't want to URL-encode the namespace part. I'll retract the requirement for now until that's clarified.

See the second comment in T164291#3253776. Having just re-read it though, I'm guess we don't want to URL-encode the namespace part. I'll retract the requirement for now until that's clarified.

On second thought, the URL-encoding sounds good to me. Thank you for the pointer. We just need to make sure that the Android and the iOS app don't break.

@phuedx @bearND the url encoding discussion is interesting.

I think not having to do a transform on the client is a good idea BUT I don't like that it is ambiguous for clients about whether these properties are url encoded or not.

I have a proposal:

Add a new property: URL.
…and give the full URL to the resource. There is no ambiguity to whether it is URL encoded. It obviously is. This is also RESTful. And now the titles can just be titles.

Also this removes one more level of indirection… clients don't have to construct a URL, they already have it.

@phuedx @bearND the url encoding discussion is interesting.
I have a proposal:

Add a new property: URL.
…and give the full URL to the resource. There is no ambiguity to whether it is URL encoded. It obviously is. This is also RESTful. And now the titles can just be titles.

As you note, this is in line with other RESTful or hypermedia APIs. Would there ever be a case where you wanted to return both the non-localized and localized URL for the resource, e.g. Main_Page is Portada on cawiki?

The advantage of keeping this separate is that clients can chose which endpoint they want to use: summary vs read-html vs mobile-sections ... , or build a talk page or other namespaced URL, or use the MW API for things that are not supported by the REST API (editing, etc.).

This last example actually leads me to tend towards not URL-encoding the title property by the API. I also checked the Android app code. Before it makes a call to request something about a title it URL-encodes it. In the case of C++ it would then have that string doubly encoded, which would fail. C++ -> C%2B%2B -> C%252B%252B. If we did URL-encode this in the API we would have to mitigate that for the Android app somehow.

The advantage of keeping this separate is that clients can chose which endpoint they want to use: summary vs read-html vs mobile-sections ... , or build a talk page or other namespaced URL, or use the MW API for things that are not supported by the REST API (editing, etc.).

That's a good point. I think the main thing here is that we make it ABSOLUTELY clear if something in URL encoded or not.

Even if we have url encoded title for constructing other urls, that would be good
I would argue a URL for the resource itself is useful, regardless. If you want to return other properties as URL encoded name, you could do something like:

title:"File:Collage_of_Nine_Dogs.jpg"
url_encoded_title:"File%3ACollage_of_Nine_Dogs.jpg"
normalized_title:"Collage of Nine Dogs.jpg"
display_title: "<strong>Collage of Nine Dogs.jpg</strong>", 
url:"https://commons.wikimedia.org/wiki/File%3ACollage_of_Nine_Dogs.jpg"

It also occurs to me that we are still making the clients do string manipulation and have "knowledge" of the root URLs for different APIs. To make this absolute, we could also take it a step further and make the response fully descriptive and let clients know exactly how to obtain different representations of this resource:

title:"File:Collage_of_Nine_Dogs.jpg"
url_encoded_title:"File%3ACollage_of_Nine_Dogs.jpg"
normalized_title:"Collage of Nine Dogs.jpg"
display_title: "<strong>Collage of Nine Dogs.jpg</strong>"
resource_url:"https://commons.wikimedia.org/wiki/File%3ACollage_of_Nine_Dogs.jpg"
talk_url:"https://commons.wikimedia.org/wiki/File_talk%3ACollage_of_Nine_Dogs.jpg"
summary_url:"https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACollage_of_Nine_Dogs.jpg"
read_html_url:"https://commons.wikimedia.org/api/rest_v1/page/read_html/File%3ACollage_of_Nine_Dogs.jpg"
edit_html_url:"https://commons.wikimedia.org/api/rest_v1/page/html/File%3ACollage_of_Nine_Dogs.jpg"

Also… I think its important to look at the talk URL for this one. Clients would need to know the special rule for file talk pages… it is "File_talk:" not "Talk:". So providing this information to the client removes the need for clients to implement all kinds of special logic in constructing urls.

Is this a lot of information to deliver? Yes. But that's because we have a complex system - we deal with many languages and don't have the benefit of consistent global ids (page ids just don't work). I'm also less concerned with size than I am with being able to reliably construct urls without bugs for both us and 3rd party clients (And gzip compression will take care of much of this due to the duplicative nature of the data).

A general +1 to T169761#3422369.

Further to the idea of including a "self link" in the response, I think this SO answer covers the reasoning behind it quite well.


To bikeshed just a little more, we might consider grouping these properties to make the response a little saner. There are a lot of properties with repeated suffixes, which to me implies a collection, e.g.

talk_url
summary_url
read_html_url
edit_html_url

Could become:

urls: {
  talk: "...",
  summary: "...",
  read_html: "...",
  edit_html: "..."
}

Similarly, *_title could be collected under the titles property.

  • plaintext_intro is a new property that the apps don't use yet and an alternative to the HTML version of intro the web is going to use. I'm not sure if this is needed if we don't have any actual users for this property.

Do the apps currently consume the RESTBase page summary endpoint? If so, they're expecting the extract property to be plaintext, right? Perhaps I should rename the property.

  • Isn't wikidata_label usually the same as normalized_title? What is it going to be used for? When would it be different from normalized_title? If this is only set for Wikidata then that's fine. I just haven't seen anything that says that explicitly.

For https://www.wikidata.org/wiki/Q1, the normalized_title would be "Q1" and the wikidata_label would be "universe". See https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&prop=info%7Cpageterms&list=&titles=Q1&inprop=&wbptterms=label%7Cdescription for a f'rinstance.

You're right that this property should only be set when the type property is "wikidata_preview". I'll update the spec.

Do the apps currently consume the RESTBase page summary endpoint? If so, they're expecting the extract property to be plaintext, right? Perhaps I should rename the property.

@phuedx yeah the apps currently the summary end point (and the feed endpoints use the summary end point internally).

The apps only use plaintext for now as was alluded to by @Mholloway. For the apps, the big thing about the plaintext version is that we get the same output as the HTML version of the intro but without styling and links (but with all the nice stripping of parenthetical content, etc…).

As far as property naming and grouping, that all sounds good to me… the big thing we will have to deal with is migration for the apps. We do know that the title changes are breaking so it might not matter if we do anymore breaking changes TBH… but we need to come up with the migration strategy first and then work backwards.

@JoeWalsh can you verify that the iOS is using the versioning header for the summary API calls?

FYI: here is a ticket for adding ORES scores to the summary:
T157132: Add ORES articlequality data to summaries?

Just to be on the radar

Thanks for explanation of the wikidata case. That makes sense now.

+1 to grouping the urls. In the long term I'm in favor of grouping the titles, too. That is backwards compatibility breaking; same for extract -> plaintext_intro.

@Fjalapeno looks like no - only some of the requests have the proper Accept header

@phuedx noticed that the "description" property is missing from the spec (wikidata description)

Some bikeshedding on the "type" (some we talked about in a meeting, some new):

The notional type of the intro. One of "generic", "disambiguation", "wikidata_preview", or "preview".

I'd probably remove the "preview" suffix from wikidata since they are all previews

Also I'd change the name of "preview" to smoothing like "standard"

Also, "generic" I feel like needs a good example still. Maybe you said you would remove that, I don't remember.

If it is just to say "there is no intro section, that seems unnecessary as the info section would be blank anyways. And clients could just decide how to render any preview that comes back with a blank intro section without expressing this as a type I think

@phuedx noticed that the "description" property is missing from the spec (wikidata description)

Quoting §6.1 of the spec:

The intro property of the response must be set to the item's description.

Quoting §6.1 of the spec:

The intro property of the response must be set to the item's description.

@phuedx The apps use the wikidata description of ALL pages for display (depending on context).
So in your new response, when you have type="preview", the apps would still display the wikidata description separately.
For example, here you can see the wikidata description displayed AND an extract:

IMG_1704DF546657-1.jpeg (2×1 px, 600 KB)

Does this make sense, or am I misinterpreting your doc?

Some bikeshedding on the "type" (some we talked about in a meeting, some new):

I think the bikeshed's lookin' pretty darn fine…

The notional type of the intro. One of "generic", "disambiguation", "wikidata_preview", or "preview".

I'd probably remove the "preview" suffix from wikidata since they are all previews

Done.

Also I'd change the name of "preview" to smoothing like "standard"

Done.

Also, "generic" I feel like needs a good example still. Maybe you said you would remove that, I don't remember.

If it is just to say "there is no intro section, that seems unnecessary as the info section would be blank anyways. And clients could just decide how to render any preview that comes back with a blank intro section without expressing this as a type I think

That's a good point. I've removed it. Done!