Page MenuHomePhabricator

Cached REST endpoint for extracts requests
Closed, ResolvedPublic

Description

The extracts endpoint is one of the most popular query endpoint in MediaWiki PHP API with on average 450 req/s. The information returned by the endpoint is fairly static and cacheable. The cache invalidation conditions are the same as for Parsoid html data, and are already tracked by RESTBase. With Varnish-cachable REST API we could provide significant improvement in response latency and lower the overall load on the PHP API.

However, the PHP endpoint provides multiple options, which makes it hard to convert it to REST. This task is created to discuss which of the options are useful, and which could be dropped.

Currently, following options are supported:

  • exchars - how many characters are returned. This could be easily dropped out, as the full returned extract could be easily stripped on the client. However, dropping out this option makes the client load up the whole content, which might be inefficient, if this number is normally significantly lower than the content length. In case this is true, we could provide a number of options for returned content length, like short, long, full.
  • exsentences - How many sentences to return. Mostly identical to the previous one.
  • exlimit - How many extracts to return. REST API wouldn't allow batch requests, so this option would be definitely drooped.
  • exintro - Return only content before the first section. If it's needed, this could become a separate endpoint. However, to support it, RESTBase would need to make a second request to the PHP API on invalidation, so ideally this should be dropped as well.
  • explaintext - Return extracts as plain text instead of limited HTML. In case this is used, the REST endpoint could provide a format parameter. However, we should consider returning only simplified HTML, as filtering-out all the tags is a fairly easy operation to do it on the client.
  • exsectionformat - How to format sections in plaintext mode: plain, wikitext, raw. Likely to be dropped out, as RESTBase doesn't serve wikitext, and raw doesn't seem to be useful for clients.
  • excontinue - When more results are available, use this to continue. As batch requests wouldn't be supported, this would be dropped.
  • exvariant - Convert content into this language variant. The language variant would be controlled by the request domain, so this would also be dropped.

In case none of the options are strictly required, the API could look like this:

  • /page/extract/{title} - returns the page extract for the latest title
  • /page/extract/{title}/ - lists all revision/tid pairs available in storage for text extracts
  • /page/extract/{title}{/revision}{/tid} - get an exact for a historic revision of a page

This task is created to accumulate the information about real usages of the API and decide on the options needed by the clients.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
bearND added a comment.Nov 5 2015, 6:04 PM

BTW, I'll need to update the mobile content service to bring it in line with the app request:
https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/mwapi.js#L147-L158

We do some munging of the text extracts response in the service, too. See https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/extract.js

phuedx added a comment.Nov 5 2015, 6:05 PM

@GWicke: Given that the only parameter that varies is exsentences and the above rationale for using a value of 10 seems reasonable, I think that the URLs that's included in the task's description are fine by me.

phuedx added a comment.Nov 5 2015, 6:13 PM

@bearND: It looks like there's a whole bunch of stuff in extract.js that's worth upstreaming to TextExtracts.

bearND added a comment.Nov 5 2015, 6:24 PM

@phuedx That would be wonderful. But I'm not sure how that would work since it changes current behavior. Would that require a new parameter?

bearND added a comment.EditedNov 5 2015, 6:38 PM

Looks like the app is no longer needing the Wikidata description for the link previews. We're going to add some more experiments of different link preview versions versions: T117065.
Some of the designs don't use a thumbnail (actually what's currently on master) but that could change, so I'm leaving the thumbnail in for now.

Thank you all for the feedback. From what I can see, we can definitely drop most of the endpoint options, and we're now left with just 3 parameters we still need to think about:

  • explaintext - looks like the plain format is the only one used, however it's shouldn't be too hard to provide both simple html and plain text, so it's still an open question whether to provide both.
  • exintro - @bmansurov pointed, that this parameter is used in cards, but together with exsentences. I'm quite sure this could be dropped, as it's impossible to have one sentence getting out of the first section boundaries.
  • exsentences - this one is the main issue. Android app uses this to get just enough content to run their own algorithm and recalculate the sentences, cards use it to get just 1 sentence. Ideally, that would be really bad to have a sentences parameter on the API, however, serving the whole article when only a first sentence is needed is a waist of resources and user's bandwidth. So, my proposal is to serve first 10 sentences by default, and have an optional query parameter ?full=true to serve it all. Important note, is that in case we follow this path, RESTBase will be calculating the first 10 sentences, and it will be doing it in a very naive way, by counting dots. Alternatively, we can loose a endpoint generality a bit, and always serve only N first sentences.

Some of the designs don't use a thumbnail (actually what's currently on master) but that could change, so I'm leaving the thumbnail in for now.

Just to make sure we are on the same page: the new extracts endpoint will not be serving a thumbnail, or wikidata, or anything else along with an actual extract.

Some of the designs don't use a thumbnail (actually what's currently on master) but that could change, so I'm leaving the thumbnail in for now.

Just to make sure we are on the same page: the new extracts endpoint will not be serving a thumbnail, or wikidata, or anything else along with an actual extract.

@Pchelolo: we (@GWicke, @mobrovac, @phuedx, @bmansurov, @bearND) had a brief videoconference to figure out whether there was one format we would be comfortable with. Sorry I didn't invite you - didn't realize the context of this task as being filed by you while rushing through my inbox!

The Android incantation from @bearND was okay by everyone. Are you comfortable with this?

The thinking was that we can return the thumbnail URL and Wikidata entity description (it's typically short) without having a substantial impact on the payload size, accounting for gzip compression.

bearND added a comment.EditedNov 5 2015, 10:02 PM

@dr0ptp4kt after the meeting I did some more sleuthing and found out that the Android app doesn't need the Wikidata description. So, that one can be dropped.
I'm on the fence about the thumbnail URL; well, actually leaning against it. In a couple of month we can probably say with more confidence whether we need it. (I think the Android app was the only one that had a potential need for it.) I would be ok with adding it later if the A/B test is in favor of the variant that needs it. I think we can always add more fields later, just removing fields is problematic.

I personally think that 10 sentences may be a bit on the high side, too. But I'd rather get input from @Dbrant here since he's written most of the link preview code.

@bearND, thanks. @Dbrant, is there a number less than 10 for the sentences that would be sufficient? If it needs to be 10, I think so be it.

@phuedx & @bmansurov, just to be safe, can you confirm the thumbnail isn't needed for Link Preview / Hovercards? If it isn't needed, that simplifies things, as @bearND is right we can just avoid it now and instead add it later if needed (@bearND, this implies that the mobile content service / Android client side api.php code would bear the burden for implementation using the thumbnail in the A/B test, and if it's borne out we need the extra field, the RESTbase spec would need an addition like you say).

GWicke added a comment.EditedNov 5 2015, 11:21 PM

Example query, based on @bearND's Android query minus wikidata description and thumb, per T117082#1786932.

https://en.wikipedia.org/w/api.php?action=query&prop=extracts|pageimages|pageterms&redirects=true&exsentences=10&explaintext=true&titles=Dog

Notes:

  • This doesn't set exintro currently. Adding that would remove the last sentence from the extract.
  • Response size for this example (using &format=json) is 853 bytes.

Is "pageterms" needed in the prop?

bearND added a comment.EditedNov 6 2015, 12:32 AM

@GWicke: @dr0ptp4kt is correct: you don't want to get the description then you want to also drop the prop pageterms. (The previous wbptterms=description just made sure that we only get the description out of all pageterms.) Same for prop=pageimages.

So, the query would be https://en.wikipedia.org/w/api.php?action=query&prop=extracts&redirects=true&exsentences=10&explaintext=true&titles=Dog
This results in a much smaller output.

@bearND, thanks. @Dbrant, is there a number less than 10 for the sentences that would be sufficient? If it needs to be 10, I think so be it.

The Cards extension needs just one sentence. Can we cache that too?

@phuedx & @bmansurov, just to be safe, can you confirm the thumbnail isn't needed for Link Preview / Hovercards? If it isn't needed, that simplifies things, as @bearND is right we can just avoid it now and instead add it later if needed (@bearND, this implies that the mobile content service / Android client side api.php code would bear the burden for implementation using the thumbnail in the A/B test, and if it's borne out we need the extra field, the RESTbase spec would need an addition like you say).

The current desktop implementation shows a thumbnail: https://www.mediawiki.org/wiki/Extension:Popups#/media/File:HoverCard_Inheritance.png
Ultimately, this is a product decision, so pinging @JKatzWMF.

GWicke added a comment.EditedNov 6 2015, 2:08 AM

@bearND, thanks!

The current desktop implementation shows a thumbnail

If you all can agree on a thumb size to request, then we could easily include a thumb as well. The size difference in this example is 880 bytes vs. 700 bytes without the thumb.

Regarding the number of sentences: If the main motivation for the separate end point is performance (rather than the complexity of extracting just the first sentence), then I would propose to start with the long version only, and then measure what the ideal-case performance benefits of adding a short version would be. At these response sizes (smaller than an ethernet packet), the performance effect of reducing the number of sentences for the long version it might not be noticeable. It could actually end up slowing things down by fragmenting the edge caches. If it turns out to be worth it, we can then add a short version later.

Dbrant added a comment.Nov 6 2015, 2:15 AM

is there a number less than 10 for the sentences that would be sufficient?

The link previews in Android display just two sentences, but the reason we request more than two is that we want to compensate for false-positive sentence detection by the server (for things like Jr., B.C., etc.) and handle it ourselves, so ten seemed like a comfortably conservative number. But we can probably reduce it to five or even four without much detriment.

dr0ptp4kt added a comment.EditedNov 6 2015, 5:39 AM

@Dbrant, I think your guesstimate is backed by desktop Hovercards. It uses 5 sentences (per source link from @phuedx above and network trace).

Desktop Hovercards also requests a thumbnail of width 600. This would be overkill on small form factor devices, but it seems our belief is that for small form factors we'll likely ignore the thumbnail URL anyway (i.e., no harm). So can we reinstate the idea of having the thumbnail, but at 600px so this can support desktop Hovercards out of the box?

Here's an example of desktop Hovercards:

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts|pageimages|revisions&formatversion=2&redirects=true&exintro=true&exsentences=5&explaintext=true&piprop=thumbnail&pithumbsize=600&rvprop=timestamp&titles=Master_System&smaxage=300&maxage=300&uselang=content

The prop "revisions" and rvprop=timestamp is for the last edited piece, so I guess that would be one more thing to request to make it work with desktop as is.

If we agree the basic look and feel of desktop Hovercards is okay, following this example (but removing exintro that was ruled out and the cache instructional params smaxage, maxage, and [auth'd user override] uselang because we will handle caching stuff in RESTbase, that's the point) I think that gets us everything we need for now?

I double checked the iOS app source, and the way they're doing things doesn't fit for this garden variety case (they only do random and morelike generator things with more stipulations we can't account for here). Down the road if they would potentially implement Link Preview this RESTbase endpoint would of course be available. And, going back to what @bearND said earlier about adding things, if an additional field like Wikidata description is needed (which is the prevailing form currently in the forthcoming iOS app for other cardlike things), that would be easy enough to add.

To sum up the discussion, currently we need:

  1. 10 sentences from the extracts, in plain format, without any additional parameters.
  2. A 600 pixels thumbnail
  3. A timestamp of the revision that we return - RESTBase would implicitly know that.

This is not exactly an extract, so we could call the endpoint differently, for example summary, as it returns a summary of the page content. With this naming we'd be able to add other things in there too.

I think I could implement the first prototype over next couple of days, to get a feeling for performance impact, that we have for our use-cases. The first version wouldn't contain the thumbnail, to simplify getting it running.

This is not exactly an extract, so we could call the endpoint differently, for example summary, as it returns a summary of the page content. With this naming we'd be able to add other things in there too.

This is my feeling exactly. I thought this discussion was around getting the TextExtracts API behind RESTBase, not creating a preview/summary service. It's not that the latter isn't necessary – it's required, IMO – but these are clearly two very separate concerns and consequently should be two very separate services.

This is my feeling exactly. I thought this discussion was around getting the TextExtracts API behind RESTBase, not creating a preview/summary service. It's not that the latter isn't necessary – it's required, IMO – but these are clearly two very separate concerns and consequently should be two very separate services.

That's what I meant. The question is whether an extracts endpoint alone provide any value to API consumers, or does the summary path open more opportunities. RESTBase could combine some arbitrary pieces of information (text extract and a page image for example) into a single summary endpoint pretty easily, although it's arguable RB is the perfect place for that.

The problem with this approach is in invalidation: the more data we combine, the bigger is the set of circumstances when that should be invalidated, the more frequently the invalidation happens. And as we currently invalidate stuff with by scheduling a request with cache-control: no-cache header, we would need to invalidate all the pieces, while may be only one piece should be regenerated.

An alternative approach, is to provide a separate extract endpoint, then a separate page image endpoint, and then a combined summary endpoint. This would provide more granular content regeneration.

@Pchelolo, looks like we only need 5 sentences, not 10.

I noticed you said "we need...A 600 pixels thumbnail" but then said "first version wouldn't contain the thumbnail, to simplify getting it running."

Are you referring to the thumbnail URL in the payload or the actual thumbnail image data (e.g., base64)?

I believe the thumbnail URL in the response (which comes from the pageimimages/piprop/pithumbsize parts of the URL) doesn't make things more complex, but is there some complicating factor?

For the desktop use case, the thumbnail URL is needed, whereas the image data isn't needed directly in the returned payload (because the browser will retrieve the image data via the provided thumbnail URL in a separate request; in other words, base64 encoded data doesn't need to be embedded). In the mobile use cases, if that thumbnail URL is present, it doesn't do harm (note @GWicke's point about packet size).

I agree, naming it differently is a good idea. As "Hovercards" is too specific and on mobile we're referring to it as "Link Preview", I suggest we call the endpoint preview.

With the following proposed URL format (last parameter is the critical lone title) we aren't "doing anything special" with regard to transforming the response at the server. But conceivably we could apply the additional text munging of the returned "extract" field down the road to save the client a step (Hovercards on desktop and Link Preview on Android munge the text, and mobile web will in all practicality need to do so as well in order to make the preview digestible to the user). As the four consumers (desktop web, mobile web, Android, iOS) of this specific endpoint are all managed by Reading, conceivably this wouldn't be a big deal. As a side note, any client munging that is already occurring could continue to be applied, it just would be for all intents and purposes a no-op if it had already been done at the server.

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts|pageimages|revisions&formatversion=2&redirects=true&exintro=true&exsentences=5&explaintext=true&piprop=thumbnail&pithumbsize=600&rvprop=timestamp&titles=Master_System

It seems like the pragmatic thing to do right now is standardize on this calling format.

But if we think we need an additional request format for the preview use case that doesn't have a thumbnail (e.g., to cater exclusively to the mobile use cases), this would be it:

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts|revisions&formatversion=2&redirects=true&exintro=true&exsentences=5&explaintext=true&rvprop=timestamp&titles=Master_System

@Pchelolo, I think right now we probably don't need the alternative approach of "a separate extract endpoint, then a separate page image endpoint, and then a combined summary endpoint" given the concrete use case we've been covering is about Hovercards/Link Preview (others, please, as always, please speak up if you disagree...or strongly disagree). I do in general get the potential benefits, though, of breaking things up and approaching composition at RESTbase - it's exciting!

I'm aware of some other concrete use cases in the wild. SMS/USSD support in the Global South (fulltext extracts and section by section extracts) are one area - this as I recall was the primary motivator of the extracts work in the first place. I've heard directly or implicitly others use it for text only article and section presentations for other purposes. And I anticipate internet based messaging apps will be another avenue for this in the future (again, probably something with section by section extracts), as could be alternative content packaging for reading devices. Naturally, those would need to be addressed as separate endpoints, ideally in consultation with developers invoking the endpoint.

I do think this preview thing is the most pressing thing that should be approached first. Thinking ahead, do we have a couple other evident and common canonical request/response formats involving TextExtracts that are borne out by the request logs?

Are you referring to the thumbnail URL in the payload or the actual thumbnail image data (e.g., base64)?

Thumbnail URL definitely, sorry for not being clear.

I believe the thumbnail URL in the response (which comes from the pageimimages/piprop/pithumbsize parts of the URL) doesn't make things more complex, but is there some complicating factor?

Making it either way is pretty simple, I've hacked up the code for extracts endpoint over a couple of hours today, so technically this is all pretty simple.

I agree, naming it differently is a good idea. As "Hovercards" is too specific and on mobile we're referring to it as "Link Preview", I suggest we call the endpoint preview.

So, /page/preview/{title} would do just right. However preview feels like it's implying a specific use-case for the data, so I like summary more, but I don't have any strong feeling about it. We could also add historic view of this data by adding an optional {revision} parameter, but looks like it's not needed. Omitting it could help us save some storage.

I must admit I find /page/summary more appropriate than /page/preview. The latter implies something visual (to me, at least).

bearND added a comment.Nov 6 2015, 4:36 PM

I think the thumbnail size requested should be then 640 to use one of the bucket size values to avoid fragmenting the image cache:
https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FMultimediaViewer.git/f9e7bae91a8032fa13fc68114a0d57d190ea77f9/resources%2Fmmv%2Fmmv.ThumbnailWidthCalculator.js#L69

I was thinking about adding a mobile-preview endpoint in the Mobile-Content-Service but it looks like this preview endpoint covers most of what we would need. The only thing that's missing is the infobox data (T108346), which the app is currently not using yet. I'm not sure when/if it would. If we would need that then I could still add another endpoint to the Mobile-Content-Service which adds this part.

For the response format, I think from https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts|pageimages|revisions&formatversion=2&redirects=true&exintro=true&exsentences=5&explaintext=true&piprop=thumbnail&pithumbsize=600&rvprop=timestamp&titles=Master_System we only need extract, thumbUrl, and something like revisionTime. We could flatten the pages array and the first object since this API only returns one item. I don't think we need redirects, normalized, thumbnail.width/height. Not sure about pageid, ns, title.

GWicke added a comment.EditedNov 6 2015, 4:36 PM

Yeah, something like 'summary' or 'lede' / 'lead' sounds better to me as well. Advantage of something like 'lead' might be that it's more specific than 'summary', which leaves more room for other kinds of summaries.

Cool, summary it is. @Pchelolo, @GWicke, @bearND, @phuedx, @bmansurov: do you have a strong preference for formatversion=2 versus unspecified formatversion? As it's a singular title I don't know that it much matters, but I guess best to check.

@Pchelolo, got a running example that includes the 600px thumbnail or able to whip one up?

Cool, summary it is. @Pchelolo, @GWicke, @bearND, @phuedx, @bmansurov: do you have a strong preference for formatversion=2 versus unspecified formatversion? As it's a singular title I don't know that it much matters, but I guess best to check.

RESTBase calls to the MW API use formatversion=2 by default, so we'd better stick to it.

@Pchelolo, got a running example that includes the 600px thumbnail or able to whip one up?

I have the code, but it's still a WIP. Will be ready on Monday, it's already pretty late here. Is it 600px or 640px by the way?

bearND added a comment.Nov 6 2015, 5:43 PM

@GWicke I think lead might be confusing since we have a lead section request in the Mobile-Content-Service that is for actually showing the page content. I'd prefer summary.
@dr0ptp4kt I still think it should be 640 instead of 600. I don't think it matters if we use formatversion=2 or not since there are no booleans involved.
@Pchelolo Whatever is the default for formatversion is fine with me.

Cool, summary it is. @Pchelolo, @GWicke, @bearND, @phuedx, @bmansurov: do you have a strong preference for formatversion=2 versus unspecified formatversion? As it's a singular title I don't know that it much matters, but I guess best to check.

RESTBase calls to the MW API use formatversion=2 by default, so we'd better stick to it.

Yup. formatversion=2 si going to become the new default anyway in MW.

@Pchelolo, got a running example that includes the 600px thumbnail or able to whip one up?

I have the code, but it's still a WIP. Will be ready on Monday, it's already pretty late here. Is it 600px or 640px by the way?

640px.

@bearND, @dr0ptp4kt: Okay, lets go with summary.

@bmansurov this thread moved faster than I do through my inbox! Seem's like you have moved on, but since the question was asked: a thumbnail is not currently required for the first instance of link preview (though I expect it will be added later), per T113243

So we'll plan on having the thumbnail URL in the return payload (which will be ignored generally in mobile modalities, for now), but I think we should consider setting it to 320px instead of 600px or 640px.

https://github.com/wikimedia/mediawiki-extensions-Popups/blob/master/resources/ext.popups.renderer.article.js#L45 suggests that the pixels on desktop Hovercards should be a multiple of 300. Hovercards applies the bracketed multiplier scales of 1, 1.5, or 2.0 in https://github.com/wikimedia/mediawiki/blob/master/resources/src/jquery/jquery.hidpi.js#L70 (meaning on a retina Macbook Pro, for example, you might get 300 x 2 = 600). But the Android app uses its base thumb metric at PREFERRED_THUMB_SIZE as 320 instead of 300. The Android codebase also applies scaling factors, so it may be the case that 640 would be the correct dimensioning, even if not using the 640 image in a Link Preview context (e.g., for prefetching, although I don't think so).

Prescribed widths at the server side (as reflected in File: pages) start at 320, 640... per https://github.com/wikimedia/mediawiki/blob/5f15ac0058b56e8782829170b4397b2bacfa8089/includes/DefaultSettings.php#L1270. So that seems to better guarantee a warmly cached image.

If we go with pithumbsize=320 later we will need to adjust ext.popups.renderer.article.js and friends so it assumes the 320 image on desktop Hovercards. These things are supposed to be snappy, and it seems like we're not being economical with bandwidth/speed when tripling or quadrupling the file size at the 640px size (in a busy image this difference is significant).

You guys okay with a thumbnail size at 320? Or is there an overriding reason to use 640 instead for this summary use case?

bearND added a comment.Nov 9 2015, 4:11 PM

@dr0ptp4kt 320px thumbnail works for me.
@Pchelolo Thanks for the patch. I personally think we could drop the height and width of the thumbnail. At least on the Android app we don't use it.

@bearND I've removed the width and height properties from the response.

@bearND I've removed the width and height properties from the response.

These would come in rather handy for display calculations while awaiting the actual thumbnail, wouldn't they? Besides, that represents a rather low overhead. I think we should keep them.

I'm in favor of including the dimensions as well. At the very least, having the aspect ratio lets the client render the image in the right dimensions, which avoids content moving around.

Longer term, my preference would be to include the *original* dimensions & let the client select the thumb size & quality, but that will require the definition of a public thumb API. There is some discussion around that in T66214.

Tgr added a subscriber: Tgr.Nov 9 2015, 10:52 PM

Sentence segmentation is actually a rather hard task (see T61641), and the number of characters / sentences that can be reasonably displayed varies wildly with the interface. There definitely should be a way to extract a given number of sentences (once we are able to do that on the backend, that is). Maybe by the API marking up sentence borders or something like that, that plays better with caching.

@Tgr, I was wondering about that as well. Returning an array of sentences could make it relatively easy for the client to make its own decision, for example.

For pragmatic reasons, we would prefer not to deviate from an existing API response without strong reasons. That said, we could reconsider if this turns out to be a significant issue in practice.

@Tgr I totally agree, that in the current layout the endpoint lack generality, and that it would be really nice to have an option to get complete text with sentence border markup. However, splitting-up sentences and providing such a markup is not a RESTBase job. When (if) TextExtracts extension provides an opportunity to get text sliced to sentences, we might reconsider this endpoint. But until that, it's already nice to get the endpoint running as it's discussed here (returning 5 sentences and a thumbnail) to get an understanding of the speedup we can get from that.

@Pchelolo, the pull request suggests the thumbnail.source is on http://. Would you be able to update it so that suggests.thumbnail's URL value is on https:// instead? I'm thinking this reduces the privacy attack surface a bit, avoids redirect/HSTS handshakes, reduces programmer burden, etc. (in practice, how client code implicitly or explicitly handles this stuff varies; better to just make it explicit, I think).

dr0ptp4kt added a comment.EditedNov 10 2015, 2:51 PM

Thinking ahead, I agree with what others have said that down the road serving a response containing richer thumbnail and sentence data structures within this thing would be cool. To be clear, now doesn't seem the right time given we're bumping up against the December 2015 production freeze - better I think to get something working now. But maybe we could rally together once this first iteration is done to get something lined up for January deployment?

Anyhow, it seems like we'd want:

  • The thumbnail array / an object with 1.0, 1.5, and 2.0 DPI thumbnail URL expressions for the base 320 dimension.
  • The pre-transformed sentence output (e.g., with parenthetical removal) in one text field (for the RESTbase endpoint)

That would leave the client programmer with only having to adjust for the image pertinent to the pixel density, and generally otherwise just trusting whatever the service provided.

Speaking of transforms and reducing client development time: @bearND, which block of code does the sentences transformation on Android (I understand Mobile Content Service may be doing the same, although, I don't think we should pin web's Link Preview on Mobile Content Service given the fast approaching freeze)? I was thinking @bmansurov and @phuedx may want to take a look - probably easy enough to grab that and put it into the ResourceLoader module?

Pchelolo added a comment.EditedNov 10 2015, 2:54 PM

the pull request suggests the thumbnail.source is on http://. Would you be able to update it so that suggests.thumbnail's URL value is on https:// instead? I'm thinking this reduces the privacy attack surface a bit, avoids redirect/HSTS handshakes, reduces programmer burden, etc. (in practice, how client code implicitly or explicitly handles this stuff varies; better to just make it explicit, I think).

@dr0ptp4kt RB is just proxying the response returned by PageImages for a thumbnail, that's why it gives http right now, so perfect solution is to switch the extension to https. But for now could massage the response a bit to s/http/https in the response.
I'll update the PR.

... But for now could massage the response a bit to s/http/https in the response.

@Pchelolo: The Readers-Web-Backlog team maintain the PageImages extension. If there's a feature that needs adding which makes this endpoint as simple as possible, then that should be where it's added.

The same goes for TextExtracts.

@phuedx Ok. So, RESTBase issues requests to MW API over HTTP, as these are internal request, so no need for HTTPS.

However, PageImages extension expand the thumbnail URI with PROTO_CURRENT (here). If it's possible to make it use PROTO_HTTPS or adding a parameter to control that, it would be really nice, as we wouldn't have to massage the output in RESTBase

@phuedx @Pchelolo, would proto-relative urls (//en.wikipedia.org/...) work too?

@GWicke, @Pchelolo: I was about to ask the same question. So PROTO_RELATIVE?

@GWicke, @Pchelolo: I was about to ask the same question. So PROTO_RELATIVE?

Yup, that's the most correct way to go, also because that means the extension will continue to work on third-party installs. +1.

Change 252276 had a related patch set uploaded (by Phuedx):
Return protocol-relative URLs for images

https://gerrit.wikimedia.org/r/252276

Speaking of transforms and reducing client development time: @bearND, which block of code does the sentences transformation on Android (I understand Mobile Content Service may be doing the same, although, I don't think we should pin web's Link Preview on Mobile Content Service given the fast approaching freeze)? I was thinking @bmansurov and @phuedx may want to take a look - probably easy enough to grab that and put it into the ResourceLoader module?

The last three methods in LinkPreviewContents.java do the client side text extract munging. This one takes the locale into account.

I think the mobile Content service does it slightly differently since it doesn't take the current locale into account. It's using a bunch of regexes instead, which reminds me we can probably get rid of a lot of code there since it apparently comes form Parsoid.

GWicke added a comment.EditedNov 10 2015, 9:58 PM

Perhaps it is worth updating the textextracts return value to make it easier to select sentences, after all? It sounds like we already have a good amount of duplicated effort around sentence parsing.

dr0ptp4kt added a comment.EditedNov 11 2015, 12:21 AM

@GWicke, I'm more partial to just providing the fully transformed text payload as one contiguous string (which is as I understand is supposed to equate to EXTRACT_MAX_SENTENCES = 2 sentences total strung together), as opposed to providing an array of strings. I'm thinking people writing UIs may be more prone to making errors re-concatenating stuff and dealing with different stop characters otherwise. Does that make sense, or are there other angles to consider for this first use case?

In any event, this seems ripe for centralizing the logic. I just don't know if now is the time to centralize the logic. How much work is it to introduce this logic into the backing service for the RESTbase endpoint?

In any event, this seems ripe for centralizing the logic. I just don't know if now is the time to centralize the logic. How much work is it to introduce this logic into the backing service for the RESTbase endpoint?

There is no backing service for this. The idea is that RESTBase would pull in the info from the MW API when a page changes and would store/cache the result. So, I'm really not keen on RESTBase doing the extra work, as that impacts its performance.

I posted a comment about having an optional parameter, proto=false, for the pageimages API. This would allow for RESTbase middleware (and other direct api.php pageimages consumers) to explicitly signal they want a protocol relative URL, obviating the need for investigation about backwards incompatibility for the time being.

@mobrovac, makes sense about the backing service and performance piece. I suppose in the interim it would be nice for consumers to have links to examples of how different clients parse today. I for one would prefer that at least the Wikimedia clients do it in a consistent manner.

Tgr added a comment.Nov 11 2015, 7:23 PM

PROTO_RELATVE will expose users viewing the snippet in a HTTP page to man-in-the-middle attacks. That's arguably a configuration problem though; filed T118413.

Tgr added a comment.Nov 11 2015, 7:42 PM

However, splitting-up sentences and providing such a markup is not a RESTBase job. When (if) TextExtracts extension provides an opportunity to get text sliced to sentences, we might reconsider this endpoint.

Sure, but it should be kept in mind from start, otherwise you might end up with a format that cannot easily be changed to include sentence splitting. E.g. if the API just returns a plaintext snippet, it's not going to be easy to introduce sentence separators in a backwards-compatible manner.

dr0ptp4kt added a comment.EditedNov 11 2015, 8:20 PM

Adding a new array (e.g., ['Hadoop is a "big data" technology', '.', ' ', 'It is used by data scientists', '.', ' ' ...]) to the response should be easy enough if we want to do that in the future. I think we'd want to add that to the TextExtracts api.php endpoint (perhaps requiring an additional parameter specified against api.php) so that RESTbase is relieved of the responsibility, though, to minimize the RESTbase performance concern raised by @Pchelolo.

As for reducing the burden on consuming clients (all of which will be newly implementing their client side code) of doing any extract munging: it seems the pertinent thing would be to make sure we mark the unsplit extract field as "unstable" for now in the specification, with the understanding that we will eventually standardize that response to be a standardized two sentences; the advice to the developer would be "munge if you like, but we're going to make this painless for you in due time". When that field becomes stable, clients could remove their munging, which in practice would be no-ops anyway.

There's discussion on the Gerrit change about the precise implementation, but in any event, the way the RESTbase config calls the actual api.php endpoint should be such that the thumbnail URL is an https:// one.

We should wrap up the api.php implementation soon in order for the timing of All the Things to work out without hitting the December freeze.

We should wrap up the api.php implementation soon in order for the timing of All the Things to work out without hitting the December freeze.

252276 is but one way of fixing this problem for now. Certainly, getting the API to return https URLs for thumbnails *in the API* is probably the correct way but it's been suggested before that the RESTBase service *could* munge the URL.

There's no reason why the workaround and the correct solution couldn't be implemented in parallel – the duplication of effort argument doesn't hold too much weight here as it feels like the workaround is a one-liner.

There's no reason why the workaround and the correct solution couldn't be implemented in parallel – the duplication of effort argument doesn't hold too much weight here as it feels like the workaround is a one-liner.

Agreed. The https bit is very minor and won't block us either way.

@Tgr: I agree that there are several things that should be improved about the output before marking this as stable. For that reason, I think it makes sense to start this end point as "experimental", which explicitly reserves the right to change it at any time. We should know more by January or so, when we can revisit this.

Right. @phuedx, @bearND should we just go with the existing RESTbase configuration and make it explicit that the clients should s/^http:/https:/ the URL for the time being? It would be nifty to get this deployed even earlier next week.

@mobrovac, if I heard you correctly yesterday, once we have a new format for calling the API such that it will return an https:// thumbnail URL (i.e., maybe in the next couple weeks, but maybe after the December freeze), we can then update the RESTbase config and in effect purge / truncate outdated stored responses (the ones with http:// thumbnail URL text payloads) so that all responses going forward would have the https:// thumbnail URLs. Do I understand correctly?

@dr0ptp4kt If I understand this correctly there would be no change to the pageimages API then. Since the /page/summary endpoint is a new endpoint that would be fine for the clients. We can easily prepend the protocol to the thumbnail URLs if missing[1].

[1] https://phabricator.wikimedia.org/diffusion/APAW/browse/master/app/src/main/java/org/wikipedia/util/UriUtil.java;d5607bde62fd28b9ddf531f8260f81cfadeb264e$60-62

@dr0ptp4kt: That's correct. Alternatively, we could add a little bit of temporary munging code that inserts https:// before storing it.

@bearND, with the existing response, the URL actually would be starting with http:// instead of the protocol relative //.

@Pchelolo, @GWicke, @mobrovac, will it be acceptable to you guys to transform the thumbnail URL from http:// to https:// for the short run (e.g., for deployment next week)? Or does that destroy performance benefits?

@dr0ptp4kt, @GWicke Then I would prefer if the /page/summary endpoint would change the protocol from http to https.

@GWicke, @Pchelolo, @mobrovac: is @bearND's request for the protocol transformation feasible soon? Or do consumers of the endpoint need to implement this on their own for now?

@GWicke, @Pchelolo, @mobrovac: is @bearND's request for the protocol transformation feasible soon? Or do consumers of the endpoint need to implement this on their own for now?

We have discussed this in our meeting and we will provide a s/http/https/ in RESTBase for now, so that clients do not need to do magical stuff. However, I would urge you to change that on the MW side ASAP, so we can remove it from RESTBase.

As per our sync meeting last week, this will be deployed on the RESTBase side Thursday or Friday the latest.

@mobrovoc, @Pchelolo, @GWicke - double thanks! Looking forward to trying it out!

GWicke added a comment.EditedNov 20 2015, 2:55 AM

https://github.com/wikimedia/restbase/pull/421 is now merged, but is not deployed yet. Basically the entire implementation can be found in this yaml file, with a small helper for the http -> https transform added in this JS module.

I'm testing this now, but don't expect this to be deployed before Monday as it is already Thursday night.

GWicke closed this task as Resolved.Dec 3 2015, 7:49 PM

I'm calling this resolved. Lets track further improvements in follow-up tasks.

dr0ptp4kt added a comment.EditedDec 3 2015, 7:52 PM

Thanks, @GWicke. As an aside, I've noted in a separate meeting that the code has a TODO on determining the desired last-edited datetime stamp. Off the cuff, I'm not sure why that's there in Hovercards - any updates to Hovercards would examine the need for it, to be sure (CC @Nirzar).

JKatzWMF removed a subscriber: JKatzWMF.Dec 3 2015, 10:07 PM

Change 252276 abandoned by Phuedx:
Return protocol-relative URLs for images

Reason:
Way, way out of date.

https://gerrit.wikimedia.org/r/252276