Add page view data to summaries?
Closed, DeclinedPublic

Description

The summary endpoint is a great way to get metadata about a page that is useful to display to users to help them decide if they would like to read an article.

Page views have been a great addition to the APIs and are increasingly being added to designs for apps. Just like the Wikidata description and thumbnail, they are becoming a basic piece of data that clients may want to display in many other contexts.

This leads to the question: Is it feasible to return Pageview data in the summary of articles?

How would this affect caching / cache invalidation / storage / CPU?

Event Timeline

This leads to the question: Is it feasible to return Pageview data in the summary of articles?

From a first impression, no. Right now we invalidate and re-render a summary only when the page is edited, so we have a really high cache hit rate and don't do much computing to keep summaries in sync with the article content and Wikidata. Page views, however, change daily, so to keep summaries in sync with the page view data we'd need to purge them from Varnish and partially re-render them every day. We can think about 'on demand' solutions here, but even from the philosophical view I don't think page views fit in the summary. The endpoint returns you the "Short description of the concept under a certain title". How's page views connected to it?

Do you have any exact use cases for this?

…even from the philosophical view I don't think page views fit in the summary. The endpoint returns you the "Short description of the concept under a certain title". How's page views connected to it?

@Pchelolo From the Reading perspective, I'd probably categorize the summary to be "most important information a user would want to know about an article before committing to reading it".

In this context, page views over the past few days can convey both "how popular it is" and "how popular is it recently" which help inform the decision of whether to read it.

One specific use case is the sparkline on widgets (iOS only currently):

IMG_0603.PNG (2×1 px, 3 MB)

This design is going to be brought to the feed for news items (and possibly for Top Read) for both platforms.

Currently we are making several requests for this data to construct the view. It would be nice if this was returned as part of the summary in those cases.

An alternative would be to add a link to the corresponding pageview API endpoint for the title whose summary is being requested so that clients can go look for that information (you would still need to make multiple requests so this is not really what you are looking for here). But I agree in general with @Pchelolo, I don't think pageview data fits there. Perhaps if the summary is thought of as the summary meta-data for a title, but the complexity around merging it in decreases the hit rate drastically. One other tricky part is that we need to wait for the AQS data to become available. That means there would need to be a background mechanism that does that.

An alternative would be to add a link to the corresponding pageview API endpoint

The link contains a date, so we can't really do that. To avoid that we'd need to put some kind of a template for the link, which is overcomplicating things from my point of view.

@Fjalapeno In those cases we could add another item to the $merge array like this

$merge: [
  "https://en.wikipedia.org/api/rest_v1/page/summary/Johann_Wolfgang_von_Goethe",
  "https://en.wikipedia.org/api/rest_v1/page/views/Johann_Wolfgang_von_Goethe"
]

assuming we'd have another endpoint to display the page views called page/views, metrics/views, or whatever. It should be a view to something like

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/2017012800/2017020200 minus all the extra keys. I think we only need the views and timestamp keys. timestamp in a more standard format would be good, too.
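As a rough illustration of "minus all the extra keys", here is a minimal Python sketch that reduces an AQS per-article response to just timestamp and views, with the timestamp converted to an ISO date. The function name `trim_pageviews` and the view count in the sample are made up for illustration; the item keys follow what the per-article endpoint returns today.

```python
from datetime import datetime


def trim_pageviews(aqs_response):
    """Reduce an AQS per-article response to timestamp/views pairs."""
    trimmed = []
    for item in aqs_response.get("items", []):
        # AQS timestamps look like "2017012800" (YYYYMMDDHH).
        day = datetime.strptime(item["timestamp"], "%Y%m%d%H").date()
        trimmed.append({"timestamp": day.isoformat(), "views": item["views"]})
    return trimmed


# Sample item shaped like an AQS response; the views value is invented.
sample = {
    "items": [
        {
            "project": "en.wikipedia",
            "article": "Johann_Wolfgang_von_Goethe",
            "granularity": "daily",
            "timestamp": "2017012800",
            "access": "all-access",
            "agent": "user",
            "views": 1234,
        }
    ]
}

print(trim_pageviews(sample))
```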

We'd have to make sure that the cache of this gets invalidated at the right time, which is the tricky part. Change prop to the rescue? If we had a good solution for this I think we'd also want it for MCS most-read.

@mobrovac @Pchelolo @bearND I think these are fair criticisms.

An alternative would be to add a link to the corresponding pageview API endpoint

The link contains a date, so we can't really do that. To avoid that we'd need to put some kind of a template for the link, which is overcomplicating things from my point of view.

I don't think this is necessary based on your comments.

Reframe the question:
A client fetches a list of 20 articles (say, based on a search) and wants to get the last 5 days of page views for all 20. Can you think of an efficient way to do that? Maybe a change to the pageview API?

(Just for clarity, the iOS app makes 20 API calls to do this now)
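To make the cost concrete, here is a small Python sketch of the one-request-per-title pattern: it only builds the per-article pageview URLs a client would have to fetch, using the URL shape quoted earlier in this thread. The helper name `pageview_urls` and its parameters are hypothetical, and no request is actually sent.

```python
from datetime import date, timedelta
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageview_urls(titles, end_day, days=5, project="en.wikipedia"):
    """Build one daily-pageviews URL per title for the last `days` days."""
    start_day = end_day - timedelta(days=days - 1)
    start = start_day.strftime("%Y%m%d") + "00"
    end = end_day.strftime("%Y%m%d") + "00"
    urls = []
    for title in titles:
        # Titles use underscores instead of spaces and must be URL-encoded.
        encoded = quote(title.replace(" ", "_"), safe="")
        urls.append(
            f"{BASE}/{project}/all-access/user/{encoded}/daily/{start}/{end}"
        )
    return urls


urls = pageview_urls(["Johann Wolfgang von Goethe"], date(2017, 2, 1))
print(urls[0])
```

With 20 titles the list has 20 entries, which is the 20 API calls mentioned above.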

Hm... Making some kind of a bulk-request functionality in the page views API goes against the REST philosophy and we've tried to avoid that.

Adding page views to summaries effectively means we can't cache for more than 24 hours, requires building a 'notification system' for when the data has been loaded into AQS, and makes us effectively wipe out all caches and storage every day.

So, maybe 20 requests is not that bad? With the adoption of HTTP/2, individual requests to the same domain become really cheap. Not really concluding anything, just thinking out loud.

I think making 20 requests is fine if it's ok on your end. Just making sure that's how you want it to be handled.

@Fjalapeno I'd like to push back a bit on adding pageviews #s to the summary data at this time. Maybe that can be revisited later when web wants this, too. We'll cross that bridge when we get there. In the meantime I propose to hydrate pageview data as part of the higher level request. For the example of the Explore feed it should be possible to add the pageview data via an additional $merge array item, as I mentioned earlier. The Explore feed case is actually the simpler case since the aggregated feed endpoint has a date parameter. That should still be nicely cacheable. In fact, we already have that for the most-read portion of it as of T148445: https://en.wikipedia.org/api/rest_v1/feed/featured/2017/02/01
(In this case it's done by MCS code and not using the hydration feature, but I think hydration would be doable if there were another pageview endpoint which just provided what we need in the form we want.)
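For readers unfamiliar with hydration, here is a simplified Python model of what resolving a $merge array could look like. It is only a sketch of the idea, not RESTBase's actual implementation: the `hydrate` function, the injected `fetch` callable, and the fake responses are all assumptions, and the page/views URL is the hypothetical endpoint discussed above.

```python
def hydrate(node, fetch):
    """Recursively replace {"$merge": [urls]} objects with merged responses."""
    if isinstance(node, list):
        return [hydrate(item, fetch) for item in node]
    if isinstance(node, dict):
        merged = {}
        # Fetch each listed resource; later ones override earlier keys.
        for url in node.get("$merge", []):
            merged.update(fetch(url))
        # Keys set directly on the object win over fetched ones.
        for key, value in node.items():
            if key != "$merge":
                merged[key] = hydrate(value, fetch)
        return merged
    return node


SUMMARY = "https://en.wikipedia.org/api/rest_v1/page/summary/Johann_Wolfgang_von_Goethe"
VIEWS = "https://en.wikipedia.org/api/rest_v1/page/views/Johann_Wolfgang_von_Goethe"

# Fake responses standing in for real HTTP calls; values are invented.
fake_responses = {
    SUMMARY: {"title": "Johann Wolfgang von Goethe"},
    VIEWS: {"views": [{"timestamp": "2017-01-28", "views": 1234}]},
}

item = {"$merge": [SUMMARY, VIEWS]}
print(hydrate(item, fake_responses.__getitem__))
```

The point of the design is that the feed response stays cacheable per date, while each merged resource keeps its own cache lifetime.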

It gets a bit more involved to do this with endpoints which don't have a date parameter. Then the pageview data of a new date-agnostic endpoint should be refreshed at least once a day. I'm not sure I understand the search example since we don't have a search endpoint provided through RESTBase. It really depends on the concrete example IMO.

I think we've reached an agreement here that the page views data will not be added to the summary endpoint. I'm closing the task as Declined.