
[SPIKE] As a user I'd like to see the top most read articles for my Wikipedia in the Explore feed
Closed, ResolvedPublic

Description

Use the Pageviews API to add a feed section showing the most popular current articles on your language's Wikipedia.

Top viewed pages will be timely and relevant, and give readers a way to see what other readers are into right now on Wikipedia. It's a natural fit for the purpose and format of the Explore feed.

Step one is to investigate the current suitability and capabilities of the Pageview API. We can then get into potential design and product details.
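For reference, the investigation centers on the `pageviews/top` endpoint of the Wikimedia REST API. A minimal sketch of building the request URL and reading the response envelope (the URL shape and field names match the public API; the sample payload and its numbers are illustrative only):

```python
# Sketch of the Pageviews API "top" endpoint.
# Real URL shape: /metrics/pageviews/top/{project}/{access}/{year}/{month}/{day}
def top_articles_url(project="en.wikipedia", access="all-access",
                     year=2016, month=1, day=15):
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"
    return f"{base}/{project}/{access}/{year}/{month:02d}/{day:02d}"

# Illustrative payload mirroring the API's documented response shape:
sample_response = {
    "items": [{
        "project": "en.wikipedia",
        "access": "all-access",
        "articles": [
            {"article": "Main_Page", "views": 18793503, "rank": 1},
            {"article": "-", "views": 6564, "rank": 2},
            {"article": "David_Bowie", "views": 2073930, "rank": 3},
        ],
    }]
}

def extract_articles(payload):
    """Flatten the top-articles list out of the response envelope."""
    return payload["items"][0]["articles"]
```

Note that `Main_Page` and the mysterious `-` show up in real results, which is exactly the filtering problem discussed below.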

Event Timeline

JMinor raised the priority of this task to Needs Triage.
JMinor updated the task description. (Show Details)
JMinor renamed this task from As a user I'd like to see the top most read articles for my Wikipedia in the Explore feed to [SPIKE] As a user I'd like to see the top most read articles for my Wikipedia in the Explore feed.Jan 15 2016, 8:42 PM
JMinor set Security to None.

Grabbed this as a fun weekend ticket.

See the PR for a WIP screen recording animation:
https://github.com/wikimedia/wikipedia-ios/pull/388

Will investigate the other open questions around using this API probably (here and there) next week. Already spoke to Dario a bit and was going to chat with Kevin next.

So far, @Mhurd has found the following issues w/ integrating the API:

  • Not able to limit number of results
    • Additionally, will eventually want to paginate results
  • Need to perform aggregation of article extract, page image, wikidata desc. etc. (i.e. "card" data)
  • Need to (heuristically?) filter results to only specific namespaces and/or restrict certain pages (e.g. Main Page, and whatever "-" is)
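Pending server-side support, the first and third points can be worked around on the client by filtering and truncating the full result list. A hedged sketch (the blacklist entries `Main_Page` and `-` come straight from this task; everything else is an assumption):

```python
# Client-side workaround: drop known non-article rows, then truncate
# to the desired count. `articles` is the list of
# {"article": ..., "views": ..., "rank": ...} dicts from the API.
BLACKLIST = {"Main_Page", "-"}  # titles called out in this task

def top_n_filtered(articles, n=10, blacklist=BLACKLIST):
    kept = [a for a in articles if a["article"] not in blacklist]
    return kept[:n]

# Illustrative input with made-up view counts:
articles = [
    {"article": "Main_Page", "views": 18793503, "rank": 1},
    {"article": "-", "views": 6564, "rank": 2},
    {"article": "David_Bowie", "views": 2073930, "rank": 3},
    {"article": "Alan_Rickman", "views": 1399193, "rank": 4},
]
```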

@Milimetric are there any plans to address these? on a larger scale, how suitable is this API for end-user client consumption?

So far, @Mhurd has found the following issues w/ integrating the API:

@Milimetric are there any plans to address these? on a larger scale, how suitable is this API for end-user client consumption?

I'll re-order the bullet points below and answer, but please create a task for each thing you'd like us to work on, and assign it what you see as the priority. You can tag it Analytics and ping me if we ignore it for too long.

  • Not able to limit number of results

Easy to do

  • Additionally, will eventually want to paginate results

Also easy to do, but the server would just simulate this, since it's really fast for us to just get all 1000. We won't be able to get more than 1000, though.
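Since the full result set is capped at 1000 rows and cheap to fetch, the pagination described here can be simulated on either side by slicing. A minimal sketch:

```python
def paginate(results, page, per_page=50):
    """Simulated pagination over the full (max 1000-row) result set.
    Pages are 1-indexed; an out-of-range page yields an empty list."""
    start = (page - 1) * per_page
    return results[start:start + per_page]

rows = list(range(1000))  # stand-in for the 1000 top-article rows
```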

  • Need to perform aggregation of article extract, page image, wikidata desc. etc. (i.e. "card" data)

I personally feel like there should be a separate API for this, see more below

  • Need to (heuristically?) filter results to only specific namespaces and/or restrict certain pages (e.g. Main Page, and whatever "-" is)

This, like the aggregation above, is hard right now: we don't have a great way to join to MediaWiki data to get this information. A join on article title works for the happy cases, but article renames, weird characters, and other annoying corner cases make it an imperfect thing. We are working on getting page_id into the pageview data pipeline, and then all of this will be much easier.
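The fragility of the title join comes down to normalization: pageview rows use underscore-form titles, while other sources may carry display variants, and renames break the mapping entirely. A sketch of the happy-path normalization, which is exactly the part that works (renames and odd characters still fall through, which is the gap the page_id pipeline work is meant to close):

```python
def normalize_title(title):
    """Happy-path join key: spaces -> underscores, uppercase first letter.
    Renamed pages and unusual characters are NOT handled here; those are
    the corner cases that make a pure title join imperfect."""
    t = title.strip().replace(" ", "_")
    return t[:1].upper() + t[1:] if t else t
```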

@Milimetric thanks!

I personally feel like there should be a separate API for this, see more below

Agreed, this is more of a long-term aspiration, much like the pagination. In fact, we might not need to paginate the pageview API if the end-client doesn't use it directly.
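Until that separate API exists, the "card" data can be batched from the MediaWiki action API, one request per chunk of titles. A hedged sketch of building such a query URL (`extracts`, `pageimages`, and `pageterms` are real MediaWiki API prop modules; the exact prop set the app would want is an assumption):

```python
from urllib.parse import urlencode

def card_query_url(titles, lang="en"):
    """Build one MediaWiki action API request fetching the extract,
    lead image thumbnail, and Wikidata description for a batch of titles."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts|pageimages|pageterms",
        "exintro": 1,                # extracts: intro section only
        "explaintext": 1,            # extracts: plain text, no HTML
        "piprop": "thumbnail",       # pageimages: thumbnail URL
        "wbptterms": "description",  # pageterms: Wikidata description
        "titles": "|".join(titles),
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)
```

Batching matters here: one round trip per chunk of top articles, rather than one per card.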

This, like the aggregation above, is hard right now...

Is it not feasible to include the page's namespace when querying the data, filtering at that point? Even exposing the namespace at the output so end/middleware clients can filter is better than nothing.
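Until the namespace is exposed in the output, one client-side heuristic is to drop titles whose prefix before a colon matches a known namespace name. A sketch (the namespace set here is a small illustrative subset, not the full list for any wiki, and colons can legitimately appear inside article titles):

```python
# Heuristic namespace filter: mainspace titles have no recognized
# "Namespace:" prefix. Only known prefixes are dropped, because titles
# like "Star_Wars:_Episode_IV" contain colons legitimately.
KNOWN_NAMESPACES = {"Special", "Talk", "User", "Wikipedia", "File",
                    "Template", "Category", "Portal", "Help"}  # subset

def is_mainspace(title):
    prefix, sep, _ = title.partition(":")
    return not (sep and prefix in KNOWN_NAMESPACES)
```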

@Milimetric on second thought, I'll just file tasks for some of this so we can discuss in detail there. The namespace/filtering issue is the biggest dilemma, as we can get by with client-side aggregation until an appropriate aggregation strategy is devised for the server side.

@Milimetric one question you didn't address was the stability of the API itself. the Swagger spec declares it as "experimental," would you caution us against using it?

@Milimetric one question you didn't address was the stability of the API itself. the Swagger spec declares it as "experimental," would you caution us against using it?

Sorry about that. I think it's stable enough to be used. The output is being cached, and we tried to make sure the cache doesn't get too fragmented by the parameters (one of the main reasons we went with RESTBase). So whereas the API itself can only be hit around 200 times per second, the caching should allow us to use it for user-facing features. If we're starting to implement these kinds of features on the sites and apps, though, we want to know so we can make some improvements we have prepared (like hosting a router for the API on each wiki so the client doesn't have to do two DNS lookups). But, for now, yes, this should be stable enough to use. The "experimental" label is kind of over-cautious copy-pasting at this point :)

Since this is the spike ticket, are we spiked enough? There's a follow-on ticket for the actual implementation:
T124716

JMinor triaged this task as Medium priority.Jan 28 2016, 9:50 PM

Resolving spike, the output of which was the task for the actual feature: T124716