Endpoint for average view rate in Pageview API
Open, Medium, Public

Description

View rate -- a rankable value that represents the general level of viewer interest in an article (or other page)

Use cases:

  • I'm a WikiProject maintainer and I want to sort my worklists by the articles' view rate. I cannot really do this well through the Pageview API, as I would need to request pageviews for an article since the beginning, and I have no "popularity baseline" to compare those to.
  • I'm SuggestBot and I'd like to give my users a notion of how often an article is viewed when I recommend that they edit it.
  • I'm a researcher and I want to compare the rate of views with the rate of edits.

Spec:

  • Each article would have a single rate.
  • A rate could be either one value (average views per <time unit>) or two values (start date of data, total number of views)
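
The two representations in the spec are interchangeable as long as the consumer knows the date through which views were counted. A minimal sketch of the conversion, assuming a daily time unit; the function name `views_per_day` and the `as_of` parameter are hypothetical and not part of any existing API:

```python
from datetime import date


def views_per_day(start: date, total_views: int, as_of: date) -> float:
    """Convert the two-value form (start date, total views) into views per day.

    Assumption: both `start` and `as_of` count as full days of data.
    """
    days = (as_of - start).days + 1
    if days <= 0:
        raise ValueError("as_of must not be earlier than start")
    return total_views / days
```

The two-value form lets each consumer pick its own time unit, while the one-value form is easier to sort on directly.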

Event Timeline

Thoughts on possible implementation:

  • output table is something like average_monthly_pageviews (page_id, page_title_latest, page_titles_previous, months_included, monthly_average_views)
  • run a monthly job
  • join mediawiki_page_history to pageview_hourly to get the most recent month by page_id
  • join the result to average_monthly_pageviews to get the average so far, compute the new average and write a new average_monthly_pageviews

The result would be a fairly efficient job that gets us:

  • average views per page_id
  • all the titles a page has had in one field, per page_id
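
The "average so far" step in the list above can be computed incrementally, without re-reading older months. A minimal sketch in plain Python rather than the actual job; the `PageAverage` class and `update_average` function are hypothetical names that mirror the columns of the proposed average_monthly_pageviews table:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageAverage:
    page_id: int
    months_included: int
    monthly_average_views: float


def update_average(
    previous: Optional[PageAverage], page_id: int, latest_month_views: int
) -> PageAverage:
    """Fold the most recent month's views into the running monthly average."""
    if previous is None:
        # First month seen for this page_id: the average is just that month.
        return PageAverage(page_id, 1, float(latest_month_views))
    n = previous.months_included
    new_average = (previous.monthly_average_views * n + latest_month_views) / (n + 1)
    return PageAverage(page_id, n + 1, new_average)
```

Only the previous average and the month count are needed per page, which is why the monthly job only has to read the latest month of pageview_hourly.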

And then we can serve this via another API endpoint like metrics/pageviews/per-article-average

  1. ?
  2. profit!

Sounds good @Milimetric

@Nettrom was very interested in this during our last discussion. I wrote the SuggestBot use-case based on a conversation with him.

Nuria added a subscriber: Nuria.

We should have an "editing tools" tag as this really seems like a backend that would be useful for those.

Milimetric triaged this task as Medium priority. May 8 2017, 2:27 PM

@Hall1467 & @DarTar FYI, this will make our work for entity usage (view rates) much easier. But it's not coming soon, so this is just an FYI.

@Nettrom, I think this is our biggest blocker for getting the Article Importance model hosted in ORES.

> I'm a WikiProject maintainer and I want to sort my worklists by the articles' view rate. I cannot really do this well through the Pageview API, as I would need to request pageviews for an article since the beginning, and I have no "popularity baseline" to compare those to.

I get that this would not solve your use case 100%, but you can compare the pageviews of any one page to those of the Main Page: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Cat|Main_Page

This is not meant for programmatic access, but it is a good proxy for "popularity": https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Main_Page|Donald_Trump

Coming back to this I have a bunch of questions, so I'll just ask them and see where we go from there. Apologies if this is counterproductive, feel free to let me know how to improve in future work.

How do we expose the data to the public, and how often should it be updated?

The SuggestBot use case prefers updated data. When I first wrote it, I decided on calculating view rate using a two-week period and using the most recent data available, because I wanted it to be affected by recent trends and to be as close to current as I could get it. SuggestBot is currently small scale and not time critical, meaning the pageview API serves it well (ref [1]).
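
For illustration, a two-week view rate can be derived from the existing per-article endpoint of the pageview API roughly as follows. This is a sketch and not SuggestBot's actual code: the helper names are hypothetical, and counting days that are missing from the response as zero views is an assumption.

```python
from datetime import date, timedelta
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, title, end, days=14, access="all-access", agent="user"):
    """Build the daily per-article request URL for the `days` days ending on `end`."""
    start = end - timedelta(days=days - 1)
    article = quote(title.replace(" ", "_"), safe="")
    return (
        f"{BASE}/{project}/{access}/{agent}/{article}"
        f"/daily/{start:%Y%m%d}/{end:%Y%m%d}"
    )


def average_views(response, days=14):
    """Average daily views from a decoded JSON response.

    Assumption: days absent from `items` are treated as zero views,
    so the total is divided by `days`, not by the number of items.
    """
    total = sum(item["views"] for item in response.get("items", []))
    return total / days
```

Fetching the URL and decoding the JSON is left to whichever HTTP client the caller already uses.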

As a researcher, I might be working with millions of pages, a case where a monthly dump will work fine.

How do we define view rate? SuggestBot uses a 14-day average as mentioned above. In our 2015 ICWSM paper (ref [2]), we used 28 days as a substitute for a calendar month, and I've used that as a basis in the current work on article importance (ref [3]) as well. In a recent paper by Sen et al. (ref [4]), they used the median of a 100-day sample over a one-year period as their view-based ranking component. I'm not sure if we should use all available data if possible, restrict it to a given timespan, allow an arbitrary timespan, or sample like Sen et al. did. Do we restrict it to articles, or are we interested in all types of pages?
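
To make the difference between these definitions concrete, here is a small sketch that computes each of them from a list of daily view counts (oldest first). The function names are hypothetical, and drawing the 100 days uniformly at random is an assumption; the description above does not say how Sen et al. chose their sample.

```python
import random
from statistics import mean, median


def mean_last_n(daily_views, n):
    """Mean of the most recent `n` days (n=14 for SuggestBot, n=28 for the ICWSM paper)."""
    return mean(daily_views[-n:])


def median_of_sample(daily_views, sample_size=100, period=365, seed=0):
    """Median of a random sample of days drawn from the most recent `period` days.

    Assumption: days are sampled uniformly without replacement.
    """
    window = daily_views[-period:]
    rng = random.Random(seed)
    sample = rng.sample(window, min(sample_size, len(window)))
    return median(sample)
```

For a page with a flat history and a two-week spike at the end, the 14-day mean reports the spike, the 28-day mean reports a blend of the two levels, and the sampled median stays at the baseline. Which of these is "right" depends on whether the consumer wants recent trends or long-term interest.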

Regardless of definition, how do we handle pages for which all data isn't available? In the importance prediction project, I'm working on treating those articles separately and using either a 99% or 95% confidence interval if possible, or perhaps just using the lower end of the confidence interval (the latter is what Microsoft's TrueSkill does, which I thought of when working on this (ref [5])).
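
One way to read the "lower end of the confidence interval" idea is sketched below. It assumes a normal approximation for the mean of the daily counts, which is my assumption for illustration and not necessarily what the importance prediction project does; the function name and the choice to return 0.0 for fewer than two days of data are likewise hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided z-scores for the confidence levels mentioned above.
Z_SCORES = {0.95: 1.96, 0.99: 2.576}


def conservative_view_rate(daily_views, confidence=0.95):
    """Lower bound of a normal-approximation confidence interval for mean daily views."""
    n = len(daily_views)
    if n < 2:
        # Assumption: with fewer than two days of data, report no evidence of interest.
        return 0.0
    half_width = Z_SCORES[confidence] * stdev(daily_views) / sqrt(n)
    return max(0.0, mean(daily_views) - half_width)
```

With this kind of estimate, a page with only a few days of data ranks below an older page with the same observed mean, and the gap shrinks as more data arrives.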

Lastly, do we take views from redirects into account? Hill and Shaw (ref [6]) recommend that research on pageviews should. Our ICWSM paper does, and if I remember correctly, WikiBrain (ref [7]) does as well. Handling the research use case therefore suggests either providing only a dataset with redirects taken into account, or having it as an option.
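
Taking redirects into account amounts to crediting each redirect's views to its target page. A minimal sketch, assuming the redirect mapping is already resolved to final targets (no chains) and using a hypothetical function name:

```python
from collections import defaultdict


def fold_redirect_views(views_by_title, redirect_targets):
    """Sum the views of every redirect into its target page.

    `redirect_targets` maps a redirect title to the title it points at.
    Assumption: each entry already points at the final target.
    """
    totals = defaultdict(int)
    for title, views in views_by_title.items():
        target = redirect_targets.get(title, title)
        totals[target] += views
    return dict(totals)
```

For example, with views `{"Cat": 100, "Cats": 5, "Dog": 50}` and redirects `{"Cats": "Cat"}`, the result credits 105 views to "Cat" and leaves "Dog" at 50. Offering this as an option would mean publishing both the raw and the folded totals.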

Quick summary:

  1. How do we expose the data, and how up to date should it be?
  2. How do we define "view rate"?
  3. How do we handle pages without complete data?
  4. Do we consider redirects?

References:

  1. https://github.com/nettrom/suggestbot/blob/master/suggestbot/utilities/page.py (doesn't use the Python library, but might in the future)
  2. Warncke-Wang, M., Ranjan, V., Terveen, L., and Hecht, B. "Misalignment Between Supply and Demand of Quality Content in Peer Production Communities" (ICWSM 2015) http://www-users.cs.umn.edu/~morten/publications/icwsm2015-popularity-quality-misalignment.pdf
  3. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
  4. Shilad Sen, Anja Beth Swoap, Qisheng Li, Brooke Boatman, Ilse Dippenaar, Rebecca Gold, Monica Ngo, Sarah Pujol, Bret Jackson, Brent Hecht "Cartograph: Unlocking Spatial Visualization Through Semantic Enhancement" (IUI 2017) http://www.shilad.com/static/cartograph-iui-2017-final.pdf
  5. https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/
  6. Hill, Benjamin Mako, and Aaron Shaw. "Consider the redirect: A missing dimension of Wikipedia research." (OpenSym 2014) https://mako.cc/academic/hill_shaw-consider_the_redirect.pdf
  7. https://shilad.github.io/wikibrain/

Regarding views from redirects, see also T121912 and note that if/when T53736 is implemented, it may imply major changes in this regard.