
As an end-user I shouldn't see non-articles in the list of top articles
Closed, Resolved (Public)

Description

The inclusion of pages which a user wouldn't consider an article in the pageview/top API drastically hinders the usefulness of any feature using this data. Solving this problem at the lowest API level possible will allow downstream API clients (including middleware services) to use this data to build features with confidence (and w/o regexes and heuristics).

For example: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2016/01/18

"articles": [
  {
    "article": "Main_Page",
    "views": 19257663,
    "rank": 1
  },
  {
    "article": "Special:Search",
    "views": 2144393,
    "rank": 2
  },
  {
    "article": "-",
    "views": 758591,
    "rank": 3
  },
  ...
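A minimal sketch of how a client might build the request URL for a given day, assuming the path layout seen in the example URL above (project, access method, then zero-padded year/month/day). The function name `top_url` and its parameters are illustrative and not part of any official client.

```python
from datetime import date

# Base path copied from the example URL in this task.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"


def top_url(project: str, day: date, access: str = "all-access") -> str:
    """Build a pageviews/top URL shaped like the example in this task."""
    # strftime-style format codes give the zero-padded year/month/day path.
    return f"{BASE}/{project}/{access}/{day:%Y/%m/%d}"


print(top_url("en.wikipedia", date(2016, 1, 18)))
```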

Event Timeline

BGerstle-WMF raised the priority of this task to Needs Triage.
BGerstle-WMF updated the task description.
BGerstle-WMF moved this task to Incoming on the Analytics board.
BGerstle-WMF updated the task description.
BGerstle-WMF set Security to None.
BGerstle-WMF added subscribers: Jdlrobson, Mhurd.

For example, top.hatnote.com uses a meta query to filter out a site's main page, "-", and anything outside the main namespace:

From https://github.com/hatnote/top/blob/master/top/get_data.py#L54:

def is_article(title, wiki_info):
    '''\
    Is it an article, or some other sort of page? We'll want to filter out the
    search page (Special:Search in English, etc) and similar pages appearing
    inconveniently in the traffic report.
    '''
    skip = ['-'] + [wiki_info['mainpage']]
    prefixes = PREFIXES + wiki_info['namespaces']
    if title in skip:
        return False
    for prefix in prefixes:
        if title.startswith(prefix + ':'):
            return False
    return True
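To make the heuristic concrete, here is a self-contained sketch of the same idea applied to entries shaped like the API response in the task description. The values of `PREFIXES` and `WIKI_INFO` below are made-up stand-ins (the real ones come from hatnote's code and a per-wiki meta query), `filter_top` is a hypothetical helper, and `Example_Article` is an invented entry added so that something survives the filter.

```python
# Hypothetical stand-ins; the real values live in hatnote's code and in the
# result of a per-wiki meta query.
PREFIXES = ["Special"]
WIKI_INFO = {"mainpage": "Main_Page", "namespaces": ["Talk", "User", "File"]}


def is_article(title, wiki_info, prefixes=PREFIXES):
    """Return False for the main page, "-", and namespaced titles."""
    if title in {"-", wiki_info["mainpage"]}:
        return False
    # str.startswith accepts a tuple, so one call covers every prefix.
    namespaced = tuple(p + ":" for p in list(prefixes) + wiki_info["namespaces"])
    return not title.startswith(namespaced)


def filter_top(articles, wiki_info):
    """Drop non-articles and re-rank the survivors starting from 1."""
    kept = [a for a in articles if is_article(a["article"], wiki_info)]
    return [dict(a, rank=i) for i, a in enumerate(kept, start=1)]


sample = [
    {"article": "Main_Page", "views": 19257663, "rank": 1},
    {"article": "Special:Search", "views": 2144393, "rank": 2},
    {"article": "-", "views": 758591, "rank": 3},
    {"article": "Example_Article", "views": 1000, "rank": 4},  # invented entry
]
print(filter_top(sample, WIKI_INFO))
```

Because filtering happens after the top list is computed, the result is shorter than the input; a caller that needs a fixed number of articles would have to over-fetch.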

Applying this kind of heuristic as part of the API call is possible. Clients would then get fewer than 1000 articles, but I assume that's fine. But to truly cleanly report only Main Namespace articles, we need to access page table data from mediawiki databases in real-time. And right now the data crunched for the API is crunched by Hive in batches. So we're missing this bridge from mediawiki data to the Hadoop cluster. Building this bridge is a very high priority for us, but not an easy task. If someone sees a better way, I'm certainly open to it. Maybe I'm bundling too many problems together here and there's a vertical slice I'm missing.

we need to access page table data from mediawiki databases in real-time. And right now the data crunched for the API is crunched by Hive in batches. So we're missing this bridge from mediawiki data to the Hadoop cluster.

Is there no way to denormalize that info when writing data to Hadoop?


Not right now, for two reasons:

  1. Whatever we query for that data would have to deal with the full firehose (200k / second). A cache would help, of course, because that data doesn't change very much, but we'd essentially be doubling our traffic instead of just joining to data we already have on the back-end. So joining seems like the better long-term solution.
  2. page_id is not consistently coming across with all types of requests. We're working on improving coverage of that, though.
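As a rough illustration of the back-end join described above, the sketch below joins a batch of pageview rows to a snapshot of page data on page_id and keeps only the main namespace (namespace 0 in MediaWiki). All rows, the snapshot, and the function name `main_namespace_views` are invented for illustration; the actual pipeline runs in Hive, not Python.

```python
# Hypothetical batch rows: (page_id, title, views). page_id may be None,
# mirroring the incomplete page_id coverage mentioned above.
pageviews = [
    (1, "Main_Page", 500),
    (2, "Some_Article", 300),
    (None, "Special:Search", 200),
    (3, "Talk:Some_Article", 50),
]

# Hypothetical snapshot of page table data: page_id -> namespace number.
page_namespace = {1: 0, 2: 0, 3: 1}


def main_namespace_views(rows, namespaces):
    """Join rows to the snapshot on page_id and keep only namespace 0."""
    kept, unmatched = [], []
    for page_id, title, views in rows:
        ns = namespaces.get(page_id)
        if ns is None:
            # No page_id, or no matching row in the snapshot.
            unmatched.append(title)
        elif ns == 0:
            kept.append((title, views))
    return kept, unmatched


kept, unmatched = main_namespace_views(pageviews, page_namespace)
print(kept)
print(unmatched)
```

Note that the main page itself lives in the main namespace, so a namespace join alone would not remove it; it would still need to be excluded separately.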

Another thing that's not obvious is how clients show pageview data across time zones. We will need to talk more about the mechanics of when the pageview data updates and how that impacts UX across time zones.

To do this we need the page ID. The computation of top pages (rather than the pageview API) can then decide whether it is pertinent to include a page. But, again, the page ID is needed.

Fjalapeno renamed this task from As an end-user I shouldn't see non-articles in the list of trending articles to As an end-user I shouldn't see non-articles in the list of top articles. Jul 13 2017, 2:35 PM
Fjalapeno subscribed.

@Nuria is this still a valid ticket?

Mholloway claimed this task.

Services consuming the raw pageview data provided by AQS can and do filter out non-articles. This is a solved problem.

We still want to improve our endpoints in the future, allowing users to get top articles in specific namespaces, categories, and wiki projects, but those ideas aren't on our priority list.