As an end-user I shouldn't see non-articles in the list of top articles
Closed, Resolved · Public

Description

The inclusion of pages which a user wouldn't consider an article in the pageview/top API drastically hinders the usefulness of any feature using this data. Solving this problem at the lowest API level possible will allow downstream API clients (including middleware services) to use this data to build features with confidence (and w/o regexes and heuristics).

For example: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2016/01/18

"articles": [
  {
    "article": "Main_Page",
    "views": 19257663,
    "rank": 1
  },
  {
    "article": "Special:Search",
    "views": 2144393,
    "rank": 2
  },
  {
    "article": "-",
    "views": 758591,
    "rank": 3
  },
  ...

Related Objects

Mentioned In: T124716: EPIC: Add Top read articles to the app

Event Timeline

• BGerstle-WMF created this task. Jan 19 2016, 7:57 PM

• BGerstle-WMF raised the priority of this task from to Needs Triage.

• BGerstle-WMF updated the task description.

• BGerstle-WMF added projects: Analytics, Web-Team-Backlog, Wikipedia-iOS-App-Backlog.

• BGerstle-WMF moved this task to Incoming on the Analytics board.

• BGerstle-WMF added subscribers: BGerstle-WMF, dr0ptp4kt, Milimetric.

Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 19 2016, 7:57 PM

• BGerstle-WMF moved this task from Needs Triage to Tracking on the Wikipedia-iOS-App-Backlog board. Jan 19 2016, 7:57 PM

• BGerstle-WMF triaged this task as High priority. Jan 19 2016, 8:01 PM

• BGerstle-WMF updated the task description.

• BGerstle-WMF set Security to None.

• BGerstle-WMF added subscribers: Jdlrobson, Mhurd.

For example, top.hatnote.com uses a meta query to filter out a site's main page, "-", and anything outside the main namespace:

From https://github.com/hatnote/top/blob/master/top/get_data.py#L54:

def is_article(title, wiki_info):
    '''\
    Is it an article, or some other sort of page? We'll want to filter out the
    search page (Special:Search in English, etc) and similar pages appearing
    inconveniently in the traffic report.
    '''
    skip = ['-'] + [wiki_info['mainpage']]
    prefixes = PREFIXES + wiki_info['namespaces']
    if title in skip:
        return False
    for prefix in prefixes:
        if title.startswith(prefix + ':'):
            return False
    return True
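As a rough illustration of how a client might apply this kind of check to the response shown in the description, here is a minimal, self-contained sketch. The `PREFIXES` value, the `wiki_info` contents, the `filter_top` helper, and the `Example_Article` entry are all assumptions made for the example; the real values come from top/get_data.py and the site's meta query. `is_article` is restated compactly only so the block runs on its own.

```python
# Hypothetical stand-ins: the real PREFIXES and wiki_info are defined elsewhere
# (top/get_data.py and a site-info meta query); these values are assumptions.
PREFIXES = ["Special", "File"]
wiki_info = {"mainpage": "Main_Page", "namespaces": ["Talk", "User", "Wikipedia"]}


def is_article(title, wiki_info):
    """Compact restatement of the check quoted above."""
    skip = ["-", wiki_info["mainpage"]]
    prefixes = PREFIXES + wiki_info["namespaces"]
    if title in skip:
        return False
    return not any(title.startswith(prefix + ":") for prefix in prefixes)


def filter_top(articles, wiki_info):
    """Drop non-articles, then renumber ranks so they stay contiguous."""
    kept = [a for a in articles if is_article(a["article"], wiki_info)]
    return [dict(a, rank=i) for i, a in enumerate(kept, start=1)]


# First three entries are from the task description; the last is made up.
sample = [
    {"article": "Main_Page", "views": 19257663, "rank": 1},
    {"article": "Special:Search", "views": 2144393, "rank": 2},
    {"article": "-", "views": 758591, "rank": 3},
    {"article": "Example_Article", "views": 1000, "rank": 4},  # hypothetical
]

filtered = filter_top(sample, wiki_info)
for entry in filtered:
    print(entry["rank"], entry["article"], entry["views"])
```

Note that the filtered list is shorter than the input, so a client that needs a fixed number of entries has to request more than it intends to show.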

Aye, I had to do this as well in https://gerrit.wikimedia.org/r/#/c/225485/1/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RealtimeTrendingPages.scala

Doing this kind of heuristic as part of the API call is possible; clients would then get fewer than 1000 articles, but I assume that's fine. But to truly cleanly report only Main Namespace articles, we need to access page table data from mediawiki databases in real-time. And right now the data crunched for the API is crunched by Hive in batches. So we're missing this bridge from mediawiki data to the Hadoop cluster. Building this bridge is a very high priority for us, but not an easy task. If someone sees a better way, I'm certainly open to it. Maybe I'm bundling too many problems together here and there's a vertical slice I'm missing.

Milimetric moved this task from Incoming to Event Platform on the Analytics board. Jan 21 2016, 6:32 PM

> we need to access page table data from mediawiki databases in real-time. And right now the data crunched for the API is crunched by Hive in batches. So we're missing this bridge from mediawiki data to the Hadoop cluster.

there's no way to denormalize that info when writing data to hadoop?

> there's no way to denormalize that info when writing data to hadoop?

Not right now, for two reasons:

1. Whatever we query for that data would have to deal with the full firehose (200k / second). A cache would help, of course, because that data doesn't change very much, but we'd essentially be doubling our traffic instead of just joining to data we already have on the back end. So joining seems like a better long-term solution.

2. page_id is not consistently coming across with all types of requests; we're working on improving coverage of that, though.
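To make the "join" idea and the page_id coverage problem concrete, here is a minimal in-memory sketch, not the actual Hive job. It assumes a lookup from page_id to namespace number (as the MediaWiki page table's page_id and page_namespace columns would provide) and treats namespace 0 as the main namespace. The function name, row shape, and sample rows are hypothetical.

```python
MAIN_NAMESPACE = 0  # MediaWiki's namespace number for articles


def top_articles(pageviews, page_table, limit=1000):
    """Keep rows whose page_id resolves to the main namespace.

    Rows with a missing or unknown page_id cannot be classified by the
    join, so they are returned separately instead of being guessed at.
    """
    kept, unresolved = [], []
    for row in pageviews:
        page_id = row.get("page_id")
        namespace = page_table.get(page_id) if page_id is not None else None
        if namespace is None:
            unresolved.append(row)
        elif namespace == MAIN_NAMESPACE:
            kept.append(row)
    kept.sort(key=lambda r: r["views"], reverse=True)
    return kept[:limit], unresolved


# Hypothetical sample data: page_id -> namespace, plus aggregated view counts.
page_table = {1: 0, 2: 1, 3: 0}
pageviews = [
    {"title": "A", "page_id": 1, "views": 10},
    {"title": "Special:Search", "page_id": None, "views": 50},
    {"title": "Talk:A", "page_id": 2, "views": 5},
    {"title": "B", "page_id": 3, "views": 20},
]

top, unresolved = top_articles(pageviews, page_table)
print([r["title"] for r in top])
print([r["title"] for r in unresolved])
```

The `unresolved` bucket is where incomplete page_id coverage shows up: the larger it is, the less a join alone can decide.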

Another thing that's not obvious is how clients show pageview data across time zones. We will need to talk more about the mechanics of when the pageview data updates and how that impacts UX across time zones.

• BGerstle-WMF mentioned this in T124716: EPIC: Add Top read articles to the app. Jan 29 2016, 1:38 PM

• Mholloway added a project: Mobile-Content-Service. May 25 2016, 1:36 PM

• Mholloway moved this task from Incoming to Tracking on the Mobile-Content-Service board.

• Mholloway subscribed.

• Jhernandez removed a project: Web-Team-Backlog. May 25 2016, 4:31 PM

Milimetric moved this task from Event Platform to Wikistats on the Analytics board. Oct 3 2016, 3:58 PM

• NHarateh_WMF added a project: Product-Infrastructure-Team-Backlog-Deprecated. Apr 25 2017, 12:24 PM

• NHarateh_WMF moved this task from Needs triage to Tracking on the Product-Infrastructure-Team-Backlog-Deprecated board. Apr 25 2017, 12:25 PM

• NHarateh_WMF moved this task from Tracking to Backlog on the Mobile-Content-Service board. Apr 25 2017, 4:31 PM

• NHarateh_WMF moved this task from Backlog to Incoming on the Mobile-Content-Service board. Apr 25 2017, 4:37 PM

To do this we need the page id; the computation of top pages (rather than the pageview API) can then decide whether it is pertinent to include a page. But, again, the page id is needed.

JAllemandou moved this task from Wikistats to Radar on the Analytics board. May 16 2017, 12:49 PM

@Nuria is this still a valid ticket?

Services consuming the raw pageview data provided by AQS can and do filter out non-articles. This is a solved problem.

We still want to improve our endpoints in the future, allowing users to get top articles in specific namespaces, categories, and wiki projects, but those ideas aren't on our priority list.

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM
