Add pageview stats to the action API
Closed, Resolved · Public

Description

Include Pageview API data in the action API. Use cases:

  • allow convenient access by the various tools (PWB, mw.Api etc) which already implement an interface to action API
  • for products which already need to fetch page information from the action API, allow freeriding on that request to get pageview data
  • (probably) get pageview information for a list of pages in a single request
  • (maybe) use pageview information to filter/sort lists, e.g. most viewed pages in a given category

This will probably mean three new API modules:

  • a meta module for returning site totals (the aggregate endpoint of the Pageview API)
  • a list module for returning most viewed pages (the top endpoint), possibly with some filtering options (namespace filtering seems useful; filtering by category or user contributions/watchlist would be useful but probably not feasible)
  • a prop module for getting per-page view data (the per-article endpoint)

Per earlier email discussion, we probably want to avoid the complexity of arbitrary date ranges and just return some commonly used ranges (e.g. daily sums for the last 60 UTC days, with a 2-day offset to avoid incomplete results). We probably also want to disallow useragent/access-mode filtering.
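
A rough sketch of what such a fixed window could look like. The function name, the `Ymd` output format, and the reading of "2-day offset" as "the newest day returned is today minus two, in UTC" are all assumptions made for illustration, not something settled in this task.

```php
<?php
/**
 * Hypothetical helper: list the UTC days in a fixed reporting window.
 *
 * @param DateTimeImmutable $now Current time, in any time zone
 * @param int $days Number of days to return
 * @param int $offset Days to skip before today (assumed to hold incomplete data)
 * @return string[] Days in Ymd format, oldest first
 */
function recentUtcDays( DateTimeImmutable $now, int $days = 60, int $offset = 2 ): array {
    // Normalize to UTC midnight so the caller's time zone does not matter.
    $today = $now->setTimezone( new DateTimeZone( 'UTC' ) )->setTime( 0, 0 );
    // Newest day that is assumed to be complete.
    $end = $today->sub( new DateInterval( "P{$offset}D" ) );

    $result = [];
    for ( $i = $days - 1; $i >= 0; $i-- ) {
        $result[] = $end->sub( new DateInterval( "P{$i}D" ) )->format( 'Ymd' );
    }
    return $result;
}
```

Because the window is derived only from the current UTC date, every request made on the same UTC day asks for the same range, which should make the results easy to cache.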

This is a Q2 goal for Reading Infrastructure.

Tgr created this task. Sep 6 2016, 10:08 PM
Restricted Application added a subscriber: Aklapper. Sep 6 2016, 10:08 PM
Nuria moved this task from Incoming to Radar on the Analytics board. Sep 8 2016, 4:46 PM
Tgr edited the task description. Sep 13 2016, 12:04 AM

Would this be a proxy to the pageview API or would we store some of the data in the MW database?

Tgr added a comment. Sep 13 2016, 12:51 AM

Some open questions:

  • The pageview REST API deals with URLs; the action API deals with pages. There are many ways in which the two do not align (page moves, redirects, views of a red link before the article was created, normalization issues). How to best bridge that gap? We probably want to go for the easy solution here and whenever the action API is asked for pageview data for a page, just request REST API data for the current canonical title of the page. Conversely, when generating a list of top viewed articles, just accept the fact that the list might include duplicate entries if title normalization differs between the REST API and the action API or if the user specified redirects=1.
  • How to deal with batch queries? Action API clients can ask for up to 5000 pages in a single request. The REST API currently only returns information about a single page per request. Options include:
    • Access the data backend of the REST API (Cassandra) directly
    • Set up some kind of daily data export to the wiki DB, or maybe to memcached (this seems infeasible given the size of the data)
    • Fetch data on demand (probably with some kind of caching) and extend the REST API to allow querying multiple pages at the same time (does not fit well into the REST paradigm, although not completely unprecedented: ORES already does that)
    • Fetch data on demand (probably with some kind of caching) for the first X pages, return no data for the rest (much like $wgAPIMaxUncachedDiffs)
  • How to do filtering in the list module? Things like "top viewed pages in Category:X" would be quite useful, but proper filtering would require having the data in the DB. Then again, we are (currently) limited to 1000 results anyway, so unless we can change that, filtering is unlikely to be useful (and if it does turn out to be useful, we can just do it in PHP).
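
To make the last batching option above more concrete ("fetch on demand for the first X pages, return no data for the rest"), here is a minimal sketch. The function name, the callable standing in for one REST request, and the plain array standing in for a real cache are all hypothetical.

```php
<?php
/**
 * Hypothetical sketch: look up view data for many titles, but only call the
 * expensive per-page fetcher for the first $maxUncached cache misses.
 *
 * @param string[] $titles
 * @param callable $fetch function ( string $title ): array; stands in for one REST request
 * @param array &$cache title => data; stands in for memcached or similar
 * @param int $maxUncached Maximum number of uncached lookups per call
 * @return array title => data, or null where the lookup was skipped
 */
function getViewsCapped( array $titles, callable $fetch, array &$cache, int $maxUncached ): array {
    $result = [];
    $fetched = 0;
    foreach ( $titles as $title ) {
        if ( array_key_exists( $title, $cache ) ) {
            // Cache hits are free and do not count against the cap.
            $result[$title] = $cache[$title];
        } elseif ( $fetched < $maxUncached ) {
            $cache[$title] = $fetch( $title );
            $result[$title] = $cache[$title];
            $fetched++;
        } else {
            // Over the cap: report "no data" rather than making another request.
            $result[$title] = null;
        }
    }
    return $result;
}
```

A client that gets null for some pages could repeat the request later, by which time earlier lookups would already be cached.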
Tgr added a comment. Sep 13 2016, 12:52 AM

@Legoktm I hope that non-answers your question :)

Tgr edited the task description. Sep 13 2016, 12:55 AM
Tgr added a comment. Sep 13 2016, 1:20 AM

Also, if this is written as an extension (which it probably will be), how (and to what extent) should we avoid making it Wikimedia-specific? MediaWiki core has a pageview counting feature (which no one uses [citation needed]), and other installations might have their own way of accessing pageview data. The low-hanging fruit here seems to be to define a service for getting the data and use dependency injection to make it replaceable.
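
A minimal sketch of the dependency injection idea, assuming a deliberately tiny contract. `ViewCountSource`, `FixedViewCountSource` and `PageInfoFormatter` are made-up names for illustration only.

```php
<?php
/** Hypothetical, stripped-down service contract. */
interface ViewCountSource {
    /** Returns the view count for a title, or null if there is no data. */
    public function getViews( string $title ): ?int;
}

/** Stand-in backend; a real one might proxy a remote API or read a local table. */
class FixedViewCountSource implements ViewCountSource {
    private $counts;

    public function __construct( array $counts ) {
        $this->counts = $counts;
    }

    public function getViews( string $title ): ?int {
        return $this->counts[$title] ?? null;
    }
}

/** Consumer that depends only on the interface, so the backend is swappable. */
class PageInfoFormatter {
    private $source;

    public function __construct( ViewCountSource $source ) {
        $this->source = $source;
    }

    public function describe( string $title ): string {
        $views = $this->source->getViews( $title );
        return $views === null ? "$title: no data" : "$title: $views views";
    }
}
```

Another installation would then only need to supply its own `ViewCountSource` implementation; the consumer code stays unchanged.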

I was told that the pageviews API was Wikimedia-specific, so I created the WikimediaPageViewInfo extension for displaying its counts on action=info (T43327); you're more than welcome to put it in there. Or write something general that E:HitCounters can also interface with?

Tgr added a comment. Sep 13 2016, 1:33 AM

Should we also expose unique devices data? I guess there is no reason not to; it's per-site only so most of the complexity with pageview data does not apply.

Anomie added a subscriber: Anomie. Sep 13 2016, 2:48 PM

MediaWiki core has a pageview counting feature (which no one uses [citation needed]),

Wasn't that removed in rMW90d90dad6e4d: Remove hitcounters and associated code and rMW5f8edb2c0a24: Drop ss_total_views and page_counter fields from MediaWiki?

and other installations might have their own way of accessing pageview data.

Two ways to go there: either make an interface with one implementation, or just let those other installations write their own extension.

Tgr added a comment. Edited Sep 13 2016, 5:46 PM

You are right, it was migrated to HitCounters, which actually makes it easier to check how widely it is used (~100 wikis apparently). Wikis using Piwik/OWA/GA probably also have a way to fetch this data. So an interface would make sense, if it can be made sufficiently flexible without too much work.

A first attempt:

interface PageViewService {
    /** Page view count */
    const METRIC_VIEW = 'view';
    /** Unique visitors (devices) for some period, typically last 30 days */
    const METRIC_UNIQUE = 'unique';

    /** Return data for a given article */
    const SCOPE_ARTICLE = 'article';
    /** Return a list of the top articles */
    const SCOPE_TOP = 'top';
    /** Return data for the whole site */
    const SCOPE_SITE = 'site';

    /**
     * Returns an array of recent page analytics data, in the format
     *   title => [ date => count, ... ]
     * where date is in TS_MW format. Which days to return and detailed date semantics
     * (such as time zone) are left to the implementer.
     * @param Title[] $titles
     * @param string $metric One of the METRIC_* constants.
     * @return array
     */
    public function getPageData( array $titles, $metric = self::METRIC_VIEW );

    /**
     * Returns an array of recent analytics totals for the whole site, in the format
     *   date => count
     * where date is in TS_MW format. Which days to return and detailed date semantics
     * (such as time zone) are left to the implementer.
     * @param string $metric One of the METRIC_* constants.
     * @return array
     */
    public function getSiteData( $metric = self::METRIC_VIEW );

    /**
     * Returns a list of the top pages according to some metric, sorted in descending order
     * by that metric. The format is the same as getPageData(), except data for some or
     * all titles can be replaced with null if it is costly or impossible to get.
     * @param string $metric One of the METRIC_* constants.
     * @return array
     */
    public function getTopPages( $metric = self::METRIC_VIEW );

    /**
     * Whether the service can provide data for the given metric/scope combination.
     * @param string $metric One of the METRIC_* constants.
     * @param string $scope One of the SCOPE_* constants.
     * @return boolean
     */
    public function supports( $metric, $scope );

    /**
     * Returns the number of times it is acceptable to call getPageData in a single request.
     * Jobs and non-performance-sensitive use cases might ignore this.
     * @param string $metric One of the METRIC_* constants.
     * @return int|false False for unlimited.
     */
    public function getPageLimit( $metric );

    /**
     * Returns the length of time for which it is acceptable to cache the results.
     * Typically this would be the end of the current day in whatever timezone the data is in.
     * @param string $metric One of the METRIC_* constants.
     * @param string $scope One of the SCOPE_* constants.
     * @return int Time in seconds
     */
    public function getCacheExpiry( $metric, $scope );
}
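
For getCacheExpiry(), the "end of the current day" case could be computed along these lines. This is only a sketch that assumes the data is bucketed by UTC day; the helper name is made up.

```php
<?php
/**
 * Hypothetical helper: seconds from $now until the next UTC midnight,
 * usable as a cache TTL for data that changes once per UTC day.
 */
function secondsUntilUtcMidnight( DateTimeImmutable $now ): int {
    $utcNow = $now->setTimezone( new DateTimeZone( 'UTC' ) );
    // Start of the current UTC day plus one day is the next UTC midnight.
    $nextMidnight = $utcNow->setTime( 0, 0 )->add( new DateInterval( 'P1D' ) );
    return $nextMidnight->getTimestamp() - $utcNow->getTimestamp();
}
```

An implementation whose data arrives later than midnight would presumably want a different cutoff; the interface leaves that to the implementer.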

public function getPageData( Title $title, $metric = self::METRIC_VIEW );

Ideally there would be a proper batched function in the interface instead of repeated calls to this. The lack of batching in the pageviews API seems like an unfortunate side effect of RESTy thinking rather than a limitation inherent in the problem space.
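
With a batched signature, a caller that has more titles than the backend accepts at once could split them up before calling the service. A small sketch, assuming the limit is either an int (maximum titles per call) or false for unlimited; the helper name and that reading of the limit are assumptions.

```php
<?php
/**
 * Hypothetical helper: split a title list into batches no larger than $limit.
 *
 * @param array $titles
 * @param int|false $limit Maximum titles per batch, or false for unlimited
 * @return array[] List of batches, each a list of titles
 */
function splitIntoBatches( array $titles, $limit ): array {
    if ( $titles === [] ) {
        return [];
    }
    if ( $limit === false ) {
        // No limit: everything goes into a single batch.
        return [ array_values( $titles ) ];
    }
    // Guard against a zero or negative limit, which array_chunk() rejects.
    return array_chunk( $titles, max( 1, $limit ) );
}
```

Each batch could then be passed to one batched call, instead of making one call per title.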

Tgr added a comment. Sep 13 2016, 7:04 PM

Good point, updated.

@Tgr, yes, I would recommend getting the unique devices count with the 60 day approach mentioned in the description and email thread.

Regarding the description's comment about defaulting to the all-access parameter for the access field, I'm inclined to agree.

@JMinor @Dbrant @ovasileva @JKatzWMF @Nirzar @RHo (CC @bearND, @Fjalapeno) when you think about subfiltering modes for pageview information, does the specification above seem sufficient?

As you likely recall, the pageview and uniques counts are available as per the specs at https://wikimedia.org/api/rest_v1/. Would it be safe to say that you might be interested to have pageview data embedded in the response on a per-title basis when API.php calls are made one title at a time? What about when calls are made with multiple titles (or a generator returns multiple items)?

This comment was removed by Antigng.
schana added a subscriber: schana. Sep 16 2016, 10:31 AM
elukey added a subscriber: elukey. Sep 16 2016, 3:46 PM
Nuria added a subscriber: Nuria. Edited Sep 19 2016, 6:16 PM

@Tgr: summing up our conversation.

Adding batching to the API at the AQS level can be done, but there is homework to do before doing so. Besides the obvious things, such as coming up with a sound design scheme for batching calls, there are others: 1) studying the caching scheme for those calls and 2) load testing to find out thresholds.

Tests cannot be done until we have our new cluster in service (end of September), so if you want, we can have a small meeting to talk about the best design options for the batching endpoints.

On your end you need to provide us with approximate request ratios. Whether those ratios are sustainable depends mostly on the caching scheme.

Change 317452 had a related patch set uploaded (by Gergő Tisza):
[WIP] Add API endpoints

https://gerrit.wikimedia.org/r/317452

Change 320326 had a related patch set uploaded (by Gergő Tisza):
[WIP] Add API endpoints

https://gerrit.wikimedia.org/r/320326

Change 317452 abandoned by Gergő Tisza:
[WIP] Add API endpoints

Reason:
Moved to https://gerrit.wikimedia.org/r/#/c/320326/

https://gerrit.wikimedia.org/r/317452

Change 320326 merged by jenkins-bot:
Add API endpoints

https://gerrit.wikimedia.org/r/320326

Tgr closed this task as "Resolved". Feb 17 2017, 8:18 PM
Tgr claimed this task.