Create white list for pageview data {hawk} [8 pts]
Closed, Resolved · Public

Description

Regular expressions for:

  • project + project class
  • access method
  • agent_type
  • continent
    • country_code
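
One way such a per-field whitelist could look is sketched below. This is only an illustration: the task does not give the expressions themselves, so every pattern, the field name `access_method`, and the helper `invalid_fields` are assumptions, not the code that was eventually merged.

```python
import re

# Hypothetical whitelist patterns, one per field listed in the description.
# None of these expressions come from the task; they only show the shape.
WHITELIST = {
    "project": re.compile(r"[a-z]{2,3}(-[a-z]+)*\.(wikipedia|wiktionary|wikibooks)"),
    "access_method": re.compile(r"desktop|mobile web|mobile app"),
    "agent_type": re.compile(r"user|spider"),
    "continent": re.compile(
        r"Africa|Antarctica|Asia|Europe|North America|Oceania|South America"
    ),
    "country_code": re.compile(r"[A-Z]{2}"),
}


def invalid_fields(row):
    """Return the names of the fields whose value does not fully match its pattern."""
    return [
        field
        for field, pattern in WHITELIST.items()
        if not pattern.fullmatch(str(row.get(field, "")))
    ]
```

A row whose project is not covered by the project pattern (for example an outreach project) would be reported with `project` in the returned list, while the other fields pass.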

Event Timeline

JAllemandou raised the priority of this task to High.
JAllemandou updated the task description.
JAllemandou added a project: Analytics-Backlog.
JAllemandou added a subscriber: JAllemandou.
Restricted Application added a subscriber: Aklapper. · Aug 24 2015, 4:43 PM
Nuria set Security to None. · Sep 1 2015, 7:25 PM
Nuria added a subscriber: Nuria.
Nuria claimed this task. · Sep 2 2015, 2:02 PM
Nuria added a subscriber: Ironholds. · Edited · Sep 2 2015, 10:50 PM

Describing in a bit more detail what this ticket is about:

In the wmf database in Hive there are two tables that store pageviews: webrequest and pageview_hourly.
The webrequest table stores processed requests that may or may not be pageviews;
a boolean field marks the ones that are. The pageview_hourly table stores only pageviews.

The data in the pageview_hourly table should match the definition documented here:
https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters

There is also a log, on
https://meta.wikimedia.org/wiki/Research:Page_view, of changes to the
pageview definition.

Recently we (well... @Ironholds) noticed that the pageview_hourly table included pageviews from the outreach domain, which it shouldn't have.

This task is about finding system/config/code changes that prevent that from occurring again.
The regular expressions that assess what a pageview is could be updated.
But we could also have a config file that stores the domains we accept (en.wikipedia.org, commons.wikimedia.org...) and
check whether every request we tag as a pageview belongs to one of those domains.
Perhaps the "quality assurance" of the data can be a process
that is run on the pageview_hourly table after records have been moved there.
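
A minimal sketch of that kind of post-hoc check, assuming the accepted domains have already been loaded from the config file into a set. The names `ACCEPTED_DOMAINS` and `unexpected_domains`, the two example domains, and the `(domain, view_count)` row shape are all hypothetical and only illustrate the idea:

```python
from collections import Counter

# Hypothetical stand-in for the contents of the whitelist config file.
ACCEPTED_DOMAINS = frozenset({"en.wikipedia.org", "commons.wikimedia.org"})


def unexpected_domains(rows, accepted=ACCEPTED_DOMAINS):
    """Sum view counts per domain for rows whose domain is not whitelisted.

    `rows` is an iterable of (domain, view_count) pairs, an assumed
    simplification of records already moved into pageview_hourly.
    """
    counts = Counter()
    for domain, view_count in rows:
        if domain not in accepted:
            counts[domain] += view_count
    return dict(counts)
```

Running this over a batch of rows yields a mapping from each non-whitelisted domain to its total views, which is the sort of report a quality process could alert on without changing the pageview definition itself.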

We are not changing the pageview definition. We know it lets through some requests that might not be pageviews, but those will be caught afterwards by a quality (guard) process. Results from the quality process might affect the pageview definition code at a later time.

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. · Oct 13 2015, 6:14 PM

Holder tables for authorized and non-authorized values are here: https://gerrit.wikimedia.org/r/#/c/240099/5

Change 246118 had a related patch set uploaded (by Nuria):
[WIP] Pageview Hourly quality check whitelist

https://gerrit.wikimedia.org/r/246118

Change 246118 abandoned by Nuria:
[WIP] Pageview Hourly quality check whitelist

Reason:
Changes on another patch.

https://gerrit.wikimedia.org/r/246118

Nuria closed this task as Resolved. · Oct 28 2015, 2:46 PM