
Track overall traffic, without any filtering, broken down into major categories, for internal use.
Closed, Resolved, Public

Description

We now have several Hive tables with aggregated, sanitized data, such as pageview_hourly, and more will no doubt follow. We also need a high-level overview of overall traffic in which every request we serve is accounted for. Data could be tagged and broken down (e.g. by MIME type, or by whether the request is a pageview), but not filtered in any way. A 1:100 sampled Hive query would suffice.

This will let us monitor whether the filters we use for other tables are losing touch with evolving reality and rejecting too much.
It can also help us track the amount of suspicious traffic (botnets etc.).

A simple report (for internal use) could tell us whether the percentage of sanitized pageviews is changing.
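As a rough illustration of the idea, here is a minimal Python sketch of a 1:100 sample that is tagged but never filtered, followed by the pageview-share figure such a report might show. It is not the actual Hive job: the record fields `mime_type` and `is_pageview` and all function names are assumptions made for this example, and the real wmf.webrequest schema may differ.

```python
import random
from collections import Counter

SAMPLE_RATE = 100  # 1:100 sampling, as suggested in the description


def sample_requests(requests, rate=SAMPLE_RATE, seed=0):
    """Keep roughly one request in `rate`; apply no other filtering."""
    rng = random.Random(seed)
    return [r for r in requests if rng.randrange(rate) == 0]


def tally_by_category(sampled, rate=SAMPLE_RATE):
    """Estimate total request counts per (mime_type, is_pageview) tag.

    Field names are hypothetical; each sampled row stands for `rate` requests.
    """
    counts = Counter((r["mime_type"], r["is_pageview"]) for r in sampled)
    return {tag: n * rate for tag, n in counts.items()}


def pageview_share(tally):
    """Percentage of all (estimated) requests that are tagged as pageviews."""
    total = sum(tally.values())
    if total == 0:
        return 0.0
    views = sum(n for (_, is_pv), n in tally.items() if is_pv)
    return 100.0 * views / total
```

Comparing `pageview_share` from one period to the next would show whether the sanitized share of traffic is drifting, which is the signal the report is meant to surface.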

Event Timeline

ezachte assigned this task to Nuria.
ezachte raised the priority of this task from to Needs Triage.
ezachte updated the task description.
ezachte added a project: Analytics.

+1, but I think this should go out publicly and be graphed alongside the pageview stream. That way discrepancies will be obvious and we can use everyone's eyes to keep watch.

We're going to try to accomplish this by loading wmf.webrequest data into Druid without the page_title dimension. We'll keep it in the backlog to remind ourselves.

Milimetric triaged this task as Medium priority. Mar 7 2016, 5:23 PM
Milimetric moved this task from Incoming to Event Platform on the Analytics board.