
Backfill pageview data for March 2015 from sampled logs before transition to UDF-based reports as of April
Closed, Resolved · Public

Description

We had a meeting this morning with the people CC'ed on this task to discuss the handoff of pageview reports from Research (sampled logs + R implementation) to Analytics (unsampled logs + UDF).

Starting with April, daily and monthly data will be generated via the UDFs and include the following dimensions:

  1. project, e.g. enwiki
  2. language, e.g. en
  3. period, e.g. 2015-04-01
  4. access method (desktop site / mobile web)
  5. country (country_iso, country_name)
  6. is_spider
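As a rough illustration of the row shape these dimensions imply, here is a minimal Python sketch. The `PageviewRow` class, the `pageviews` count column, and the `monthly_rollup` helper are hypothetical names chosen for the example; nothing here reflects the actual UDF output or the staging DB schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PageviewRow:
    """One daily row; field names mirror the dimensions listed above."""
    project: str        # e.g. "enwiki"
    language: str       # e.g. "en"
    period: date        # e.g. date(2015, 4, 1)
    access_method: str  # "desktop site" or "mobile web" (label spelling assumed)
    country_iso: str
    country_name: str
    is_spider: bool
    pageviews: int      # hypothetical count column, not named in the task


def monthly_rollup(rows):
    """Sum daily rows into monthly totals, keeping every other dimension."""
    totals = defaultdict(int)
    for r in rows:
        key = (
            r.project,
            r.language,
            r.period.replace(day=1),  # first day of the month stands for the month
            r.access_method,
            r.country_iso,
            r.country_name,
            r.is_spider,
        )
        totals[key] += r.pageviews
    return dict(totals)
```

Keying the monthly total on the first day of the month keeps `period` a plain date in both the daily and monthly series, which is only one of several reasonable conventions.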

We're handing off the generation of this data to Analytics Eng, and the team will set up systems to allow customers to access this data and compute arbitrary aggregations. As part of this transition, and to ensure we have a complete data series based on the new pageview definition to cover calendar Q1-2015, we would like to request backfilling the data on the staging DB for the entire month of March 2015.

@Eloquence, can you approve this for Oliver to help with this task?

Other minutes from the meeting are here: http://etherpad.wikimedia.org/p/PVTransition

Event Timeline

DarTar raised the priority of this task from to Needs Triage.
DarTar updated the task description.

@DarTar, if you and Oliver can work out a plan that works for both of you, I am fine with it.

We're handing off the generation of this data to Analytics Eng, and the team will set up systems to allow customers to access this data and compute arbitrary aggregations.

Please note that "compute arbitrary aggregations" is not part of the initial work on this feature, which includes only daily/weekly pageview aggregations (like the ones we currently calculate for the "legacy" pageviews).

@Nuria the scope of this task is only to parse the sampled logs for the month of March using Oliver's R code and put it into the Staging DB (where all the source data for Pentaho lives).

The "compute" here refers to an analyst or researcher (on behalf of the COO or FR or comms) writing a query to the Staging DB (Analytics-Store) with some sort of aggregation. For example, someone needs to query for the number of pageviews for the quarterly report.
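To make that kind of query concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for the Staging DB. The `pageviews` table, its column names, and the sample rows are all assumptions made up for illustration; the real schema and the real numbers are not given in this task.

```python
import sqlite3

# Hypothetical table: columns follow the dimensions listed in the task
# description, and `pageviews` is an assumed count column.
SCHEMA = """
CREATE TABLE pageviews (
    project TEXT, language TEXT, period TEXT, access_method TEXT,
    country_iso TEXT, country_name TEXT, is_spider INTEGER, pageviews INTEGER
)
"""


def quarterly_total(conn, start, end):
    """Total non-spider pageviews per project for start <= period < end."""
    cur = conn.execute(
        """
        SELECT project, SUM(pageviews)
        FROM pageviews
        WHERE period >= ? AND period < ? AND is_spider = 0
        GROUP BY project
        ORDER BY project
        """,
        (start, end),
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up sample rows, only to exercise the query.
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("enwiki", "en", "2015-03-01", "desktop site", "US", "United States", 0, 100),
        ("enwiki", "en", "2015-03-02", "mobile web", "US", "United States", 0, 50),
        ("enwiki", "en", "2015-03-02", "desktop site", "US", "United States", 1, 999),
        ("dewiki", "de", "2015-04-01", "desktop site", "DE", "Germany", 0, 7),
    ],
)
q1_totals = quarterly_total(conn, "2015-01-01", "2015-04-01")
print(q1_totals)
```

Storing `period` as an ISO date string lets the range filter work with plain string comparison; a production database would more likely use a native date column.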

On a side note:
I have told @JAllemandou that it's a higher priority to produce fine-grained pageview data (T96314) with the dimensions enumerated above and to make it accessible via a SQL interface. Hence Andrew and Joseph are looking into Impala and generating this data using the cluster starting in April.

There isn't as much pressure to get daily/weekly pageviews for Vital Signs, and I believe the infrastructure we're building now will satisfy later use cases, including Vital Signs.

Note that when the task says "help with" it means "do all of".