
Make a final run of the sampled log files, using the newest definition
Closed, Resolved (Public)

Description

Once the new pageviews definition has been QA'd, make one final run over the sampled logs, producing output with the following columns:

YYYYMMDDHH -- country_iso -- country_name -- project -- access_method -- is_spider -- is_automata -- referer_class -- pageviews

This will allow us to obliterate the sampled logs and also give us the data from January 2015, which the non-sampled logs no longer have.
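As a rough illustration of the target shape, the roll-up from per-request records to the columns above might look like the sketch below. The record layout (a dict with a `timestamp` plus the dimension keys), the function names, and the `sample_rate` parameter are all assumptions made for this example; this is not the code used for the actual run.

```python
from collections import Counter
from datetime import datetime

# Dimension columns, in the order given in the task description.
DIMENSIONS = (
    "country_iso", "country_name", "project", "access_method",
    "is_spider", "is_automata", "referer_class",
)


def aggregate_pageviews(records, sample_rate=1):
    """Roll per-request records up to one count per hour and dimension combination.

    `records` is an iterable of dicts holding a `timestamp` (datetime) plus the
    DIMENSIONS keys; this record shape is an assumption for illustration.
    `sample_rate` scales each sampled request up to an estimated total.
    """
    counts = Counter()
    for rec in records:
        hour = rec["timestamp"].strftime("%Y%m%d%H")  # YYYYMMDDHH bucket
        key = (hour,) + tuple(rec[d] for d in DIMENSIONS)
        counts[key] += sample_rate
    return counts


def to_rows(counts):
    """Flatten the counter into sorted rows that end with the pageviews column."""
    return [key + (views,) for key, views in sorted(counts.items())]
```

Each row then carries the hour, the seven dimensions, and the scaled pageview count, in the same order as the column list above.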

Event Timeline

Ironholds claimed this task.
Ironholds set the priority of this task to Needs Triage.
Ironholds updated the task description.
Ironholds subscribed.
Ironholds triaged this task as Medium priority. Mar 16 2015, 8:57 PM

WP Zero is requesting that the isZero dimension be added to the data. The data in Pentaho has been tremendously useful to them, and they want it for preparing the next Quarterly Review.

There are known issues with the data flagged as isZero:

  • it only tracks requests that have zero in the X-Analytics field or that go to zero.wikipedia.org
  • this covers only a tiny fraction of all Zero traffic
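For reference, the rule described in the first bullet could be sketched roughly as follows. This is a guess at the logic, not the flagging code itself: the semicolon-separated `key=value` parsing of the X-Analytics field, the treatment of subdomains of zero.wikipedia.org as matches, and both function names are assumptions.

```python
def parse_x_analytics(header):
    """Parse an X-Analytics value, assumed to look like 'key=value;key=value'."""
    pairs = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def is_zero(host, x_analytics):
    """Flag a request as Zero if it hits a zero.wikipedia.org host
    or carries a `zero` key in its X-Analytics field."""
    host = host.lower()
    # Assumption: language subdomains such as en.zero.wikipedia.org count too.
    if host == "zero.wikipedia.org" or host.endswith(".zero.wikipedia.org"):
        return True
    return "zero" in parse_x_analytics(x_analytics)
```

Any Zero request that arrives on a regular host without the tag is missed by a rule like this, which would be consistent with the undercount noted in the second bullet.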

Well, the first is the cause of the second. @Eloquence is this a desired element of the dataset?

This task is purely for maintenance of a legacy dataset that will soon be replaced with Hadoop-generated data. So I would prefer to avoid adding new requirements, unless there's an urgent request that we can only answer in this manner. @kevinator what's your take, based on your understanding of the request?

I want to clarify the request: it is only to keep the isZero dimension that was already in the previous versions of the data. I am not asking to fix the issue with it. I just wanted to highlight the issue to the Zero team at the same time.

The Zero team will comment here on their need to keep this dimension in Oliver's final data cube.

Update: after iteration, testing and debugging, a full run is under way. Will perform a big QA before anything happens, though.

Sorry this took so long; dealing with process failures around actually making the run took as much energy as making the run.

Attached is a high-level graphic of the counts by access method from each version of the cube, from a period that both runs cover (namely 2014). It shows no substantial variation, which leads me to believe the data is sensible. The data currently lives in the pageviews05 and pentahoviews05 tables in the "staging" analytics-store database; I'll put together the Pentaho schema this morning.
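A numeric version of the same sanity check could be sketched as below: given total counts per access method from each cube version over the shared period, compute the relative change for each method. The function name and the dict-based inputs are hypothetical; the comparison above was done graphically, and no real figures are reproduced here.

```python
def relative_differences(old_counts, new_counts):
    """Relative change per access method between two cube versions.

    Both arguments map access method -> total pageviews over the same period.
    Returns None for a method with a zero (or missing) count in the old cube,
    since a relative change is undefined there.
    """
    diffs = {}
    for method in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(method, 0)
        new = new_counts.get(method, 0)
        diffs[method] = None if old == 0 else (new - old) / old
    return diffs
```

Values close to zero for every method would match the "no substantial variation" reading of the graphic.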

Thanks, @Ironholds!

@ellery just a ping to make sure you've seen this, related to Megan's request.

Done; now in Pentaho. Code attached for future work.

DarTar updated the task description.

Closed. Followed by T96169.