Back-fill pageviews data for dumps.wikimedia.org to May 2015
Closed, Resolved · Public · 13 Estimated Story Points

Description

In an upcoming message to analytics-l, we'll propose cleaning up the pageview datasets currently listed on http://dumps.wikimedia.org/other/. To do this, it would be best if we had as many dump files for the new data as possible. The puppet change to reorganize them is being worked on here: https://gerrit.wikimedia.org/r/#/c/269696/

Event Timeline

Milimetric assigned this task to elukey.
Milimetric raised the priority of this task to Medium.
Milimetric updated the task description.
Milimetric added a project: Analytics.
Milimetric added a subscriber: Milimetric.
Restricted Application added a subscriber: Aklapper. · Feb 10 2016, 3:41 PM
elukey set Security to None.

Adding more info after a chat with Dan.

The dumps are not yet listed in the /other folder; for now they are only reachable directly at http://dumps.wikimedia.org/other/pageviews/

The 2015/* directories show data up to May 2015, but some of them contain only projectviews data and are missing the pageviews files.

Creating an ad hoc oozie workflow starting from https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/workflow.xml#L91 might be a good first step.

Documentation: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie#Running_a_real_oozie_example
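To scope the backfill, it may help to list which hours have no pageviews file. The snippet below is only a rough sketch: the file-name pattern (`pageviews-YYYYMMDD-HH0000.gz`, `projectviews-YYYYMMDD-HH0000`) is an assumption about how the hourly dumps are named, and the function names are hypothetical. It works on a plain list of file names, so the directory listing has to be fetched separately.

```python
import re
from datetime import datetime, timedelta

# Assumed naming scheme for the hourly dump files; adjust if the real names differ.
NAME_RE = re.compile(r"^(pageviews|projectviews)-(\d{8})-(\d{6})(?:\.gz)?$")


def hours_by_kind(file_names):
    """Group the hourly timestamps found in file_names by dataset kind."""
    found = {"pageviews": set(), "projectviews": set()}
    for name in file_names:
        match = NAME_RE.match(name)
        if match:
            kind, day, clock = match.groups()
            found[kind].add(datetime.strptime(day + clock, "%Y%m%d%H%M%S"))
    return found


def missing_pageview_hours(file_names, start, end):
    """Return the hours in [start, end) that have no pageviews file, in order."""
    present = hours_by_kind(file_names)["pageviews"]
    missing = []
    hour = start
    while hour < end:
        if hour not in present:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing


if __name__ == "__main__":
    # Hypothetical listing: the 02:00 hour has projectviews but no pageviews.
    listing = [
        "pageviews-20150501-010000.gz",
        "projectviews-20150501-010000",
        "projectviews-20150501-020000",
    ]
    gaps = missing_pageview_hours(
        listing, datetime(2015, 5, 1, 1), datetime(2015, 5, 1, 3)
    )
    for gap in gaps:
        print(gap.strftime("%Y-%m-%d %H:00"))
```

The resulting list of hours could then drive the start and end times of the ad hoc oozie job, so that only the gaps are regenerated.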

This [WIP] patch [1] is the one that will add the pageviews dataset to dumps.wikimedia.org and bring some general sanity to the analytics data presented there.

[1] https://gerrit.wikimedia.org/r/#/c/269696/

Milimetric moved this task from Next Up to Paused on the Analytics-Kanban board. · Feb 16 2016, 4:45 PM
Milimetric moved this task from Paused to In Progress on the Analytics-Kanban board.

Looks good, @elukey. I saw the output files and they're what I'd expect. I think you can go ahead and start the backfill.

@elukey @Milimetric: Sounds good, but let's wait for the encoding-issue backfilling to finish :)

Milimetric set the point value for this task to 13. · Mar 3 2016, 5:18 PM
Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. · Mar 11 2016, 4:38 PM
elukey moved this task from Paused to Done on the Analytics-Kanban board. · Mar 17 2016, 2:11 PM
elukey closed this task as Resolved. · Mar 18 2016, 7:40 AM

@elukey: I forgot to mention, the process is that @Nuria is the only one who closes tasks as resolved. That way she can "accept" that they're done.