Upgrade daily/monthly aggregations of pageview dumps to new data files
Closed, Resolved (Public)

Description

At present, daily and monthly aggregates of raw page views are based on legacy data files [1] (as is stats.grok.se).
There is a new, expanded version of these dumps that covers all page views, including mobile and Zero. [2]

[1] https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-raw
[2] https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites

Migration of the aggregator script would be nice.

BTW, someday (2015?) [2] may be refurbished to draw its data from a new page view definition, or may become obsolete entirely. It remains to be seen whether Hadoop beats Perl for merging 24/31 flat files (and drawing daily/monthly aggregates directly from Hadoop seems costly).
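
For illustration, the merge step mentioned above (summing 24 hourly files into a daily total, or 31 daily files into a monthly one) could look roughly like the sketch below. This is not the aggregator script itself. The function name `merge_hourly_counts` is hypothetical, and the line layout `<project> <page_title> <views> <bytes>` is an assumption about the flat files; check it against [1] and [2] before relying on it.

```python
from collections import Counter
from typing import Iterable


def merge_hourly_counts(hourly_files: Iterable[Iterable[str]]) -> Counter:
    """Sum per-page view counts across several hourly files.

    Assumed line layout (an assumption, not confirmed by this task):
    "<project> <page_title> <views> <bytes>"
    """
    totals: Counter = Counter()
    for lines in hourly_files:
        for line in lines:
            fields = line.split()
            if len(fields) < 3 or not fields[2].isdigit():
                continue  # skip blank or malformed lines
            project, title, views = fields[0], fields[1], int(fields[2])
            totals[(project, title)] += views
    return totals


# Two tiny in-memory "hourly files" standing in for real dump files.
hour_00 = ["en Main_Page 10 12345", "de Hauptseite 3 4567"]
hour_01 = ["en Main_Page 7 8901", "en Perl 2 2345", "malformed"]

daily = merge_hourly_counts([hour_00, hour_01])
for (project, title), views in sorted(daily.items()):
    print(project, title, views)
```

The same function applies unchanged to the monthly case by passing the daily files instead of the hourly ones; real dump files would be opened (and decompressed if needed) and passed in as line iterators.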

Event Timeline

ezachte claimed this task.
ezachte raised the priority of this task to Needs Triage.
ezachte updated the task description.

Hm, unless it is very easy, I think you should hold off on making this change. pagecounts-all-sites uses the same pageview definition as pagecounts-raw, except that it includes mobile (and some other?) data.

Someday, there will be a brand new pagecounts dataset built from a 'canonical' Wikimedia pagecounts definition, and it may even be in a different format than this one. You will get more value from switching to this new dataset than to the pagecounts-all-sites one.

It should be easy to migrate aggregation scripts to pagecounts-all-sites (at least in theory), because the new files are deliberately backward compatible, almost at the expense of clarity.

In the longer term, when we overhaul the entire chain, we might find a good moment to drop the current confusing tagging scheme (which became much more confusing because of the desire to keep all existing tags compatible).

But the 2015 ETA for the ultimate revised data stream is still an open question. So a quick win would still seem a good first step.

Milimetric subscribed.

Closing this task in favor of the other work that we already finished, which was to create the dumps now hosted at: http://dumps.wikimedia.org/other/pageviews/

ezachte closed this task as Resolved.
ezachte set Security to None.

Actually this was done already some two weeks ago as a subtask of https://phabricator.wikimedia.org/T114379

@Milimetrics FYI the dumps use http://dumps.wikimedia.org/other/pageviews/
What makes them still useful is that they contain page views for all articles (with 5 or more views per month): monthly totals, while retaining hourly precision.

Sorry, Erik, I misunderstood. I certainly appreciate the added value that you mention.