Creation of canonical pageview dumps for users to download
[The following added by @fdans]
**Deliverable**
The idea with this project is to replace Pagecounts-EZ with a dump that:
- Spans the same time range (2011 to the present).
  - But we should probably include @CristianCantoro's data from 2008 to 2011 (T188041). This alone would add huge value to the dataset and more than justify this project.
- Contains hourly pageview data for all Wikimedia sites:
  - Do we separate app traffic? Right now EZ includes it in mobile.
  - Do we keep reporting only user traffic? Or do we add bots?
- Uses correct, standard wiki identifiers (e.g. `de.wikisource`) as opposed to WebstatsCollector wiki codes (`de.s`, `es.z`, etc.); see the sketch after this list.
- Doesn't have its data skewed by one hour (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pagecounts-ez#One_hour_skewing_issue).
- Is generated as part of the Pageview Hourly coordinator in Hadoop, as opposed to a script running as a cron job.
- Probably has no DIY compression. I'm sure bzip2's compression makes shortening highly repeated values like `en.wikipedia` unnecessary, but we can test.
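For illustration, something like the following could handle the identifier translation. The suffix table here is a guess based on the WebstatsCollector abbreviations documented on Wikitech and would need to be verified before use:

```
# A minimal sketch, not the actual implementation: the suffix table
# below is my reading of the WebstatsCollector abbreviations and
# should be double-checked against the Wikitech docs.
WSC_SUFFIXES = {
    "z": "wikipedia",   # es.z -> es.wikipedia
    "b": "wikibooks",
    "d": "wiktionary",
    "n": "wikinews",
    "q": "wikiquote",
    "s": "wikisource",  # de.s -> de.wikisource
    "v": "wikiversity",
    "m": "wikimedia",
}

def canonical_wiki(code: str) -> str:
    """Translate a WebstatsCollector code like 'de.s' into 'de.wikisource'."""
    lang, _, suffix = code.partition(".")
    return f"{lang}.{WSC_SUFFIXES.get(suffix, suffix)}"
```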
**Format**
The only fundamental issue with the current row format is the wiki codes. Secondarily, there's the fact that we now distinguish traffic between the desktop site, the mobile site, and the apps. So the format could look like this:
```
en.wikipedia mobile Michelle_Obama 2629 A113B112C101D129E118F92G68H88I54J58K39L87M73N80O184P143Q140R138S147T133U137V128W142X125
```
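For context: the trailing block packs the 24 hourly counts into one field, one letter per hour (`A` = hour 0 through `X` = hour 23, with zero-view hours omitted), as in the current Pagecounts-EZ format. A minimal sketch of a decoder:

```
import re

def decode_hours(encoded: str) -> list[int]:
    """Expand e.g. 'A113B112...' into a list of 24 hourly counts.

    Each letter marks an hour (A = hour 0 ... X = hour 23); hours
    with zero views are omitted from the encoded string.
    """
    counts = [0] * 24
    for letter, value in re.findall(r"([A-X])(\d+)", encoded):
        counts[ord(letter) - ord("A")] = int(value)
    return counts

# The example row above decodes to 24 values summing to the daily total (2629).
```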
I think we can totally do away with the hourly data encoding: it's confusing, it doesn't save space, and storing numbers separated by spaces is probably better for compression than one big block of alphanumeric values (UPDATE: not true). So most likely the row would look like this (UPDATE: nope, it will look like the one above):
```
en.wikipedia mobile Michelle_Obama 2629 113 112 101 129 118 92 68 88 54 58 39 87 73 80 184 143 140 138 147 133 137 128 142 125
```
Update: I did a quick test, putting each of the two rows above in its own file and compressing it; the compressed version of the full-name, non-encoded row came out 35 bytes lighter.
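For reference, a rough way to reproduce that kind of comparison with Python's built-in `bz2` (a single repeated row is only a sanity check, not a benchmark; a real test should use actual dump files):

```
import bz2

encoded = b"en.wikipedia mobile Michelle_Obama 2629 A113B112C101D129E118F92G68H88I54J58K39L87M73N80O184P143Q140R138S147T133U137V128W142X125\n"
plain = b"en.wikipedia mobile Michelle_Obama 2629 113 112 101 129 118 92 68 88 54 58 39 87 73 80 184 143 140 138 147 133 137 128 142 125\n"

# Compare compressed sizes of the two candidate row formats.
for label, row in (("encoded", encoded), ("plain", plain)):
    print(label, len(bz2.compress(row * 1000)))
```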
**Backfilling**
There are three parts to backfill on this dataset:
# As mentioned above, the 2008-2011 data obtained from pagecounts-raw by @CristianCantoro.
# The part generated via WebstatsCollector from 2011 to 2016, which doesn't include mobile pageviews.
# The current part, generated from the Pageview Hourly legacy dumps.
Part 1 will probably have to go through the same corrections as part 2 (correcting wiki names, un-skewing the data). Part 3 will be backfilled from Pageview Hourly in order to fix the [[ https://phabricator.wikimedia.org/T249984#6093223 | local chapters-mobile Wikipedia conflict ]]. This problem doesn't affect parts 1 and 2 because no mobile pageview data was available back then.
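As a purely hypothetical illustration of the un-skewing step (assuming the skew means each hourly bucket is labeled one hour later than the traffic it covers; the actual offset and direction should be confirmed against the Wikitech page linked above):

```
def unskew_day(hourly: list[int], next_day_hour_zero: int) -> list[int]:
    """Hypothetical one-hour correction for a single day's 24 counts.

    Assumes the recorded count labeled hour N actually covers hour N-1,
    so the true value for hour H is the recorded value at H+1, and
    hour 23 is filled from the next day's recorded hour 0.
    """
    return hourly[1:] + [next_day_hour_zero]
```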
**Access/docs**
The [[ https://dumps.wikimedia.org/other/pagecounts-ez/ | dumps site ]] should be redone so that it classifies our available datasets according to their distinct features, as opposed to the current form, which reads more like "here's a list of all our mildly different pageview datasets!".