Hi,
This relates to [[ https://phabricator.wikimedia.org/T149358 | T149358 ]] and [[ https://github.com/MusikAnimal/pageviews/issues/185 | MusikAnimal/pageviews#185 on GitHub ]].
I have already processed all the data from [[ https://dumps.wikimedia.org/other/pagecounts-raw/ | pagecounts-raw ]] and [[ https://dumps.wikimedia.org/other/pagecounts-all-sites/ | pagecounts-all-sites ]], re-sorting it by page instead of by hour (a sketch of this step follows the two samples below).
Here's a sample of the original data (from `pagecounts-20071229-120000.gz`):
```
en Albert_Einstein 199 199
en Albert_Einstein%27s_brain 10 10
en Albert_Einstein's_brain 3 3
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/main.css 1 1
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/shared.css 1 1
en Albert_Einstein_College_of_Medicine 2 2
en Albert_Einstein_High_School 1 1
en Albert_Einstein_Medal 1 1
```
Here's a sample of the same data re-sorted by page, one file per month:
```
en Albert_Einstein 20071209-180000 51 51
en Albert_Einstein 20071209-190000 471 471
en Albert_Einstein 20071209-200000 545 545
en Albert_Einstein 20071209-210000 546 546
en Albert_Einstein 20071209-220000 497 497
en Albert_Einstein 20071209-230000 564 564
en Albert_Einstein 20071210-000000 540 540
en Albert_Einstein 20071210-010000 567 567
en Albert_Einstein 20071210-020000 547 547
en Albert_Einstein 20071210-030000 557 557
```
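For context, here is a minimal sketch of that resort step in Python. Everything in it is illustrative: the file names, the glob pattern, and especially the in-memory grouping; the real processing has to stream or use an external sort, since a full month does not fit in memory.
```python
import glob
import gzip
import os
from collections import defaultdict

# Minimal sketch of the resort step: read every hourly pagecounts file
# for one month and regroup the lines by (project, page) instead of by
# hour. All paths here are illustrative; a real run needs an external
# sort or map-reduce job, since a whole month won't fit in memory.
def resort_month(pattern="pagecounts-200712*.gz",
                 out_path="pagecounts-200712-bypage"):
    by_page = defaultdict(list)  # (project, page) -> [(timestamp, views, bytes)]
    for path in sorted(glob.glob(pattern)):
        # The hour is encoded in the file name: pagecounts-YYYYMMDD-HHMMSS.gz
        stamp = os.path.basename(path)[len("pagecounts-"):-len(".gz")]
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 4:
                    continue  # skip malformed lines
                project, page, views, size = parts
                by_page[(project, page)].append((stamp, views, size))
    with open(out_path, "w", encoding="utf-8") as out:
        for project, page in sorted(by_page):
            for stamp, views, size in by_page[(project, page)]:
                out.write(f"{project} {page} {stamp} {views} {size}\n")
```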
The complete dataset (gzip-compressed) is 3.2 TB, averaging roughly 31 GB per month ([[ https://pastebin.com/raw/cRfdX9Ds | complete list ]]), but as you can see above there is a fair amount of repetition in the data.
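To illustrate how much of that repetition could be collapsed: the project and title could be written once per page, followed by compact `day.hour:views` pairs, similar in spirit to the pagecounts-ez format. The sketch below is hypothetical throughout and assumes input in the by-page format shown above:
```python
import gzip

# Illustrative only: collapse the per-page repetition by emitting the
# project and title once, followed by compact "day.hour:views" pairs.
# Input is assumed to be the by-page sorted file shown above.
def compact(in_path, out_path):
    with open(in_path, encoding="utf-8") as f, \
         gzip.open(out_path, "wt", encoding="utf-8") as out:
        current_key = None
        pairs = []
        for line in f:
            project, page, stamp, views, _size = line.split()
            key = (project, page)
            if key != current_key:
                if current_key is not None:
                    out.write(f"{current_key[0]} {current_key[1]} {','.join(pairs)}\n")
                current_key, pairs = key, []
            # "20071210-030000" -> day "10", hour "03"
            day, hour = stamp[6:8], stamp[9:11]
            pairs.append(f"{day}.{hour}:{views}")
        if current_key is not None:
            out.write(f"{current_key[0]} {current_key[1]} {','.join(pairs)}\n")
```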
How can I best provide this data so that a working API can be set up on top of it?
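For concreteness, the query pattern such an API would need to serve looks something like the sketch below. Everything here is hypothetical: the route, the flat-file backend, and the naive linear scan are placeholders, and a real setup would need an indexed store.
```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

def get_hourly_counts(project, page, month):
    # Placeholder backend: naive linear scan over the by-page file for
    # that month. A real deployment would use an indexed store instead.
    rows = []
    try:
        with open(f"pagecounts-{month}-bypage", encoding="utf-8") as f:
            for line in f:
                p, t, stamp, views, _size = line.split()
                if p == project and t == page:
                    rows.append((stamp, int(views)))
    except FileNotFoundError:
        pass
    return rows

# Hypothetical endpoint: hourly views for one page over one month,
# e.g. GET /pageviews/en/Albert_Einstein/200712
@app.route("/pageviews/<project>/<page>/<month>")
def pageviews(project, page, month):
    rows = get_hourly_counts(project, page, month)
    if not rows:
        abort(404)
    return jsonify({
        "project": project,
        "page": page,
        "items": [{"timestamp": stamp, "views": views} for stamp, views in rows],
    })
```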