Hi,
This is related to T149358 and to issue #185 on MusikAnimal/pageviews on GitHub.
I have already processed all the data from pagecounts-raw and pagecounts-all-sites, re-sorting it by page instead of by hour.
Here's a sample of the original data (from pagecounts-20071229-120000.gz):
en Albert_Einstein 199 199
en Albert_Einstein%27s_brain 10 10
en Albert_Einstein's_brain 3 3
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/main.css 1 1
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/shared.css 1 1
en Albert_Einstein_College_of_Medicine 2 2
en Albert_Einstein_High_School 1 1
en Albert_Einstein_Medal 1 1
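For reference, here is a minimal Python sketch of reading one hourly dump in that format and emitting records keyed by page, so a later sort can group all hours of the same page together. This is only an illustration of the idea, not necessarily how my processing was done; the parse_hourly_dump name is mine.

import gzip

def parse_hourly_dump(path, timestamp):
    """Yield (project, page, timestamp, count, bytes) for each
    'project page count bytes' line of an hourly pagecounts file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ")
            if len(fields) != 4:
                continue  # skip malformed lines
            project, page, count, size = fields
            yield project, page, timestamp, int(count), int(size)

# Example with the hourly file shown above:
for rec in parse_hourly_dump("pagecounts-20071229-120000.gz", "20071229-120000"):
    print(*rec)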
Here's a sample of the data sorted by page per month:
en Albert_Einstein 20071209-180000 51 51
en Albert_Einstein 20071209-190000 471 471
en Albert_Einstein 20071209-200000 545 545
en Albert_Einstein 20071209-210000 546 546
en Albert_Einstein 20071209-220000 497 497
en Albert_Einstein 20071209-230000 564 564
en Albert_Einstein 20071210-000000 540 540
en Albert_Einstein 20071210-010000 567 567
en Albert_Einstein 20071210-020000 547 547
en Albert_Einstein 20071210-030000 557 557
The total amount of data, compressed with gzip, is 3.2 TB (31 GB per month on average; complete list), but as you can see above there is a fair amount of repetition.
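Since the project and page title are repeated on every hourly row, one obvious way to cut that redundancy would be to collapse consecutive rows of the same page into a single record. A sketch, assuming the input is already sorted by page as above; the grouped output format here is my own invention, not an existing dump format:

import sys
from itertools import groupby

def rows(stream):
    """Parse 'project page timestamp count bytes' rows."""
    for line in stream:
        project, page, ts, count, size = line.split()
        yield project, page, ts, int(count)

def compact(stream, out=sys.stdout):
    # The input is already sorted by (project, page), so consecutive
    # rows of the same page can be grouped without a full re-sort.
    for (project, page), grp in groupby(rows(stream), key=lambda r: r[:2]):
        series = " ".join(f"{ts}:{count}" for _, _, ts, count in grp)
        out.write(f"{project} {page} {series}\n")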
How should I provide this data in order to set up a working API?
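To make the question concrete, this is roughly the query shape such an API would have to serve: hourly counts for one page over a time range. A minimal Flask sketch; the /pagecounts route, the lookup() helper and the file name are all hypothetical, and a linear scan over 3.2 TB obviously would not work, which is exactly the open question.

from flask import Flask, jsonify, request

app = Flask(__name__)

def lookup(project, page, start, end, path="pagecounts-sorted"):
    """Linear scan over the per-page sorted rows shown above;
    a real backend would index by page instead."""
    with open(path, encoding="utf-8") as f:
        return [
            (ts, int(count))
            for proj, title, ts, count, size in (line.split() for line in f)
            if proj == project and title == page and start <= ts <= end
        ]

@app.route("/pagecounts")
def pagecounts():
    # e.g. /pagecounts?project=en&page=Albert_Einstein
    #        &start=20071209-180000&end=20071210-030000
    args = request.args
    return jsonify([
        {"timestamp": ts, "count": count}
        for ts, count in lookup(args["project"], args["page"],
                                args["start"], args["end"])
    ])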