
Add legacy per-article pagecounts data (prior to 2015)
Open, Lowest, Public

Description

Hi,

this is related to T149358 and to issue #185 of MusikAnimal/pageviews on GitHub.

I have already processed all the data from pagecounts-raw and pagecounts-all-sites, sorting it by page instead of by hour (see the sketch after the samples below).

Here's a sample of the original data (from pagecounts-20071229-120000.gz):

en Albert_Einstein 199 199
en Albert_Einstein%27s_brain 10 10
en Albert_Einstein's_brain 3 3
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/main.css 1 1
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/shared.css 1 1
en Albert_Einstein_College_of_Medicine 2 2
en Albert_Einstein_High_School 1 1
en Albert_Einstein_Medal 1 1

Here's a sample of the same data sorted by page, with one file per month:

en Albert_Einstein 20071209-180000 51 51
en Albert_Einstein 20071209-190000 471 471
en Albert_Einstein 20071209-200000 545 545
en Albert_Einstein 20071209-210000 546 546
en Albert_Einstein 20071209-220000 497 497
en Albert_Einstein 20071209-230000 564 564
en Albert_Einstein 20071210-000000 540 540
en Albert_Einstein 20071210-010000 567 567
en Albert_Einstein 20071210-020000 547 547
en Albert_Einstein 20071210-030000 557 557
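
For reference, here is a minimal sketch of the kind of re-sorting described above. This is not the actual pipeline I used; the directory and file names are hypothetical, and it keeps a whole month in memory, which would not work on ~31 GB of compressed hourly files per month (a real run would need an external sort, e.g. GNU sort, or a map-reduce job).

```
#!/usr/bin/env python3
"""Toy sketch: re-sort hourly pagecounts-raw dumps into one per-page,
per-month file shaped like the sample above. Paths and file names are
hypothetical; a real month would not fit in memory like this."""
import glob
import gzip
import os
import re
from collections import defaultdict

HOURLY_DIR = "pagecounts-raw/2007-12"          # hypothetical input directory
OUTPUT_FILE = "pagecounts-byname-200712.gz"    # hypothetical output file

# Hourly dumps are named e.g. pagecounts-20071229-120000.gz
TS_RE = re.compile(r"pagecounts-(\d{8}-\d{6})\.gz$")

# (project, page) -> list of (timestamp, count, bytes) rows
by_page = defaultdict(list)

for path in sorted(glob.glob(os.path.join(HOURLY_DIR, "pagecounts-*.gz"))):
    ts = TS_RE.search(path).group(1)
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:        # skip malformed lines
                continue
            project, page, count, size = parts
            by_page[(project, page)].append((ts, count, size))

# Emit one line per (page, hour), grouped and ordered by page as in the sample.
with gzip.open(OUTPUT_FILE, "wt", encoding="utf-8") as out:
    for project, page in sorted(by_page):
        for ts, count, size in by_page[(project, page)]:
            out.write(f"{project} {page} {ts} {count} {size}\n")
```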

The data (compressed with gzip) is 3.2 TB in total (on average 31 GB per month, complete list), but as you can see above there is a fair amount of repetition.

How can I provide this data so as to set up a working API?

Event Timeline

Thanks for creating this! I think it might be covered by T173720.

Milimetric subscribed.

I'm adding this as a subtask, so we don't lose track of the nice work done to collect the data, but unfortunately the blocker here is space. We appreciate the computation and cleaning up of the data, but we need more space on the Cassandra cluster before we can load this into the API. In the meantime, maybe we can host the data on dumps as files?

Milimetric moved this task from Incoming to Backlog (Later) on the Analytics board.

In the meantime, maybe we can host the data on dumps as files?

I am wondering if I can put this data in a format that would be more compact. Any suggestions?
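One possibility, purely as a sketch and not something anyone has endorsed here: since the project and page name are repeated on every hourly row of the sorted files, each page could be collapsed to a single line per month carrying only hour:count pairs. The file names below are hypothetical and follow on from the sorting sketch above.

```
#!/usr/bin/env python3
"""Hypothetical compaction sketch: collapse all hourly rows of a page into a
single line, so the project and page name appear only once per month.
Input is the per-page sorted file from the previous sketch (names assumed)."""
import gzip
from itertools import groupby

IN_FILE = "pagecounts-byname-200712.gz"     # hypothetical, from the sort step
OUT_FILE = "pagecounts-compact-200712.gz"   # hypothetical output

def parse(line):
    project, page, ts, count, _size = line.rstrip("\n").split(" ")
    return project, page, ts, count

with gzip.open(IN_FILE, "rt", encoding="utf-8") as fin, \
     gzip.open(OUT_FILE, "wt", encoding="utf-8") as fout:
    rows = (parse(l) for l in fin if l.strip())
    for (project, page), hours in groupby(rows, key=lambda r: (r[0], r[1])):
        # Keep only day+hour ("20071209-18") and the view count for each hour,
        # e.g. "en Albert_Einstein 20071209-18:51,20071209-19:471,..."
        pairs = ",".join(f"{ts[:11]}:{count}" for _, _, ts, count in hours)
        fout.write(f"{project} {page} {pairs}\n")
```

Whether this actually saves much space after gzip would need to be measured, since gzip already exploits the repeated prefixes to some extent.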

Anyway, I am totally OK with uploading these data; I think I just need a server on which to save them.

@CristianCantoro: I'm sorry I didn't think of this, but isn't this what pagecounts-ez already did? https://dumps.wikimedia.org/other/pagecounts-ez/merged/

Oh but you have them going back to 2007. Hm, maybe using your data we can resolve this task: https://phabricator.wikimedia.org/T188041 and then it would be available in a fairly compressed way for everyone.


Happy to help, let's continue there for the moment.

Vvjjkkii renamed this task from Add legacy per-article pagecounts data (prior to 2015) to 4odaaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii raised the priority of this task from Low to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot lowered the priority of this task from High to Low. Jul 3 2018, 3:25 AM
odimitrijevic lowered the priority of this task from Low to Lowest. Jan 6 2022, 3:39 AM
odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.