
Add legacy per-article pagecounts data (prior to 2015)
Open, Lowest, Public

Description

Hi,

this is related to T149358 and to issue #185 of MusikAnimal/pageviews on GitHub.

I have already processed all the data from pagecounts-raw and pagecounts-all-sites, sorting it by page instead of by hour (see the sketch after the samples below).

Here's a sample of the original data (from pagecounts-20071229-120000.gz):

en Albert_Einstein 199 199
en Albert_Einstein%27s_brain 10 10
en Albert_Einstein's_brain 3 3
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/main.css 1 1
en Albert_Einstein_-_Wikipedia%2C_the_free_encyclopedia_files/shared.css 1 1
en Albert_Einstein_College_of_Medicine 2 2
en Albert_Einstein_High_School 1 1
en Albert_Einstein_Medal 1 1

Here's a sample of the same data sorted by page, with one file per month:

en Albert_Einstein 20071209-180000 51 51
en Albert_Einstein 20071209-190000 471 471
en Albert_Einstein 20071209-200000 545 545
en Albert_Einstein 20071209-210000 546 546
en Albert_Einstein 20071209-220000 497 497
en Albert_Einstein 20071209-230000 564 564
en Albert_Einstein 20071210-000000 540 540
en Albert_Einstein 20071210-010000 567 567
en Albert_Einstein 20071210-020000 547 547
en Albert_Einstein 20071210-030000 557 557
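
For reference, here is a minimal sketch of the kind of re-sorting described above. This is not the actual pipeline I used; the directory and file names are hypothetical, and it keeps a whole month in memory, which would not work on ~31 GB of compressed hourly files per month (a real run would need an external sort, e.g. GNU sort, or a map-reduce job).

```
#!/usr/bin/env python3
"""Toy sketch: re-sort hourly pagecounts-raw dumps into one per-page,
per-month file shaped like the sample above. Paths and file names are
hypothetical; a real month would not fit in memory like this."""
import glob
import gzip
import os
import re
from collections import defaultdict

HOURLY_DIR = "pagecounts-raw/2007-12"          # hypothetical input directory
OUTPUT_FILE = "pagecounts-byname-200712.gz"    # hypothetical output file

# Hourly dumps are named e.g. pagecounts-20071229-120000.gz
TS_RE = re.compile(r"pagecounts-(\d{8}-\d{6})\.gz$")

# (project, page) -> list of (timestamp, count, bytes) rows
by_page = defaultdict(list)

for path in sorted(glob.glob(os.path.join(HOURLY_DIR, "pagecounts-*.gz"))):
    ts = TS_RE.search(path).group(1)
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:        # skip malformed lines
                continue
            project, page, count, size = parts
            by_page[(project, page)].append((ts, count, size))

# Emit one line per (page, hour), grouped and ordered by page as in the sample.
with gzip.open(OUTPUT_FILE, "wt", encoding="utf-8") as out:
    for project, page in sorted(by_page):
        for ts, count, size in by_page[(project, page)]:
            out.write(f"{project} {page} {ts} {count} {size}\n")
```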

The data (compressed with gzip) is 3.2 TB in total (on average 31 GB per month, complete list), but as you can see above there is a fair amount of repetition.

How can I provide this data so as to set up a working API?

Event Timeline

Thanks for creating this! I think it might be covered by T173720.

Milimetric subscribed.

I'm adding this as a subtask, so we don't lose track of the nice work done to collect the data, but unfortunately the blocker here is space. We appreciate the computation and cleaning up of the data, but we need more space on the Cassandra cluster before we can load this into the API. In the meantime, maybe we can host the data on dumps as files?

Milimetric moved this task from Incoming to Backlog (Later) on the Analytics board.

In the meantime, maybe we can host the data on dumps as files?

I am wondering if I can put this data in a format that would be more compact. Any suggestions?
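One possibility, purely as a sketch and not something anyone has endorsed here: since the project and page name are repeated on every hourly row of the sorted files, each page could be collapsed to a single line per month carrying only hour:count pairs. The file names below are hypothetical and follow on from the sorting sketch above.

```
#!/usr/bin/env python3
"""Hypothetical compaction sketch: collapse all hourly rows of a page into a
single line, so the project and page name appear only once per month.
Input is the per-page sorted file from the previous sketch (names assumed)."""
import gzip
from itertools import groupby

IN_FILE = "pagecounts-byname-200712.gz"     # hypothetical, from the sort step
OUT_FILE = "pagecounts-compact-200712.gz"   # hypothetical output

def parse(line):
    project, page, ts, count, _size = line.rstrip("\n").split(" ")
    return project, page, ts, count

with gzip.open(IN_FILE, "rt", encoding="utf-8") as fin, \
     gzip.open(OUT_FILE, "wt", encoding="utf-8") as fout:
    rows = (parse(l) for l in fin if l.strip())
    for (project, page), hours in groupby(rows, key=lambda r: (r[0], r[1])):
        # Keep only day+hour ("20071209-18") and the view count for each hour,
        # e.g. "en Albert_Einstein 20071209-18:51,20071209-19:471,..."
        pairs = ",".join(f"{ts[:11]}:{count}" for _, _, ts, count in hours)
        fout.write(f"{project} {page} {pairs}\n")
```

Whether this actually saves much space after gzip would need to be measured, since gzip already exploits the repeated prefixes to some extent.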

Anyway, I am totally OK with uploading these data; I think I just need a server on which to save them.

@CristianCantoro: I'm sorry I didn't think of this, but isn't this what pagecounts-ez already did? https://dumps.wikimedia.org/other/pagecounts-ez/merged/

Oh but you have them going back to 2007. Hm, maybe using your data we can resolve this task: https://phabricator.wikimedia.org/T188041 and then it would be available in a fairly compressed way for everyone.


Happy to help, let's continue there for the moment.

Vvjjkkii renamed this task from Add legacy per-article pagecounts data (prior to 2015) to 4odaaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii raised the priority of this task from Low to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot lowered the priority of this task from High to Low. Jul 3 2018, 3:25 AM
odimitrijevic lowered the priority of this task from Low to Lowest. Jan 6 2022, 3:39 AM
odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.