
Generate a monthly pageviews dataset
Open, Needs Triage, Public

Description

The resulting dataset should have these columns:

  • page_namespace -- int, the namespace identifier
  • page_title -- str, the normalized page title
  • month -- str, %Y%m%d%H%i%s
  • views -- int, the number of page loads
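
For concreteness, here is a minimal sketch of what rows in this schema could look like, written as TSV with Python's csv module. The sample titles and counts are invented; note that the %Y%m%d%H%i%s format expands to a 14-digit timestamp, e.g. the first second of the month:

```python
import csv
import sys

# Hypothetical sample rows matching the proposed schema:
# page_namespace (int), page_title (str), month (%Y%m%d%H%i%s), views (int)
rows = [
    (0, "Barack_Obama", "20160801000000", 1523421),
    (1, "Talk:Barack_Obama", "20160801000000", 311),
]

writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(["page_namespace", "page_title", "month", "views"])
writer.writerows(rows)
```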

This dataset would be useful for intersecting with the monthly article quality prediction dataset (see T145655).

Event Timeline

Halfak created this task. · Sep 21 2016, 2:52 PM
Restricted Application added a subscriber: Aklapper. · Sep 21 2016, 2:52 PM
Halfak updated the task description. · Sep 21 2016, 3:26 PM
Ghassanmas added a comment (edited). · Sep 26 2016, 8:29 PM

@Halfak it took 312 seconds to query the views of Aug 2016 for 5K titles.
At this rate it would take roughly 90 hours to get views for all 5M+ articles (5M / 5K = 1,000 batches × 312 s ≈ 87 hours). There is also an issue with mwviews that I have been trying to solve. I will be looking for a hack to bring those 90 hours down.
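
For reference, these per-title queries presumably go through the mwviews client against the Wikimedia pageview API. Here is a minimal sketch of a batched monthly query, assuming the PageviewsClient interface from the mwviews package (exact constructor arguments may differ between versions):

```python
from mwviews.api import PageviewsClient

# Client for the Wikimedia REST pageview API, which mwviews wraps.
client = PageviewsClient(user_agent="monthly-pageviews-dataset sketch")

titles = ["Barack_Obama", "Selfie"]  # in practice, batches of ~5K titles

# Monthly granularity over August 2016; returns {date: {title: views}}.
views = client.article_views(
    "en.wikipedia", titles,
    granularity="monthly", start="20160801", end="20160831",
)
for date, counts in views.items():
    for title, n in counts.items():
        print(date, title, n)
```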

@Nuria @Milimetric @ezachte just giving you a heads-up that we're preparing a static dataset of historical data with monthly aggregates. This will be different from the new PV data but still useful for longitudinal analysis.

We have had a monthly dataset [1] for several years now, with title, month, and views. It does not include namespace, but I think that could be inferred from the title.

@Halfak, could that hack be batch processing?
It takes about 20-30 minutes to retrieve PV counts for 100k titles from all 50 million entries in the ordered dataset [1].

[1] https://dumps.wikimedia.org/other/pagecounts-ez/merged/
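
A sketch of the batch pass described above: stream one monthly merged dump once and pick out the titles of interest, rather than issuing one API call per title. The file name and line layout here are assumptions about the pagecounts-ez format, so verify them against an actual file from [1]:

```python
import bz2

# Hypothetical file name; see the merged/ directory listing for real names.
DUMP = "pagecounts-2016-08-views-ge-5.bz2"

wanted = {"Barack_Obama", "Selfie"}  # one batch of titles to look up
counts = {}

with bz2.open(DUMP, mode="rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        # Assumed layout: "<project> <page_title> <monthly_total> ...",
        # where "en.z" is the pagecounts-ez code for English Wikipedia.
        parts = line.split(" ")
        if len(parts) < 3 or parts[0] != "en.z":
            continue
        if parts[1] in wanted:
            counts[parts[1]] = int(parts[2])

print(counts)
```

Because the dump is ordered and read sequentially, a single pass covers an arbitrarily large batch of titles, which is what makes this so much cheaper than per-title API calls.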

@DarTar: so it sounds like this dataset would add page_namespace on top of what's already available in pagecounts-ez, right? We figured out a good method to extract that from page titles in Hadoop, if @Halfak or whoever is implementing this is interested. Namely, we have a table that maps namespace prefixes to namespace names, plus Scala functions to get all that. We have also reconstructed page titles historically (taking renames into account) for almost all pages in MediaWiki history (archived pages with page_id == 0 were reconstructed with artificial ids as well).
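
Outside Hadoop, the same prefix lookup can be built from the MediaWiki siteinfo API. Here is a minimal Python sketch of the idea (the Scala functions mentioned above are the real implementation; this just illustrates mapping a full title to a namespace ID):

```python
import requests

# Fetch the namespace table (ids, local names, canonical names, aliases)
# from the MediaWiki API for one wiki.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query", "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases", "format": "json",
    },
    headers={"User-Agent": "namespace-lookup-sketch"},
).json()

prefix_to_ns = {}
for ns in resp["query"]["namespaces"].values():
    for name in {ns.get("*", ""), ns.get("canonical", "")}:
        if name:
            prefix_to_ns[name.lower()] = ns["id"]
for alias in resp["query"].get("namespacealiases", []):
    prefix_to_ns[alias["*"].lower()] = alias["id"]

def split_title(full_title):
    """Map a full page title to (namespace_id, title without prefix)."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix.lower() in prefix_to_ns:
        return prefix_to_ns[prefix.lower()], rest
    return 0, full_title  # no recognized prefix -> main namespace

print(split_title("Talk:Barack_Obama"))  # -> (1, 'Barack_Obama')
```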

Let me know if any of that is interesting and I'm happy to show you around this data. Right now it's just intermediate data as we work on the algorithms, but it will all be productionized this quarter.

p.s. Apparently "productionized" spellchecks to "product ionized" and I'm thinking there's a good physics explanation there :)