
Aggregate pageviews to Wikidata entities
Open · Low · Public

Description

Would it be possible to produce a daily file on http://dumps.wikimedia.org/ with pageviews aggregated to Wikidata entities?

I believe this would be super useful for Wikidata maintenance, since it would allow ranking entities by how often their pages are viewed. For example, the CSV file for a given day would contain lines such as "Q7197 678901", meaning that entity Q7197 got 678901 total pageviews on that day, aggregated over all Wikimedia pages for that entity (in any language). From this, it would be easy to extract the top-N entities.
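For illustration, here is a minimal sketch (in Python) of how such a file could be consumed to get the top-N entities; the file name and the space-separated line format are assumptions about a dump that does not exist yet:

```python
import heapq

def top_n_entities(path, n=100):
    """Return the n entities with the most pageviews from a hypothetical
    daily file whose lines look like 'Q7197 678901'."""
    with open(path, encoding="utf-8") as f:
        rows = (line.split() for line in f if line.strip())
        counted = ((int(count), qid) for qid, count in rows)
        return [(qid, count) for count, qid in heapq.nlargest(n, counted)]

# Usage, assuming such a dump existed:
# print(top_n_entities("wikidata-pageviews-20190206.txt", n=10))
```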

It’s definitely possible to compute this oneself from the existing downloadable files, but it’s painful to keep the analytics jobs running. Therefore, it would be great if Wikimedia could produce such a file. I’d even volunteer to write the code, but I wouldn’t know where to start (which programming language, etc.).

Event Timeline

Sascha created this task. · Feb 6 2019, 5:00 PM
Restricted Application added a subscriber: Aklapper. · Feb 6 2019, 5:00 PM
fdans triaged this task as Low priority. · Feb 7 2019, 6:06 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

If nobody else has time to do this, may I volunteer to write the code? Please tell me where to start (which programming language, what framework, etc.).

Friendly ping?

Meanwhile, I’ve built an ad-hoc “pipeline” for computing this on a laptop. But I still think that having this metric would be generally useful for all kinds of projects, both for users of Wikidata and for maintainers of various Wikimedia projects. Therefore, I’d like to kindly ask you to reconsider the task’s current Low priority. For example, here’s a map of Swiss castles which ranks geographic features based on this metric. For further illustration, please find below some sample pageview counts aggregated over all languages on all Wikimedia projects during two weeks in March 2019:

#rank      qid         pageviews  description
1          Q5296       86904509   Wikimedia Main Page (*.wikipedia.org, *.wikisource.org, etc.)
2          Q318165     6203010    Luke Perry (American actor and teen idol, Coy Luther Perry III)
10         Q468054     1441024    Mick Mars (American musician, Mötley Crüe)
100        Q28164181   330560     2019 Indian general election
249        Q22686      209174     Donald Trump (45th and current president of the United States)
1000       Q319877     104766     Alprazolam (Xanax)
5203       Q72         44610      Zürich (capital of the canton of Zürich, Switzerland)
8954       Q7197       33629      Simone de Beauvoir (French writer)
10000      Q9036623    31456      Kotori Shigemoto (Japanese fashion model)
12891      Q8819       26728      Unicode (technical standard)
45620      Q486860     8998       Mountain View (city in Santa Clara County, California, United States)
100000     Q204234     4430       Milton Keynes (town in Buckinghamshire, England)
232324     Q286345     2128       Shift_JIS (character encoding)
497170     Q688539     762        Rapperswil (town on Lake Zurich, part of Rapperswil-Jona, Canton of St. Gallen, Switzerland)
1000000    Q1752709    313        cretic (metrical foot)
13801807   Q18618629   2          Denny Vrandečić (Croatian computer scientist)

My current implementation is fairly simple. It works as follows (a rough sketch follows the list):

  1. The pagecounts-ez files are processed to build a page -> count mapping for each day, where page is the lowercased key from the original file, and count is the sum of the pageview counts for that page. The lowercasing was done because the current pagecounts-ez file sometimes lists multiple entries for the same Wikimedia page, using different casing variants for the page title. In my implementation, I’ve used an internationalization-aware lowercasing function because some languages (such as Turkish or Azeri) have special lowercasing rules; this might be over-engineered. Also, the implementation maps mobile and desktop pageviews to the same key, since this distinction is irrelevant for the metric. The output of this step is sorted by page key.
  2. The Wikidata JSON dump is processed to build a page -> qid mapping, where page has the same format as above. Each JSON entity yields one mapping item per sitelink, so a single entity can produce a (potentially) large set of entries. Again, the output of this step is sorted by page key.
  3. The mappings are joined. Because the outputs of the previous steps are sorted in the same order, the joining can be done by a linear scan through the input files, which is much faster than doing table lookups for each item.
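Roughly, the laptop pipeline looks like the sketch below (simplified; the normalize_key helper, the site_map argument, and the assumed input formats are placeholders rather than my exact code):

```python
import json
from collections import defaultdict

def normalize_key(project, title):
    """Build a join key from a project identifier and a page title.
    NOTE: placeholder; the real mapping between pagecounts-ez project codes
    and Wikidata sitelink site IDs needs more care, as does the
    internationalization-aware lowercasing mentioned above."""
    return f"{project} {title.replace(' ', '_').lower()}"

def build_page_counts(pagecounts_path, out_path):
    """Step 1: aggregate pageviews per normalized page key and write them
    out sorted by key. Assumes whitespace-separated lines roughly of the
    form '<project> <page_title> <count> ...' (simplified)."""
    counts = defaultdict(int)
    with open(pagecounts_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            project, title, count = fields[0], fields[1], fields[2]
            counts[normalize_key(project, title)] += int(count)
    with open(out_path, "w", encoding="utf-8") as out:
        for key in sorted(counts):
            out.write(f"{key}\t{counts[key]}\n")

def build_page_qids(wikidata_json_path, out_path, site_map):
    """Step 2: map every sitelink of every entity in the Wikidata JSON dump
    to a '<page key>\t<qid>' line, sorted by page key. site_map is an
    assumed dict from sitelink site IDs to pagecounts project codes."""
    pairs = []
    with open(wikidata_json_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the surrounding '[' and ']' of the dump
            entity = json.loads(line)
            qid = entity.get("id", "")
            for site, link in entity.get("sitelinks", {}).items():
                if site in site_map:
                    pairs.append((normalize_key(site_map[site], link["title"]), qid))
    pairs.sort()
    with open(out_path, "w", encoding="utf-8") as out:
        for key, qid in pairs:
            out.write(f"{key}\t{qid}\n")

def merge_join(counts_path, qids_path):
    """Step 3: merge-join the two sorted files on the page key by a single
    linear scan, aggregating pageviews per qid."""
    totals = defaultdict(int)
    with open(counts_path, encoding="utf-8") as counts, \
         open(qids_path, encoding="utf-8") as qids:
        count_line, qid_line = counts.readline(), qids.readline()
        while count_line and qid_line:
            ckey, count = count_line.rstrip("\n").split("\t")
            qkey, qid = qid_line.rstrip("\n").split("\t")
            if ckey < qkey:
                count_line = counts.readline()
            elif ckey > qkey:
                qid_line = qids.readline()
            else:
                totals[qid] += int(count)
                # a page maps to one entity, but an entity has many pages,
                # so advance only the qid side here
                qid_line = qids.readline()
    return totals
```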

This should be fairly easy to implement in a framework like Hadoop, or whatever other big-data framework Wikimedia uses for its analytics. As mentioned above, I’ll gladly write the code for the Wikimedia Foundation if you tell me which framework and programming language you’d like me to use.
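For illustration, the aggregation could be expressed in a few lines of PySpark, assuming the pageview data and a page-to-qid sitelink mapping were available as tables; the table and column names below are made-up placeholders, not actual schemas:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wikidata-pageviews").getOrCreate()

# Hypothetical inputs; the names are assumptions for the sketch only.
pageviews = spark.table("pageviews_daily")      # columns: project, page_title, view_count
sitelinks = spark.table("wikidata_sitelinks")   # columns: project, page_title, qid

daily = (
    pageviews
    .join(sitelinks, on=["project", "page_title"])
    .groupBy("qid")
    .agg(F.sum("view_count").alias("pageviews"))
    .orderBy(F.desc("pageviews"))
)

# Write space-separated lines like "Q7197 678901".
daily.write.option("sep", " ").csv("/tmp/wikidata-pageviews-daily")
```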