
Create a Historical Link Graph for Wikipedia
Closed, Resolved (Public)

Description

As complementary data for the Clickstream (https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream), we are exploring the possibility of creating a historical dataset of the Wikipedia link graph, covering the same languages for which Clickstream data is available. Link information helps in interpreting the clicks: among other things, it makes it possible to compute the probability of a click, and it also allows studying how topics in the online encyclopedia become connected over time.

Event Timeline

diego triaged this task as Medium priority. Feb 5 2018, 8:05 PM
DarTar renamed this task from Create a Link Graph for Wikipedia Pages to Create a Historical Link Graph for Wikipedia. Feb 7 2018, 5:37 AM
DarTar added a project: Data-release.
DarTar updated the task description.

@DarTar: Do you have any preference for the format of this dataset? I can think of two ways to present it:

i) Store all the links for every page revision. Each line will represent a page, like this:
{page_id: page_id, title: title, historical_links: [{rev_id: first_rev, links: [link_1, ..., link_n]}, ..., {rev_id: current_rev, links: [link_1, ..., link_m]}]}

ii) For each revision, store only the links that were added and the links that were removed relative to the previous revision.
{page_id: page_id, title: title, historical_links: [{rev_id: first_rev, new_links: [link_1, ..., link_n], removed_links: []}, ..., {rev_id: current_rev, new_links: [link_x,n], removed_links: [link_y]}]}

Or do you have anything different in mind?
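
To make the difference between the two layouts concrete, here is a minimal Python sketch that converts a page record from format (i), with full link lists per revision, to format (ii), with added and removed links per revision, and back again. It is only an illustration under assumptions: the function names and the sample page are hypothetical, the field names follow the proposal above, and links are treated as an unordered set, so duplicate links and link order within a revision are not preserved.

```python
def snapshots_to_deltas(snapshots):
    """Format (i) -> format (ii): full link lists to per-revision added/removed links.

    Assumes `snapshots` is ordered from the first revision to the current one.
    """
    deltas = []
    previous = set()  # links present in the previous revision (none before the first)
    for rev in snapshots:
        current = set(rev["links"])
        deltas.append({
            "rev_id": rev["rev_id"],
            "new_links": sorted(current - previous),
            "removed_links": sorted(previous - current),
        })
        previous = current
    return deltas


def deltas_to_snapshots(deltas):
    """Format (ii) -> format (i): replay added/removed links to rebuild full lists."""
    snapshots = []
    current = set()
    for rev in deltas:
        current = (current | set(rev["new_links"])) - set(rev["removed_links"])
        snapshots.append({"rev_id": rev["rev_id"], "links": sorted(current)})
    return snapshots


# Hypothetical sample page in format (i); IDs and titles are made up.
page = {
    "page_id": 1,
    "title": "Example",
    "historical_links": [
        {"rev_id": 100, "links": ["A", "B"]},
        {"rev_id": 101, "links": ["A", "B", "C"]},
        {"rev_id": 102, "links": ["A", "C"]},
    ],
}

deltas = snapshots_to_deltas(page["historical_links"])
restored = deltas_to_snapshots(deltas)
```

Format (ii) is usually more compact when most revisions change few links, but reading the link set at a given revision requires replaying all earlier revisions, whereas format (i) gives it directly.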

diego changed the task status from Open to Stalled. Mar 19 2018, 5:45 PM
leila lowered the priority of this task from Medium to Low. Mar 19 2018, 5:51 PM

We are considering collaborating on this with EPFL. To proceed, we need to:

  • Assess the amount of engineering work required on our (research team) side and discuss with the Analytics team their bandwidth to help with this, especially with an updated Parquet version (Spark) of the dumps.

Next step: @diego to review plans with Analytics (bundled with the discussion on Spark-based dump processing).

leila raised the priority of this task from Low to Medium. May 14 2018, 6:46 PM

Blessing it with a "Normal" priority. ;) It's important. Let's do it.

Milimetric added a subscriber: Milimetric.

Let's collaborate on infrastructure for this.