
Implement clickstream & navigation vectors as a regular job
Closed, Duplicate · Public

Description

https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

Both of these projects are cool and they have produced some really valuable datasets. With @ellery moving on from the WMF, these datasets are unlikely to get updates in the near future. We should turn them into regular jobs and host the output somewhere for download.

Event Timeline

Halfak created this task. · Apr 25 2017, 2:55 PM
Restricted Application added a subscriber: Aklapper. · Apr 25 2017, 2:55 PM

@Shilad is requesting this as something he'd use to keep WikiBrain up to date and for other research.

I am happy to help with engineering on this if we can find a way to make that work. I've set up navigation-based word2vec pipelines in similar environments (PySpark, Oozie, etc.) in the past.

Halfak awarded a token. · Edited · Apr 25 2017, 3:02 PM

If we want to go that route, I volunteer to help @Shilad (a long-time contributor to Wikimedia technology advancement) do the NDA dance.

Nuria added a subscriber: Nuria. · Apr 25 2017, 3:22 PM

@Shilad: data munging for the larger wikis might require tricks for splitting jobs in parallel that are not super obvious, so I would do a prototype for a small wiki first (simplewiki?). Feeding the data to word2vec is probably the easiest part of the job; computing the data you will be feeding it is not trivial.
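To make the word2vec step concrete, here is a minimal stdlib-only sketch that turns ordered reading sessions into skip-gram training pairs. The session data and page titles are hypothetical, and a real pipeline would extract sessions from webrequest logs and feed the pairs (or the raw sessions) to a word2vec implementation such as gensim.

```python
def skipgram_pairs(session, window=2):
    """Yield (center, context) training pairs for word2vec from one
    ordered sequence of page views (a reading session)."""
    for i, center in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (center, session[j])

# Hypothetical sessions: ordered page titles viewed by one reader.
sessions = [
    ["Cat", "Felidae", "Lion"],
    ["Cat", "Dog"],
]
pairs = [p for s in sessions for p in skipgram_pairs(s, window=1)]
# With window=1, each adjacent pair appears in both directions,
# e.g. ("Cat", "Felidae") and ("Felidae", "Cat").
```

The pair generation itself is trivially parallel per session, which is why the hard part is upstream: sessionizing the raw request data at scale.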

Presumably, @ellery already had jobs to compute and clean up clickstream data working on our infrastructure.

Nuria added a comment. · Apr 25 2017, 3:53 PM

> Presumably, @ellery already had jobs to compute and clean up clickstream data working on our infrastructure.

Code is here: https://github.com/ewulczyn/wiki-clickstream and more info here: https://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/

I am not sure that you would use a similar approach in Oozie, as the code relies on many intermediate tables. That makes total sense for a one-off, but I am not sure it makes as much sense for a recurring job: there you would rather use more machines to compute in one pass than more intermediate tables to store data for later computations.
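For illustration, the core (prev, curr, n) aggregation behind the clickstream dataset can be expressed as a single counting pass rather than a chain of intermediate tables. This is a stdlib sketch over hypothetical log rows; the real job additionally classifies referers, resolves redirects, and filters low-count pairs, and at production scale the same shape would be a Spark `groupBy`/count rather than an in-memory Counter.

```python
from collections import Counter

def aggregate_clickstream(referer_pairs, min_count=1):
    """Count (prev, curr) transitions in one pass, producing the
    (prev, curr, n) shape the published clickstream dataset uses.
    Pairs below min_count are dropped, as the real dataset does
    with its privacy threshold."""
    counts = Counter(referer_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical pageview log rows, already reduced to (referer, page).
rows = [
    ("other-search", "Cat"),
    ("Cat", "Felidae"),
    ("other-search", "Cat"),
]
result = aggregate_clickstream(rows)
# result[("other-search", "Cat")] == 2
```

The trade-off Nuria describes is visible here: the one-off version materializes each cleaning step as a table, while a recurring job would keep the pipeline as one distributed computation.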

@Nuria, I've merged this task into one that y'all already have prioritized. Please continue the conversation in T158972: Spark job to produce clickstream dataset