
Implement clickstream & navigation vectors as a regular job
Closed, Duplicate · Public

Description

https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

Both of these projects are cool and they have produced some really valuable datasets. With @ellery moving on from the WMF, these datasets are unlikely to get updates in the near future. We should turn them into regular jobs and host the output somewhere for download.

Event Timeline

Halfak created this task. · Apr 25 2017, 2:55 PM
Restricted Application added a subscriber: Aklapper. · Apr 25 2017, 2:55 PM

@Shilad is requesting this as something he'd use to keep WikiBrain up to date and for other research.

I am happy to help with engineering on this if we can find a way to make that work. I've set up navigation-based word2vec pipelines in similar environments (PySpark, Oozie, etc.) in the past.

Halfak awarded a token. · Edited · Apr 25 2017, 3:02 PM

If we want to go that route, I volunteer to help @Shilad (a long-time contributor to Wikimedia technology advancement) do the NDA dance.

Nuria added a subscriber: Nuria. · Apr 25 2017, 3:22 PM

@Shilad: data munging for the larger wikis might require tricks for splitting jobs in parallel that are not super obvious, so I would do a prototype for a small wiki first (simplewiki?). Feeding the data to word2vec is probably the easiest part of the job; computing the data you will be feeding it is not trivial.
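To make the word2vec step concrete, here is a minimal stdlib-only sketch that turns ordered reading sessions into skip-gram training pairs. The session data and page titles are hypothetical, and a real pipeline would extract sessions from webrequest logs and feed the pairs (or the raw sessions) to a word2vec implementation such as gensim.

```python
def skipgram_pairs(session, window=2):
    """Yield (center, context) training pairs for word2vec from one
    ordered sequence of page views (a reading session)."""
    for i, center in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (center, session[j])

# Hypothetical sessions: ordered page titles viewed by one reader.
sessions = [
    ["Cat", "Felidae", "Lion"],
    ["Cat", "Dog"],
]
pairs = [p for s in sessions for p in skipgram_pairs(s, window=1)]
# With window=1, each adjacent pair appears in both directions,
# e.g. ("Cat", "Felidae") and ("Felidae", "Cat").
```

The pair generation itself is trivially parallel per session, which is why the hard part is upstream: sessionizing the raw request data at scale.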

Presumably, @ellery already had jobs to compute and clean up clickstream data working on our infrastructure.

Nuria added a comment. · Apr 25 2017, 3:53 PM

> Presumably, @ellery already had jobs to compute and clean up clickstream data working on our infrastructure.

Code is here: https://github.com/ewulczyn/wiki-clickstream and more info here: https://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/

I am not sure that you would use a similar approach in Oozie, as the code relies on many intermediate tables. That makes total sense for a one-off, but I am not sure it makes as much sense for a recurring job: there you would rather use more machines to compute in one pass than more intermediate tables to store data for later computations.
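For illustration, the core (prev, curr, n) aggregation behind the clickstream dataset can be expressed as a single counting pass rather than a chain of intermediate tables. This is a stdlib sketch over hypothetical log rows; the real job additionally classifies referers, resolves redirects, and filters low-count pairs, and at production scale the same shape would be a Spark `groupBy`/count rather than an in-memory Counter.

```python
from collections import Counter

def aggregate_clickstream(referer_pairs, min_count=1):
    """Count (prev, curr) transitions in one pass, producing the
    (prev, curr, n) shape the published clickstream dataset uses.
    Pairs below min_count are dropped, as the real dataset does
    with its privacy threshold."""
    counts = Counter(referer_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical pageview log rows, already reduced to (referer, page).
rows = [
    ("other-search", "Cat"),
    ("Cat", "Felidae"),
    ("other-search", "Cat"),
]
result = aggregate_clickstream(rows)
# result[("other-search", "Cat")] == 2
```

The trade-off Nuria describes is visible here: the one-off version materializes each cleaning step as a table, while a recurring job would keep the pipeline as one distributed computation.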

@Nuria, I've merged this task into one that y'all already have prioritized. Please continue the conversation in T158972: Spark job to produce clickstream dataset