
Spark job to produce clickstream dataset
Closed, Resolved · Public · 8 Estimated Story Points

Description

https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

Both of these projects are cool and they have produced some really valuable datasets. With @ellery moving out of the WMF, these data items are unlikely to get updates in the near future. We should turn them into regular jobs and host them somewhere for download.

Code:

Event Timeline

There's some discussion in T163788: Implement clickstream & navigation vectors as a regular job that I merged with this task. In summary:

@Shilad offered to contribute time and code. He has experience with PySpark and Oozie.
@Halfak offered to work with @Shilad to set up an NDA so he could work on our backend.
@Nuria raised concerns about the complexity of computing clickstream stuff as a regular job, but thought that navigation vectors should be easy to compute for a given clickstream dataset.

Nuria lowered the priority of this task from High to Low. Aug 14 2017, 4:15 PM

Update on this to bring @Shilad up to speed:

@JAllemandou, thanks for the pointers! I think there's a little confusion on this, though. I volunteered to productionize Navigation Vectors (see T174796). I'm happy to also work on clickstream once this is done, but I think it will take several months to wrap up Navigation Vectors because of my teaching commitments.

Please let me know if this makes sense!

Last update: the data is vetted and, using my last patch, exactly the same as Ellery's for enwiki for month 2017-08 :)

@Shilad: My idea was that reviewing this could provide you with interesting knowledge of how/where we store data on the cluster. Let me know if you think it could be valuable.

JAllemandou set the point value for this task to 8. Sep 5 2017, 7:37 AM

@JAllemandou let's please document this dataset thoroughly once available. I will add it to the goals of next quarter for visibility.

@Shilad: I do not think navigation vectors depend on the clickstream dataset being completed, do they? I will remove it as a subtask.

@JAllemandou Let's document the dataset on https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream (cadence, availability) and announce it to analytics@ and the research list before we close this task.

Actually @Nuria, this task covers only the Spark job, not the Oozie job that will make the Spark job run regularly.
I'll modify docs once we have the other one (T175844) done.

Nuria renamed this task from productionize ClickStream dataset to Spark job to produce clickstream dataset. Sep 13 2017, 6:16 PM

Ahh, my mistake!