Spark job to produce clickstream dataset
Closed, Resolved · Public · 8 Story Points

Description

https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

Both of these projects are cool and they have produced some really valuable datasets. With @ellery moving out of the WMF, these data items are unlikely to get updates in the near future. We should turn them into regular jobs and host them somewhere for download.

Code:

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 24 2017, 4:59 PM
Nuria moved this task from Incoming to Dashiki on the Analytics board. Mar 2 2017, 5:23 PM

There's some discussion in T163788: Implement clickstream & navigation vectors as a regular job that I merged with this task. In summary:

@Shilad offered to contribute time and code. He has experience with PySpark and Oozie.
@Halfak offered to work with @Shilad to set up an NDA so he could work on our backend.
@Nuria raised concerns about the complexity of computing clickstream stuff as a regular job, but thought that navigation vectors should be easy to compute for a given clickstream dataset.

Halfak updated the task description. Apr 25 2017, 3:36 PM
Ladsgroup added a subscriber: Ladsgroup.
Milimetric triaged this task as High priority. May 8 2017, 2:48 PM
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board. May 16 2017, 12:51 PM
Nuria moved this task from Backlog (Later) to Dashiki on the Analytics board. Jul 17 2017, 5:50 PM
DarTar added a subscriber: DarTar.
Nuria lowered the priority of this task from High to Low. Aug 14 2017, 4:15 PM
JAllemandou claimed this task.

Update on this to bring @Shilad up to speed:

@JAllemandou, thanks for the pointers! I think there's a little confusion on this, though. I volunteered to productionize Navigation Vectors (see T174796). I'm happy to also work on clickstream once this is done, but I think it will take several months to wrap up Navigation Vectors because of my teaching commitments.

Please let me know if this makes sense!

Last update: the data is vetted. Using my last patch, the output for enwiki for month 2017-08 is exactly the same as Ellery's :)

@Shilad: My idea was that reviewing this could provide you with interesting knowledge of how/where we store data on the cluster. Let me know if you think it could be valuable.

JAllemandou set the point value for this task to 8. Sep 5 2017, 7:37 AM
Nuria added a comment. Sep 11 2017, 4:13 AM

@JAllemandou let's please document this dataset thoroughly once available. I will add it to the goals of next quarter for visibility.

Nuria added a comment. Sep 13 2017, 2:46 PM

@Shilad: I do not think navigation vectors depends on the clickstream dataset being completed, does it? I will remove it as a subtask.

Nuria added a comment. Sep 13 2017, 3:24 PM

@JAllemandou Let's document the dataset on https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream (cadence, availability) and announce it to analytics@ and the research list before we close this task.

Actually @Nuria, this task is only the Spark job, not the Oozie job that will make the Spark job run regularly.
I'll modify the docs once we have the other one (T175844) done.

Nuria renamed this task from productionize ClickStream dataset to Spark job to produce clickstream dataset. Sep 13 2017, 6:16 PM

Ahh, my mistake!

Nuria closed this task as Resolved. Sep 13 2017, 6:17 PM