Both of these projects are cool and have produced some really valuable datasets. With @ellery leaving the WMF, these datasets are unlikely to get updates in the near future. We should turn them into regular jobs and host the output somewhere for download.
I am happy to help with engineering on this if we can find a way to make that work. I've set up navigation-based word2vec pipelines in similar environments (PySpark, Oozie, etc.) in the past.
@Shilad: data munging for the larger wikis might require non-obvious tricks for splitting the jobs in parallel, so I would build a prototype for a small wiki first (simplewiki?). Feeding the data to word2vec is probably the easiest part of the job; computing the data you will be feeding it is not trivial.
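To make the "feeding word2vec is the easy part" point concrete, here is a rough sketch (plain Python, no Spark) of turning clickstream rows into the tiny (prev, curr) "sentences" you would hand to a word2vec trainer. The TSV layout assumed here (prev, curr, type, n) is based on the published clickstream dumps, and the thresholds are made up; the actual training call (e.g. gensim or Spark MLlib) is left out.

```python
import csv
import io

def clickstream_pairs(tsv_lines, min_count=10):
    """Turn clickstream rows into (prev, curr) token pairs for word2vec.

    Assumes the published TSV layout: prev, curr, type, n.
    Keeps only internal link transitions and drops rare pairs.
    """
    reader = csv.reader(tsv_lines, delimiter="\t")
    for prev, curr, link_type, n in reader:
        if link_type != "link":    # skip 'external', 'other', etc.
            continue
        if int(n) < min_count:     # prune noisy low-count transitions
            continue
        yield [prev, curr]         # one two-token "sentence" per transition

# Toy input in the assumed format
sample = io.StringIO(
    "Main_Page\tPython_(programming_language)\tlink\t500\n"
    "other-search\tPython_(programming_language)\texternal\t900\n"
    "Python_(programming_language)\tGuido_van_Rossum\tlink\t42\n"
    "A\tB\tlink\t3\n"
)
sentences = list(clickstream_pairs(sample))
# 'sentences' would then go to e.g. gensim's Word2Vec(sentences=...)
```

The hard part, as noted above, is producing the clickstream table itself at scale, not this reshaping step.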
Presumably, @ellery already had jobs running on our infrastructure to compute and clean up the clickstream data.
Code is here: https://github.com/ewulczyn/wiki-clickstream and more info here: https://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/
I am not sure you would use a similar approach in Oozie, as the code relies on many intermediate tables. That makes total sense for a one-off, but I am not sure it makes sense for a recurring job: there you would rather use more machines to compute than more intermediate tables to store data for later computations.
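To illustrate the trade-off in plain Python terms (the stage names and toy data are made up; the real pipeline would be Hive/Spark): the one-off code writes each stage out as a table before the next stage reads it back, while a recurring job can chain the stages in one pass and only persist the final output.

```python
def parse(rows):
    # stage 1: split raw TSV rows into fields
    return (r.split("\t") for r in rows)

def filter_links(rows):
    # stage 2: keep only internal link transitions
    return (r for r in rows if r[2] == "link")

def aggregate(rows):
    # stage 3: sum counts per (prev, curr) pair
    totals = {}
    for prev, curr, _type, n in rows:
        totals[(prev, curr)] = totals.get((prev, curr), 0) + int(n)
    return totals

raw = [
    "A\tB\tlink\t2",
    "A\tB\tlink\t3",
    "X\tB\texternal\t9",
]
# Chained in a single pass -- nothing materialized between stages:
result = aggregate(filter_links(parse(raw)))
```

Generators keep each stage lazy, so the intermediate shapes exist only in flight, which is roughly what chaining transformations in one Spark job buys you over writing each stage to a table.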