Summary of a meeting with @JAllemandou about the challenges and opportunities in adding more wikis to the existing clickstream. The good news is that the scalability of the job itself isn't much of an issue (the size of the data being processed won't change much), but some tweaks are likely needed for the Oozie coordination, along with a few generally useful improvements:
Existing code:
* Main job: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
* Scheduler / move files: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/clickstream
Scalability challenges:
[] The Oozie scheduler ([[https://github.com/wikimedia/analytics-refinery/blob/1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3/oozie/clickstream/coordinator.properties#L79|code]]) both runs the job and renames the output files to more readable filenames. The renaming step is expected to fail if more wikis are added, because Oozie doesn't handle for loops well at all. The fix is probably to move the renaming out of Oozie and into the Scala job, or perhaps into Airflow if we're being particularly ambitious about modernizing the job.
[] Does the privacy review still hold? Should there be some filters put in place for smaller wikis?
[] Right now all the data is coalesced onto a single partition and then split into individual wiki-specific files. It would be better to send each wiki to its own partition for writing. English Wikipedia is ~400 MB and in theory should be the largest, so a single worker should always be able to handle a single wiki.
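A minimal, Spark-free sketch of the two ideas above (one writer per wiki, and choosing the final filename at write time so no scheduler-side rename loop is needed). It is written in Python for brevity and is not the Scala job. The function name `write_per_wiki`, the row layout, and the filename pattern are assumptions for illustration only.

```python
import gzip
from collections import defaultdict
from pathlib import Path


def write_per_wiki(rows, out_dir, snapshot):
    """Write one gzipped TSV per wiki and return {wiki_db: path}.

    rows: iterable of (wiki_db, prev, curr, link_type, count) tuples
          (assumed layout, for illustration).
    snapshot: a label such as "2020-01", used in the assumed filename pattern.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Bucket rows by wiki: the analogue of sending each wiki to its own partition.
    by_wiki = defaultdict(list)
    for wiki_db, *fields in rows:
        by_wiki[wiki_db].append(fields)

    written = {}
    for wiki_db, wiki_rows in sorted(by_wiki.items()):
        # The readable name is chosen here, so no later rename step is required.
        path = out_dir / f"clickstream-{wiki_db}-{snapshot}.tsv.gz"
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            for fields in wiki_rows:
                fh.write("\t".join(str(f) for f in fields) + "\n")
        written[wiki_db] = path
    return written
```

In the real job the same effect would presumably come from partitioning the data by wiki before writing, which is a design choice to confirm against the existing Scala code rather than something this sketch demonstrates.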
Optional improvements:
[] Move the namespace filter to a parameter. Right now it is hard-coded that only namespace 0 is kept, but that might not work for some wikis where there are additional namespaces of interest: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala#L317
[] Allow for querying all wikis. Right now a list of wikidbs is passed, and if no wikidbs are provided there are no results. Ideally you could easily say "give me all Wikipedias" without listing them all.
[] Does redirect handling need to be more flexible to handle multi-hop redirects -- e.g., A -> B -> C -> D? It's believed that bots fix those on-wiki, but it might be worth checking.
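If multi-hop redirects do turn out to matter, resolving them amounts to following each chain to its final target while guarding against loops. The sketch below shows that idea in plain Python; it is not how the current job handles redirects, and `resolve_redirects` and `max_hops` are hypothetical names introduced here.

```python
def resolve_redirects(redirects, max_hops=10):
    """Map every redirecting title to its final, non-redirect target.

    redirects: dict of {source_title: target_title} for a single wiki.
    Chains that loop or are still unresolved after max_hops are omitted.
    """
    resolved = {}
    for start in redirects:
        seen = {start}
        current = redirects[start]
        hops = 1
        # Follow the chain while the current title is itself a redirect.
        while current in redirects and hops < max_hops:
            if current in seen:
                break  # redirect loop, give up on this chain
            seen.add(current)
            current = redirects[current]
            hops += 1
        # Keep only chains that end on a page that is not a redirect.
        if current not in redirects:
            resolved[start] = current
    return resolved
```

For the example in the item above, `resolve_redirects({"A": "B", "B": "C", "C": "D"})` maps A, B and C all to D. Comparing that output against single-hop resolution on real data would be one way to check how often chains actually occur.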