Summary of a meeting with @JAllemandou (updated now that the Oozie code has been migrated to Airflow) about the challenges and opportunities around adding more wikis to the existing clickstream. The good news is that the scalability of the job isn't much of an issue (the size of the data being processed won't change much), but some tweaks are likely needed in the job coordination (now the Airflow DAG), along with some generally useful improvements:
Existing code:
- Main job: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
- Scheduler / move files: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/clickstream/clickstream_monthly_dag.py
Scalability challenges:
- Does the privacy review still hold? Should additional filters be put in place for smaller wikis?
- Right now all the data is coalesced onto a single partition and then split into individual wiki-specific files. It would be better to route each wiki's rows to its own partition for writing (see the sketch below). English Wikipedia is ~400MB and should in theory be the largest wiki, so a single worker should always be able to handle one wiki.
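
To make the partitioning point concrete, here is a minimal Spark sketch (not the job's actual write path; the `wiki_db` column name, the helper name, and the TSV output are assumptions) of repartitioning by wiki before writing so each wiki is handled by a single task:

```scala
import org.apache.spark.sql.DataFrame

// Minimal sketch of the per-wiki partitioning idea, not the actual
// ClickstreamBuilder code. Assumes a DataFrame `clickstream` with a
// `wiki_db` column plus the clickstream fields (prev, curr, type, n);
// the helper name and output format are illustrative.
def writePerWiki(clickstream: DataFrame, outputBasePath: String): Unit = {
  clickstream
    // Hash-partition by wiki so each wiki's rows land on a single worker;
    // even the largest wiki (enwiki, ~400MB) fits comfortably in one task.
    .repartition(clickstream("wiki_db"))
    .write
    .partitionBy("wiki_db")          // one output directory per wiki
    .option("sep", "\t")
    .csv(outputBasePath)
}
```
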
Optional improvements:
- Move the namespace filter to a parameter (right now it is hard-coded that only namespace 0 is kept, which might not work for wikis where additional namespaces are of interest; see the first sketch after this list): https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala#L317
- Allow for querying all wikis. Right now a list of wiki dbs is passed, and if none are provided there are simply no results. Ideally you could say "give me all Wikipedias" without listing them all (see the second sketch after this list).
- Does redirect handling need to be more flexible to handle multi-hop redirects -- e.g., A -> B -> C -> D? Bots are believed to fix those on-wiki, but it might be worth checking (see the third sketch after this list).
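
First sketch, for the namespace parameter: a possible shape for making the filter configurable, assuming the page DataFrame exposes a `page_namespace` column (the helper name and default value are illustrative, not the job's real API).

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of a parameterised namespace filter; the column name
// `page_namespace` and the helper are assumptions. Defaulting to Seq(0)
// keeps the current behaviour while letting callers opt into additional
// namespaces per wiki.
def filterNamespaces(pages: DataFrame, namespaces: Seq[Int] = Seq(0)): DataFrame =
  pages.filter(pages("page_namespace").isin(namespaces: _*))
```
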
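Second sketch, for defaulting to "all Wikipedias" when no wiki list is passed. The table and column names here (`canonical_data.wikis`, `database_code`, `database_group`) are assumptions about what canonical wiki metadata the cluster exposes; swap in whatever wiki list is actually available.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: expand an empty wiki list into "all Wikipedias".
// Table and column names are assumptions, not a confirmed schema.
def resolveWikis(spark: SparkSession, requested: Seq[String]): Seq[String] =
  if (requested.nonEmpty) requested
  else spark
    .sql("SELECT database_code FROM canonical_data.wikis WHERE database_group = 'wikipedia'")
    .collect()
    .map(_.getString(0))
    .toSeq
```
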
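Third sketch, for multi-hop redirects, in case checking shows chains do occur: one way to collapse them is to follow the redirect map one extra hop per iteration until a cap. Column names (`source`, `target`) and the hop cap are illustrative assumptions; a real implementation would also want cycle detection.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

// Sketch of collapsing multi-hop redirect chains (A -> B -> C -> D becomes
// A -> D). Each iteration follows at most one additional hop.
def resolveRedirectChains(redirects: DataFrame, maxHops: Int = 5): DataFrame = {
  var resolved = redirects.select(col("source"), col("target"))
  for (_ <- 1 to maxHops) {
    val hop = redirects.select(
      col("source").as("hop_source"),
      col("target").as("hop_target"))
    resolved = resolved
      .join(hop, resolved("target") === hop("hop_source"), "left_outer")
      // if the current target is itself a redirect, follow it one more hop
      .select(
        resolved("source"),
        coalesce(hop("hop_target"), resolved("target")).as("target"))
  }
  resolved
}
```
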