Update clickstream code to support more languages
Open, Low, Public

Description

Summary of a meeting with @JAllemandou (updated now that the Oozie code has been migrated to Airflow) about the challenges and opportunities in adding more wikis to the existing clickstream. The good news is that the scalability of the job is not much of an issue (the size of the data being processed won't change much), but some tweaks are likely needed for the Oozie coordination, along with some generally useful improvements:

Existing code:

Scalability challenges:

  • Does the privacy review still hold? Should there be some filters put in place for smaller wikis?
  • Right now all the data is coalesced onto a single partition and then split into individual wiki-specific files. It would be better to send each wiki to its own partition for writing. English Wikipedia is ~400MB and in theory should be the largest, so a single worker should always be able to handle a single wiki.
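
A rough illustration of the "one wiki per partition" idea from the last item, in plain Python rather than the actual Spark job. The row layout `(wiki, prev, curr, type, n)`, the function names, and the `clickstream-<wiki>.tsv` file naming are all assumptions made up for this sketch, not taken from the existing code. The point is only that rows are grouped by wiki first, so each output file is produced from exactly one group instead of from one big coalesced partition.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def assign_partitions(rows):
    """Group rows by their wiki key so each wiki ends up in exactly one partition.

    Each row is assumed (hypothetically) to be (wiki, prev, curr, type, n).
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return dict(partitions)


def write_partitions(partitions, out_dir):
    """Write one TSV file per wiki; each file only needs that wiki's rows."""
    out_dir = Path(out_dir)
    written = {}
    for wiki, wiki_rows in partitions.items():
        path = out_dir / f"clickstream-{wiki}.tsv"  # hypothetical naming scheme
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
            # Drop the wiki column: it is already encoded in the file name.
            writer.writerows(row[1:] for row in wiki_rows)
        written[wiki] = path
    return written


# Tiny made-up sample covering two wikis.
rows = [
    ("enwiki", "other-search", "Main_Page", "external", 120),
    ("dewiki", "Berlin", "Hamburg", "link", 15),
    ("enwiki", "Main_Page", "Python", "link", 42),
    ("dewiki", "other-empty", "Berlin", "external", 30),
]

partitions = assign_partitions(rows)
with tempfile.TemporaryDirectory() as tmp:
    written = write_partitions(partitions, tmp)
    contents = {wiki: path.read_text(encoding="utf-8") for wiki, path in written.items()}

for wiki, text in contents.items():
    print(wiki, len(text.splitlines()), "rows")
```

In the real job the grouping step would presumably be a repartition keyed on the wiki column, so that each worker writes a single wiki; since the largest wiki is expected to be around 400MB, that should fit comfortably on one worker.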

Optional improvements:

Event Timeline

hopefully this captures everything. maybe folks can add their username next to an item in the description if they want to claim it? feel free to add others too etc.!

odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.
odimitrijevic subscribed.

This task can be considered part of the Airflow migration.

Isaac renamed this task from Update clickstream builder scala/oozie code to support more languages to Update clickstream airflow code to support more languages. (Aug 29 2022, 5:37 PM)
Isaac renamed this task from Update clickstream airflow code to support more languages to Update clickstream code to support more languages.
Isaac updated the task description.

I'm going to remove this task from the Backlog lane of the Research board given that there is no task for Research here, yet. Once prioritized, please reach out to us with a subtask and add Research back. We would be happy to look into prioritizing supporting you at that point.