Placeholder task for performance tuning work on the road to production.
The current PoC code, run as a [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark | regular sized Spark job ]], takes ~10 minutes (on average) to generate raw output data for a single wiki. Across the 300+ languages, the total run time for the data pipeline is approximately 50 hours, using ~4% (average) of the analytics cluster resources, or 25% (average) of its allocated queue.
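As a sanity check on the numbers above, the ~50 hour figure follows directly from the per-wiki runtime (a rough sketch, assuming a simple sequential schedule over ~300 wikis):

```python
# Back-of-the-envelope check of the total pipeline runtime quoted above,
# assuming ~300 wikis run sequentially at ~10 minutes each.
wikis = 300
minutes_per_wiki = 10

total_hours = wikis * minutes_per_wiki / 60
print(total_hours)  # 50.0, matching the ~50h estimate
```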
The job is composed of two steps: first, a SQL query aggregates MediaWiki and Wikidata data; then the result is collected to the driver and post-processed in memory.
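A minimal sketch of that two-step shape (using sqlite3 as a self-contained stand-in for the Spark SQL step; the table, columns, and post-processing here are made up for illustration, not the real PoC code):

```python
import sqlite3

# Step 1 stand-in: a SQL aggregation (Spark SQL in the real job; sqlite3
# here so the sketch runs anywhere). Table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (wiki_db TEXT, page_id INTEGER)")
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [("enwiki", 1), ("enwiki", 2), ("frwiki", 3)])

rows = conn.execute(
    "SELECT wiki_db, COUNT(*) AS pages FROM page GROUP BY wiki_db"
).fetchall()  # analogous to df.collect(): everything lands on the driver

# Step 2: in-memory post-processing on the driver -- the part this task
# proposes to distribute as Spark UDFs instead.
result = {wiki: pages for wiki, pages in rows}
```

It is this second, driver-local step that serializes the work and motivates most of the checklist below.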
There's a number of things we could address to improve runtime and standardise with AE best practices. These include (but are not limited to):
- [] Post-processing code should be turned into UDFs and executed on Spark (distributed). This removes any local computation, and is a requirement for running the job in Airflow DAGs.
- [] We should consider running one single long-running job vs. batches of short-lived jobs
- [] Resource allocation tuning
- [] We should split the SQL query into sub-dataframes (e.g. temp tables)
- [] We should consider better caching of intermediate computations
- [] Move to an "all wiki" approach for dataset generation, instead of a single query per wiki, so we can cache the commonswiki and wikidata tables
- [] Data format choices (Parquet vs Avro vs CSV)
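To illustrate the UDF item above, here is a hedged sketch of moving post-processing off the driver: the pure-Python core below is what would become the UDF body, with the corresponding PySpark wiring shown in comments (function and column names are assumptions, not the actual PoC code):

```python
# Hypothetical per-row post-processing step; in the real job this would be
# whatever computation currently runs on the driver after collect().
def score(title: str) -> int:
    return len(title)

# Driver-side version (current PoC pattern):
#   rows = spark.sql(query).collect()
#   result = [(r.page_id, score(r.page_title)) for r in rows]
#
# Distributed version (proposed):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import IntegerType
#   score_udf = udf(score, IntegerType())
#   result_df = spark.sql(query).withColumn("score", score_udf("page_title"))

# Demonstrate the same logic on sample rows (stand-in for collect() output):
rows = [(1, "Main_Page"), (2, "Wikipedia")]
result = [(page_id, score(title)) for page_id, title in rows]
```

The distributed version keeps all computation on the executors, which is also what makes the job schedulable as an Airflow DAG task.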
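And to illustrate the caching and "all wiki" items, a toy sketch (pure Python, hypothetical names) of why reading the shared wikidata input once beats re-reading it per wiki; in Spark this would be `df.cache()` (or a persisted temp table) on the wikidata/commonswiki dataframes:

```python
# Count how many times the expensive shared input is loaded. Each "load"
# stands in for the Spark read that per-wiki jobs currently repeat.
loads = 0

def load_wikidata():
    global loads
    loads += 1
    return {"Q42": "Douglas Adams"}  # stand-in for the wikidata result set

# Current pattern: one query per wiki, re-reading wikidata every time.
for wiki in ["enwiki", "frwiki", "dewiki"]:
    wikidata = load_wikidata()
assert loads == 3  # one expensive read per wiki

# "All wiki" pattern: load once (df.cache() in Spark), reuse for every wiki.
loads = 0
wikidata = load_wikidata()
for wiki in ["enwiki", "frwiki", "dewiki"]:
    pass  # join each wiki's data against the cached wikidata
```

With 300+ wikis, eliminating the repeated reads of the shared tables is where most of the win would come from.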
This task will require liaising with Research and Analytics Engineering (cc @Miriam and @JAllemandou), since it will have dependencies and implications across teams.