Placeholder task for performance tuning work on the road to production.
On a regular-sized Spark job, the current PoC code takes ~10 minutes (on average) to generate raw output data for a single wiki. For the 300+ languages, the total run time of the data pipeline is approximately 50 hours, using ~4% (average) of the analytics cluster resources, or 25% (average) of its allocated queue.
The job is composed of two steps: first, a SQL query aggregates MediaWiki and Wikidata data; then the result is collected to the driver and post-processed in memory.
There are a number of things we could address to improve runtime and reliability, and to standardise with Data Engineering best practices.
Regarding the Python code:
- Post-processing code should be turned into UDFs and executed on Spark (distributed), removing any local computation on the driver. This is a requirement for running the job in Airflow DAGs. WIP PR at https://github.com/mirrys/ImageMatching/pull/29. In Spark 3 we might consider using higher-order functions instead (a sketch follows below).
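As a rough illustration (not the PR's actual implementation; the column names and the post-processing logic below are hypothetical stand-ins), the driver-side post-processing could become a UDF applied on the executors:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("imagematching-postprocess").getOrCreate()

# Hypothetical post-processing step: in the PoC this runs on the driver after
# collect(); wrapping it in a UDF lets Spark apply it row by row on the executors.
@F.udf(returnType=DoubleType())
def normalize_confidence(raw_score):
    return min(max(raw_score or 0.0, 0.0), 1.0)

df = spark.sql("SELECT ...")  # placeholder for the existing aggregation query
result = df.withColumn("confidence", normalize_confidence(F.col("raw_score")))

# No collect() on the driver: write the distributed result directly.
result.write.mode("overwrite").parquet("/tmp/imagematching_output")
```

In Spark 3, array-typed columns could instead be handled with built-in higher-order functions (e.g. `transform`/`filter` in Spark SQL), which avoid the Python serialization cost of UDFs.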
I consulted with @JAllemandou about possible changes that could improve performance (runtime/memory consumption). With the caveat that we'll need more data points to gauge impact:
- We should consider running one single long-running job vs. batches of short-lived jobs
- We should split the SQL query into sub-dataframes (e.g. temp tables); the first sketch after this list illustrates this together with the next two points
- We should consider better caching of intermediate computations
- We should move to an "all wikis" approach for dataset generation (instead of running a query for each wiki) so we can cache the commonswiki and wikidata tables (which are not partitioned by wiki_db)
- We might want to experiment with converting relevant source tables/partitions from Avro to Parquet (second sketch after this list)
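A rough sketch of the sub-dataframe / caching / "all wikis" ideas, assuming placeholder table and column names (the real ones come from the PoC query):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("imagematching-allwikis").getOrCreate()

# commonswiki and wikidata data are not partitioned by wiki_db, so read and cache
# them once rather than re-scanning them for each of the 300+ per-wiki runs.
commons = spark.table("placeholder.commonswiki_source").cache()
wikidata = spark.table("placeholder.wikidata_source").cache()
commons.createOrReplaceTempView("commons_src")
wikidata.createOrReplaceTempView("wikidata_src")

# Single "all wikis" aggregation: wiki_db stays a column (and an output partition)
# instead of being a per-run query parameter.
result = spark.sql("""
    SELECT mw.wiki_db,
           mw.page_id,
           w.some_column AS wikidata_col,
           c.some_column AS commons_col
    FROM placeholder.mediawiki_source mw
    JOIN wikidata_src w ON w.page_id = mw.page_id AND w.wiki_db = mw.wiki_db
    JOIN commons_src  c ON c.join_key = w.join_key
""")

result.write.partitionBy("wiki_db").mode("overwrite").parquet("/tmp/imagematching_raw")
```

Splitting the query into named temp views also makes it easier to cache (or checkpoint) exactly the intermediate results that are reused across wikis.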
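For the Avro-to-Parquet experiment, a one-off conversion could be benchmarked along these lines (paths are placeholders; the `avro` reader requires the spark-avro package on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# Rewrite one source partition as Parquet so the columnar format can be
# benchmarked against the current Avro-backed tables.
src = spark.read.format("avro").load("/path/to/avro/source_table")
src.write.mode("overwrite").parquet("/tmp/parquet/source_table")
```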
This task will require liaising with Research and Analytics Engineering (cc @Miriam and @JAllemandou) since it will have dependencies and implications across teams.