Thanks to this Slack thread, we noticed that the Hive metastore started receiving a very large number of queries per second once we deployed the Airflow refine job:
https://grafana.wikimedia.org/goto/kk6c6tivg?orgId=1
We need to fix this.
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Update Refine-CLI job to not overwhelm metastore | analytics/refinery/source | master | +43 -32 |

| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Update main refinery jar version | repos/data-engineering/airflow-dags!1822 | joal | update_main_refine_jar | main |
| Bump analytics_test refine jar | repos/data-engineering/airflow-dags!1819 | aqu | T410378_bump_refine_to_hive_job_jar | main |
Change #1206823 had a related patch set uploaded (by Joal; author: Joal):
[analytics/refinery/source@master] Update Refine-CLI job to not overwhelm metastore
I'm pretty sure I have nailed down the problem.
At write time, the refine job inserts partitions in overwrite mode. For that to work, spark.sql.sources.partitionOverwriteMode must be set to dynamic; otherwise Spark overwrites the entire table rather than only the partitions being written.
Switching to a delete+insert approach, where we drop the partition, write the files into its folder in overwrite mode, then recreate the partition, has solved the problem.
My guess is that, to decide which partitions to drop and recreate, Spark first requests information about every existing partition in the table so it can hold that information in its own catalog format instead of leaving it only in the metastore.
Anyhow, I have a patch fixing the issue (see above). It is currently running successfully on the test cluster, and we'll deploy it tomorrow.
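For readers who want to picture the delete+insert sequence described above, here is a minimal Python sketch. It is not the code from the patch (the refinery job lives in analytics/refinery/source); the table name, location, helper names, and the quoting of partition values as strings are all assumptions made for illustration. It only builds the two DDL statements and the partition folder path, and the trailing comments suggest how they might be handed to Spark.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PartitionRewritePlan:
    """The three pieces needed for a drop, write, re-add cycle (hypothetical helper)."""
    drop_sql: str
    path: str
    add_sql: str


def build_plan(table: str, table_location: str, partition: Dict[str, str]) -> PartitionRewritePlan:
    """Build DDL and the target folder for a single partition.

    Assumes Hive-style key=value partition folders under the table location
    and quotes every partition value as a string.
    """
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    folder = "/".join(f"{key}={value}" for key, value in partition.items())
    path = f"{table_location.rstrip('/')}/{folder}"
    return PartitionRewritePlan(
        drop_sql=f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({spec})",
        path=path,
        add_sql=f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) LOCATION '{path}'",
    )


# Hypothetical table and location, for illustration only.
plan = build_plan("some_db.some_table", "hdfs:///tmp/some_table/", {"year": "2025", "month": "11"})
print(plan.drop_sql)
print(plan.path)
print(plan.add_sql)

# With a SparkSession `spark` and a DataFrame `df` holding only this partition's
# rows, the sequence could then look like this (assumed usage, not the patch):
#   spark.sql(plan.drop_sql)
#   df.drop("year", "month").write.mode("overwrite").parquet(plan.path)
#   spark.sql(plan.add_sql)
```

The point of this shape is that only the single target partition is named in each statement, so nothing in the sequence needs to list the table's other partitions.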
Change #1206823 merged by jenkins-bot:
[analytics/refinery/source@master] Update Refine-CLI job to not overwhelm metastore
aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1819
Bump analytics_test refine jar
aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1819
Bump analytics_test refine jar
joal opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1822
Update main refinery jar version
joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1822
Update main refinery jar version