
Reduce the number of refine job calls to the hive metastore
Closed, Resolved | Public

Description

Thanks to this Slack thread, we noticed that the Hive metastore started receiving a very large number of queries per second when we deployed the Airflow refine job:
https://grafana.wikimedia.org/goto/kk6c6tivg?orgId=1

We need to fix this.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Update main refinery jar version | repos/data-engineering/airflow-dags!1822 | joal | update_main_refine_jar | main
Bump analytics_test refine jar | repos/data-engineering/airflow-dags!1819 | aqu | T410378_bump_refine_to_hive_job_jar | main

Event Timeline

Change #1206823 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Update Refine-CLI job to not overwhelm metastore

https://gerrit.wikimedia.org/r/1206823

I'm pretty sure I have nailed down the problem.
At write time, the refine job inserts partitions in overwrite mode. This requires setting spark.sql.sources.partitionOverwriteMode to dynamic; otherwise Spark overwrites the entire table instead of only the partitions touched.
Switching to a delete+insert approach, where we drop the partition, write the files to its folder in overwrite mode, then recreate the partition, has solved the problem.
My guess is that, to decide which partitions to drop and recreate, Spark first requests information about all existing partitions in the table, so that it can hold that information in its own catalog format rather than relying on the metastore alone.
Anyhow, I have a patch fixing the issue (see above). It currently runs successfully on the test cluster, and we'll deploy it tomorrow.
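
For readers who want to picture the delete+insert approach described above, here is a minimal PySpark-style sketch. It is not the code from the Gerrit change: the function names, the Parquet file format, and the example table and path are all assumptions made for illustration. The only metastore calls it issues are one partition drop and one partition add for the partition being written.

```python
from typing import Dict, Tuple


def partition_clause(spec: Dict[str, object]) -> str:
    """Render a partition spec as a SQL PARTITION (k=v, ...) clause."""
    parts = []
    for key, value in spec.items():
        # String values are quoted, numeric values are left bare.
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        parts.append(f"{key}={rendered}")
    return "PARTITION (" + ", ".join(parts) + ")"


def partition_path(table_location: str, spec: Dict[str, object]) -> str:
    """Hive-style partition directory: <table_location>/k1=v1/k2=v2."""
    suffix = "/".join(f"{key}={value}" for key, value in spec.items())
    return table_location.rstrip("/") + "/" + suffix


def rewrite_statements(
    table: str, table_location: str, spec: Dict[str, object]
) -> Tuple[str, str, str]:
    """Return (drop statement, partition path, add statement) for one partition."""
    clause = partition_clause(spec)
    path = partition_path(table_location, spec)
    drop_sql = f"ALTER TABLE {table} DROP IF EXISTS {clause}"
    add_sql = f"ALTER TABLE {table} ADD IF NOT EXISTS {clause} LOCATION '{path}'"
    return drop_sql, path, add_sql


def rewrite_partition(spark, df, table: str, table_location: str,
                      spec: Dict[str, object]) -> str:
    """Hypothetical helper: drop the partition, overwrite its folder, re-add it.

    `spark` is expected to be a SparkSession and `df` a DataFrame holding
    the rows of exactly one partition (both are assumptions of this sketch).
    """
    drop_sql, path, add_sql = rewrite_statements(table, table_location, spec)
    spark.sql(drop_sql)
    # Partition columns are encoded in the directory name, so they are
    # dropped from the data files. Parquet is assumed here.
    df.drop(*spec.keys()).write.mode("overwrite").parquet(path)
    spark.sql(add_sql)
    return path
```

With a hypothetical table `event.example` located at `/data/event/example`, `rewrite_statements("event.example", "/data/event/example", {"year": 2025, "hour": 3})` yields the drop statement, the target folder, and the add statement for that single partition, without relying on dynamic partition overwrite.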

Change #1206823 merged by jenkins-bot:

[analytics/refinery/source@master] Update Refine-CLI job to not overwhelm metastore

https://gerrit.wikimedia.org/r/1206823