In T369868, we introduced extra jobs that effectively do 3 writes per hour, and the runtime is now approaching ~38 minutes per consumed hour with the resources and flags specified here, copied for convenience:
props = DagProperties(
    # DAG settings
    start_date=datetime(2023, 8, 23, 0),
    sla=timedelta(hours=6),
    conda_env=artifact("mediawiki-content-dump-0.2.0.dev0-ingest-deletes-and-moves.conda.tgz"),
    # target table
    hive_wikitext_raw_table="wmf_dumps.wikitext_raw_rc2",
    # source tables
    hive_mediawiki_page_content_change_table="event.mediawiki_page_content_change_v1",
    hive_revision_visibility_change="event.mediawiki_revision_visibility_change",
    # Spark job tuning
    driver_memory="16G",
    driver_cores="4",
    executor_memory="16G",
    executor_cores="2",
    max_executors="64",
    spark_driver_maxResultSize="8G",
    # keep shuffle partitions low so that the final file fanout is low as well
    spark_sql_shuffle_partitions="64",
    # avoid java.lang.StackOverflowError when generating MERGE predicate pushdowns
    spark_extraJavaOptions="-Xss4m",
    # disable fetching HDFS BlockLocations to avoid very long query planning times
    spark_sql_iceberg_locality_enabled="false",
)

In this task we should take a close look at the query plans, figure out where most of the time is being spent, and try to tune it out.
