
[Iceberg Migration] Extend Iceberg table maintenance mechanism to support data rewrite
Closed, ResolvedPublic

Description

In T338065: [Iceberg Migration] Implement mechanism for automatic Iceberg table maintenance, we developed a mechanism that performs typical table maintenance for Iceberg tables.

However, we did not implement support for the rewrite_data_files() Spark procedure, both to control the scope of that task and because there was no immediate need for it. Once we migrate wmf.event_sanitized we will definitely need it, as that dataset is currently the biggest offender in terms of the number of files in HDFS.

In this task, we should extend this mechanism to support rewrite_data_files().

Details

Related Changes in GitLab:
Title: Add support for Iceberg's rewrite_data_files().
Reference: repos/data-engineering/airflow-dags!849
Author: xcollazo
Source Branch: add-iceberg-maintenance
Dest Branch: main

Event Timeline

While working on T369868, we realized that we will indeed need this mechanism for Dumps 2.0, as the new approach of 3 writes per hour is too time intensive unless we use merge-on-read; tagging this task appropriately.

xcollazo changed the task status from Open to In Progress.Sep 27 2024, 8:52 PM
xcollazo claimed this task.
xcollazo moved this task from Incoming to To be discussed/To be estimated on the Dumps 2.0 board.
xcollazo moved this task from To be discussed/To be estimated to Kanban Board on the Dumps 2.0 board.
xcollazo edited projects, added Dumps 2.0 (Kanban Board); removed Dumps 2.0.
xcollazo moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.

Copypasting from MR for completeness:

In this MR, we expand the maintenance mechanism introduced in !806 to also support rewrite_data_files().

Example:

iceberg_wmf_dumps_wikitext_raw:
  datastore: iceberg
  table_name: wmf_dumps.wikitext_raw_rc2
  maintenance:
    schedule: "@daily"
    rewrite_data_files:
      enabled: True
      strategy: "sort"
      sort_order: "wiki_db ASC NULLS FIRST, revision_timestamp ASC NULLS FIRST"  # Defined due to Iceberg 1.2.1 bug
      options:
        "max-concurrent-file-group-rewrites": "40"
        "partial-progress.enabled": "true"
      spark_kwargs:
        driver_memory: "32g"
        driver_cores: "4"
        executor_memory: "20g"
        executor_cores: "2"
        pool: mutex_for_wmf_dumps_wikitext_raw
        priority_weight: 10
      spark_conf:
        "spark.dynamicAllocation.maxExecutors": "64"

Notice how the config above now includes a rewrite_data_files section. Under it, Iceberg's strategy, sort_order, options, and other procedure parameters are taken verbatim from https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files. spark_kwargs and spark_conf are provided so that Spark resources can be tuned per table, as these will vary on a per-table basis. If they are not defined, the cluster defaults apply.
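For reference, the rewrite_data_files section ultimately maps onto Iceberg's rewrite_data_files Spark procedure. A minimal Python sketch of how such a config section could be rendered into the corresponding CALL statement (the helper name and exact rendering are illustrative, not the mechanism's actual code; the catalog name spark_catalog is an assumption):

```python
# Illustrative only: render a rewrite_data_files maintenance config into the
# Spark SQL CALL statement documented at
# https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files
def render_rewrite_call(table_name: str, conf: dict) -> str:
    args = [f"table => '{table_name}'"]
    if "strategy" in conf:
        args.append(f"strategy => '{conf['strategy']}'")
    if "sort_order" in conf:
        args.append(f"sort_order => '{conf['sort_order']}'")
    options = conf.get("options", {})
    if options:
        # options is passed to the procedure as a string-to-string map
        pairs = ", ".join(f"'{k}', '{v}'" for k, v in options.items())
        args.append(f"options => map({pairs})")
    return f"CALL spark_catalog.system.rewrite_data_files({', '.join(args)})"

conf = {
    "strategy": "sort",
    "sort_order": "wiki_db ASC NULLS FIRST, revision_timestamp ASC NULLS FIRST",
    "options": {
        "max-concurrent-file-group-rewrites": "40",
        "partial-progress.enabled": "true",
    },
}
print(render_rewrite_call("wmf_dumps.wikitext_raw_rc2", conf))
```

Running the sketch against the example config yields a statement equivalent to what you could run by hand in spark-sql to compact the table.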

Via spark_kwargs you can pass any kwargs that are relevant to a SparkSqlOperator. In the example above we pass pool and priority_weight to leverage Airflow's pools for this particular table.
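Conceptually, the per-table spark_kwargs simply override the instance-wide defaults before the operator is instantiated. A hedged sketch of that merge (the default values and helper name are assumptions for illustration, not the DAG factory's actual code):

```python
# Illustrative sketch: per-table spark_kwargs override cluster-wide defaults
# before being forwarded to the SparkSqlOperator. Defaults are hypothetical.
DEFAULT_SPARK_KWARGS = {
    "driver_memory": "8g",
    "executor_memory": "8g",
    "executor_cores": "4",
}

def effective_spark_kwargs(table_kwargs: dict) -> dict:
    # Later dict entries win, so any table-level value shadows the default,
    # while untouched defaults (e.g. executor_cores below) pass through.
    return {**DEFAULT_SPARK_KWARGS, **table_kwargs}

kwargs = effective_spark_kwargs({
    "driver_memory": "32g",
    "pool": "mutex_for_wmf_dumps_wikitext_raw",
    "priority_weight": 10,
})
# kwargs can then be splatted into the operator: SparkSqlOperator(**kwargs, ...)
```

This is why omitting spark_kwargs is safe: with an empty override dict, the merge reduces to the cluster defaults.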

Mentioned in SAL (#wikimedia-operations) [2024-10-03T14:52:32Z] <xcollazo@deploy2002> Started deploy [airflow-dags/analytics@b715af7]: Deploy latest DAGs to the analytics Airflow instance. T373694. T375402

Mentioned in SAL (#wikimedia-operations) [2024-10-03T14:56:02Z] <xcollazo@deploy2002> Finished deploy [airflow-dags/analytics@b715af7]: Deploy latest DAGs to the analytics Airflow instance. T373694. T375402 (duration: 03m 33s)

Mentioned in SAL (#wikimedia-analytics) [2024-10-03T14:56:25Z] <xcollazo> Deployed latest DAGs to the analytics Airflow instance. T373694. T375402.