
Implement dataset maintenance for wmf_dumps.wikitext_raw
Closed, Duplicate (Public)

Description

We want to implement dataset maintenance configuration for wmf_dumps.wikitext_raw.

It should include the equivalent of:

CALL spark_catalog.system.remove_orphan_files(
    table => 'wmf_dumps.wikitext_raw_rc2',
    older_than => TIMESTAMP '2024-01-04 15:05:54.351',
    max_concurrent_deletes => 10
)

CALL spark_catalog.system.expire_snapshots(
    table => 'wmf_dumps.wikitext_raw_rc2',
    older_than => TIMESTAMP '2024-01-04 15:05:54.351',
    max_concurrent_deletes => 10,
    stream_results => true
)

CALL spark_catalog.system.rewrite_manifests(
    table => 'wmf_dumps.wikitext_raw_rc2'
)

There is currently no need for a rewrite_data_files() CALL since we do copy-on-write MERGEs.

The above TIMESTAMPs should be calculated at runtime. For remove_orphan_files() we can be aggressive and set it to delete anything 5 days or older. For expire_snapshots() we could set it to 90 days, but we still need to discuss this with other teams via T358366.
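As a sketch of the runtime calculation, the two cutoffs can be derived from the current time with plain datetime arithmetic and then interpolated into the CALL statements (the 90-day retention is still pending T358366, so the values here are the proposed ones, not final):

```python
from datetime import datetime, timedelta


def cutoff(days: int, now: datetime) -> str:
    """Format a timestamp literal `days` days before `now`,
    suitable for an `older_than => TIMESTAMP '...'` argument."""
    return (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")


now = datetime(2024, 1, 9, 15, 5, 54)
orphan_cutoff = cutoff(5, now)     # aggressive: 5 days for remove_orphan_files()
snapshot_cutoff = cutoff(90, now)  # proposed 90 days for expire_snapshots()
```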

Event Timeline

It's been brought to my attention that we are hoarding quite a lot of space on the Datalake, since we haven't cleaned this table for a while and it has gone through ~5 backfills:

xcollazo@stat1011:~$ hdfs dfs -count /wmf/data/wmf_dumps/wikitext_raw_rc2/data
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
      187634      4436438    170104985003630 /wmf/data/wmf_dumps/wikitext_raw_rc2/data

That's ~154TB.
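For reference, the third column of `hdfs dfs -count` output is the total byte count under the path, so the figure above falls out of a quick conversion:

```python
# `hdfs dfs -count` columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
total_bytes = 170_104_985_003_630  # CONTENT_SIZE from the output above
tib = total_bytes / 1024**4        # bytes -> TiB
# ~154.7 TiB
```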

So let's do a quick manual cleanup:

spark3-sql \
  --master yarn \
  --executor-memory 16G \
  --executor-cores 4 \
  --driver-memory 32G \
  --driver-cores 8 \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  --conf spark.sql.shuffle.partitions=1024 \
  --conf spark.driver.maxResultSize=16G \
  --conf spark.sql.autoBroadcastJoinThreshold=-1

SELECT NOW() - INTERVAL 1 DAY;
CAST(now() - INTERVAL '1 days' AS TIMESTAMP)
2024-06-03 19:23:47.234
Time taken: 12.016 seconds, Fetched 1 row(s)


CALL spark_catalog.system.remove_orphan_files(
    table => 'wmf_dumps.wikitext_raw_rc2',
    older_than => TIMESTAMP '2024-06-03 19:23:47.234',
    max_concurrent_deletes => 10
);
24/06/04 19:26:00 WARN TaskSetManager: Stage 1 contains a task of very large size (20436 KiB). The maximum recommended task size is 1000 KiB.
orphan_file_location
Time taken: 693.143 seconds


CALL spark_catalog.system.expire_snapshots(
    table => 'wmf_dumps.wikitext_raw_rc2',
    older_than => TIMESTAMP '2024-06-03 19:23:47.234',
    max_concurrent_deletes => 10,
    stream_results => true
);
deleted_data_files_count	deleted_manifest_files_count	deleted_manifest_lists_count
3770416	169446	8324
Time taken: 2316.76 seconds, Fetched 1 row(s)


CALL spark_catalog.system.rewrite_manifests(
    table => 'wmf_dumps.wikitext_raw_rc2'
);
rewritten_manifests_count	added_manifests_count
81	27
Time taken: 77.285 seconds, Fetched 1 row(s)

And now we have:

xcollazo@stat1011:~$ hdfs dfs -count /wmf/data/wmf_dumps/wikitext_raw_rc2/data
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
      187635       666455     26498168043374 /wmf/data/wmf_dumps/wikitext_raw_rc2/data

So ~3.7M files gone, and it now weighs ~24TB, as expected. 🎉
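The before/after numbers check out against the `hdfs dfs -count` output (file count is the second column, bytes the third):

```python
files_before, files_after = 4_436_438, 666_455
bytes_after = 26_498_168_043_374

files_removed = files_before - files_after   # ~3.77M files deleted
tib_after = bytes_after / 1024**4            # ~24.1 TiB remaining
```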

xcollazo renamed this task from Implement dataset maintenance config for wmf_dumps.wikitext_raw to Implement dataset maintenance for wmf_dumps.wikitext_raw.Aug 15 2024, 7:23 PM
xcollazo changed the task status from Open to In Progress.
xcollazo claimed this task.
xcollazo triaged this task as High priority.
xcollazo updated the task description.

Hoarding of TBs continues to be an issue:

xcollazo@stat1011:~$ hdfs dfs -count /wmf/data/wmf_dumps/wikitext_raw_rc2/data
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
      189466      2637013    105755359693874 /wmf/data/wmf_dumps/wikitext_raw_rc2/data

That's 96TB as of this comment. So let's prioritize this ticket.

Considering that T338065 is not prioritized right now, let's implement this via a simple DAG that runs the proper CALL commands as needed.
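A minimal sketch of what the DAG's task would execute, assuming a hypothetical helper that renders the three maintenance CALLs for a table (the operator or hook that actually submits these to Spark is left out, and `max_concurrent_deletes => 10` is carried over from the manual runs above):

```python
def maintenance_statements(table: str, orphan_cutoff: str, snapshot_cutoff: str) -> list[str]:
    """Render the Iceberg maintenance CALLs for `table`.

    `orphan_cutoff` / `snapshot_cutoff` are timestamp literals computed
    at runtime (e.g. now - 5 days and now - 90 days respectively).
    Hypothetical helper for illustration; not an existing WMF module.
    """
    return [
        f"CALL spark_catalog.system.remove_orphan_files("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{orphan_cutoff}', "
        f"max_concurrent_deletes => 10)",

        f"CALL spark_catalog.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{snapshot_cutoff}', "
        f"max_concurrent_deletes => 10, "
        f"stream_results => true)",

        f"CALL spark_catalog.system.rewrite_manifests(table => '{table}')",
    ]
```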