We want to implement dataset maintenance configuration for wmf_dumps.wikitext_raw.
It should include the equivalent of:
CALL spark_catalog.system.remove_orphan_files(
table => 'wmf_dumps.wikitext_raw_rc2',
older_than => TIMESTAMP '2024-01-04 15:05:54.351',
max_concurrent_deletes => 10
)
CALL spark_catalog.system.expire_snapshots(
table => 'wmf_dumps.wikitext_raw_rc2',
older_than => TIMESTAMP '2024-01-04 15:05:54.351',
max_concurrent_deletes => 10,
stream_results => true
)
CALL spark_catalog.system.rewrite_manifests(
table => 'wmf_dumps.wikitext_raw_rc2'
)
There is currently no need for a rewrite_data_files() CALL since we do copy-on-write MERGEs.
The TIMESTAMPs above are examples only; the actual values should be calculated at runtime. For remove_orphan_files() we can be aggressive and delete anything 5 days or older. For expire_snapshots() we could set the cutoff to 90 days, but we still need to discuss this with other teams via T358366.
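As a rough sketch of the runtime calculation, the cutoffs could be rendered as TIMESTAMP literals in whatever code submits the CALLs. The helper names below (`spark_timestamp_literal`, `maintenance_statements`) and the two constants are hypothetical and only illustrate the idea; the 90-day value is still subject to the discussion in T358366.

```python
from datetime import datetime, timedelta
from typing import List

# Retention windows taken from this task; the 90-day value is not final.
ORPHAN_FILES_MAX_AGE = timedelta(days=5)
SNAPSHOTS_MAX_AGE = timedelta(days=90)


def spark_timestamp_literal(now: datetime, max_age: timedelta) -> str:
    """Render `now - max_age` as a Spark SQL TIMESTAMP literal (millisecond precision)."""
    cutoff = now - max_age
    # %f produces 6 microsecond digits; drop the last 3 to keep milliseconds.
    return "TIMESTAMP '" + cutoff.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] + "'"


def maintenance_statements(table: str, now: datetime) -> List[str]:
    """Build the three maintenance CALLs with cutoffs relative to `now`."""
    orphan_cutoff = spark_timestamp_literal(now, ORPHAN_FILES_MAX_AGE)
    snapshot_cutoff = spark_timestamp_literal(now, SNAPSHOTS_MAX_AGE)
    return [
        "CALL spark_catalog.system.remove_orphan_files("
        f"table => '{table}', older_than => {orphan_cutoff}, "
        "max_concurrent_deletes => 10)",
        "CALL spark_catalog.system.expire_snapshots("
        f"table => '{table}', older_than => {snapshot_cutoff}, "
        "max_concurrent_deletes => 10, stream_results => true)",
        f"CALL spark_catalog.system.rewrite_manifests(table => '{table}')",
    ]
```

Each returned string could then be passed to `spark.sql(...)` in turn. Note that `now` is assumed to be expressed in the Spark session's time zone, since the literal carries no zone information.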