
Scraper: add configuration for the snapshot date
Closed, Resolved, Public

Description

Our use case is to work intensively on a single snapshot (e.g. 20230401, https://dumps.wikimedia.org/other/enterprise_html/runs/20230401/) and process it completely. Even if a new snapshot appears, we don't want to switch over to it, because mixing snapshots would produce inconsistent results. When we run this tool again in the future, we'll configure a new snapshot date.

This is a good fit for application configuration: introduce a new key in config/prod.exs. The value should be read in pipeline.ex, where filenames are constructed, and in dumps_mirror.ex, where it overrides the latest-snapshot discovery logic.
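
A minimal sketch of what this could look like, not the code from the merge request. The application name `:scrape_wiki_dump`, the key `:snapshot_date`, and the module `SnapshotConfig` are hypothetical names chosen for illustration. Only the runs URL comes from this task.

```elixir
# config/prod.exs
import Config

# Hypothetical app name and key. Pins all processing to one snapshot.
config :scrape_wiki_dump, snapshot_date: "20230401"
```

```elixir
# Hypothetical helper that pipeline.ex and dumps_mirror.ex could both call.
defmodule SnapshotConfig do
  @runs_url "https://dumps.wikimedia.org/other/enterprise_html/runs/"

  # Application.fetch_env!/2 raises if the key is missing, so a run never
  # silently falls back to whichever snapshot happens to be newest.
  def snapshot_date, do: Application.fetch_env!(:scrape_wiki_dump, :snapshot_date)

  # Base URL of the configured run, built from the pinned date.
  def run_url, do: @runs_url <> snapshot_date() <> "/"
end
```

Reading the value through one function keeps filename construction and mirror lookups on the same date. Raising on a missing key is a design choice in this sketch; falling back to discovery when the key is unset would be an equally plausible reading of the task.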

Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/40