We have a checkpointer around the main scraping step, which makes it possible to restart an interrupted job. However, a few more requirements remain before the entire workflow can efficiently reuse completed results:
- If a completed output already exists, skip the step entirely (neither read its input nor rewrite its output) and log that it was skipped (see the first sketch below).
- All expensive jobs should use checkpointing; at the moment that means `fetch_mapdata`.
- Aggregation jobs should be restarted from the beginning when interrupted. Checkpointing would be possible, but it seems like more trouble than it's worth for these shorter jobs; a simple lockfile is enough (see the lockfile sketch below).
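
A minimal sketch of the skip-if-complete check, for illustration only: `run_step`, `run_fn`, and the output-path layout are invented names, not identifiers from this repository. It also assumes each step writes its output atomically (e.g. write to a temp file, then rename), so the mere existence of the output file implies a completed run.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

def run_step(step_name: str, output_path: Path, run_fn: Callable[[], None]) -> None:
    """Run a workflow step unless its completed output already exists."""
    if output_path.exists():
        # Assumes output files are written atomically, so existence == completed.
        logger.info("%s: completed output %s exists, skipping", step_name, output_path)
        return
    run_fn()
```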
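For the aggregation jobs, one possible lockfile convention (again a sketch with invented names, `run_aggregation` and `aggregate_fn`): the lock is created before the job starts and removed only on success, so a leftover lock marks an interrupted run whose partial output must be discarded.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

def run_aggregation(output_path: Path, aggregate_fn: Callable[[Path], None]) -> None:
    """Run an aggregation job with restart-from-scratch semantics."""
    lock_path = output_path.with_name(output_path.name + ".lock")
    if lock_path.exists():
        # A leftover lock means the previous run was interrupted mid-write,
        # so any partial output cannot be trusted.
        logger.info("Found stale lock %s, restarting aggregation from scratch", lock_path)
        output_path.unlink(missing_ok=True)
    elif output_path.exists():
        logger.info("Completed output %s exists, skipping aggregation", output_path)
        return
    lock_path.touch()
    aggregate_fn(output_path)
    lock_path.unlink()  # removed only after a successful run
```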
Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/46