
Scraper: entire pipeline should be restartable
Closed, Resolved · Public

Description

We have a checkpointer around the main scraping step, which makes it possible to restart an interrupted job. However, a few more pieces are needed before the entire workflow can efficiently reuse completed results:

  • If a completed output already exists, skip the step entirely (neither read the input nor rewrite the output) and log the skip; see the first sketch below.
  • All expensive jobs should use checkpointing, in particular fetch_mapdata.
  • Aggregation jobs should be restarted from the beginning when interrupted. Checkpointing is possible but feels like more trouble than it's worth for these shorter jobs; a simple lockfile is fine (see the lockfile sketch below).
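
A minimal sketch of what the skip-if-complete behaviour could look like. The names here (run_step, the .done marker, step_fn) are illustrative only, not the pipeline's actual API:

```
import logging
from pathlib import Path

log = logging.getLogger(__name__)

def run_step(name: str, output_path: Path, step_fn) -> None:
    """Run step_fn unless a completed output for this step already exists."""
    done_marker = output_path.with_name(output_path.name + ".done")
    if done_marker.exists():
        log.info("Skipping %s: completed output %s already exists", name, output_path)
        return
    step_fn()
    # Write the marker only after the step finished, so an interrupted run
    # leaves no marker and the step is re-run on restart.
    done_marker.touch()
```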

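And a minimal sketch of the lockfile approach for the shorter aggregation jobs. Again, run_aggregation and the .lock file are hypothetical names for illustration, not the project's actual code:

```
from pathlib import Path

def run_aggregation(output_path: Path, aggregate_fn) -> None:
    """Re-run an aggregation job from scratch if a previous run was interrupted."""
    lock = output_path.with_name(output_path.name + ".lock")
    if lock.exists():
        # A stale lock means the last run was interrupted: discard any
        # partial output and start over from the beginning.
        output_path.unlink(missing_ok=True)
    lock.touch()
    aggregate_fn(output_path)
    # Reaching this point means the job completed, so clear the lock.
    lock.unlink()
```
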
Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/46