
Scraper: entire pipeline should be restartable
Closed, Resolved · Public

Description

We have a checkpointer around the main scraping step, which makes it possible to restart an interrupted job. However, a few more pieces are needed before the entire workflow can efficiently reuse completed results:

  • If a completed output already exists, skip the step entirely (neither read the input nor rewrite the output) and log the skip; see the first sketch below.
  • All expensive jobs should use checkpointing, in particular fetch_mapdata.
  • Aggregation jobs should be restarted from the beginning when interrupted. Checkpointing is possible but feels like more trouble than it's worth for these shorter jobs; a simple lockfile is fine (see the lockfile sketch below).
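
A minimal sketch of what the skip-if-complete behaviour could look like. The names here (run_step, the .done marker, step_fn) are illustrative only, not the pipeline's actual API:

```
import logging
from pathlib import Path

log = logging.getLogger(__name__)

def run_step(name: str, output_path: Path, step_fn) -> None:
    """Run step_fn unless a completed output for this step already exists."""
    done_marker = output_path.with_name(output_path.name + ".done")
    if done_marker.exists():
        log.info("Skipping %s: completed output %s already exists", name, output_path)
        return
    step_fn()
    # Write the marker only after the step finished, so an interrupted run
    # leaves no marker and the step is re-run on restart.
    done_marker.touch()
```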

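And a minimal sketch of the lockfile approach for the shorter aggregation jobs. Again, run_aggregation and the .lock file are hypothetical names for illustration, not the project's actual code:

```
from pathlib import Path

def run_aggregation(output_path: Path, aggregate_fn) -> None:
    """Re-run an aggregation job from scratch if a previous run was interrupted."""
    lock = output_path.with_name(output_path.name + ".lock")
    if lock.exists():
        # A stale lock means the last run was interrupted: discard any
        # partial output and start over from the beginning.
        output_path.unlink(missing_ok=True)
    lock.touch()
    aggregate_fn(output_path)
    # Reaching this point means the job completed, so clear the lock.
    lock.unlink()
```
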
Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/46