The scraper needs to be interruptible and should resume processing at, or near, the point where it was stopped.
Background: We previously had logic in "Checkpointer" that left lockfiles next to the output files, recording the position in the input streams; this made it possible to resume an interrupted scraper job. Once we stopped writing outputs to the filesystem and switched to Enterprise API input streams, this logic had to be abandoned.
Implementation:
- Choose where to persist resume information (a state file under /var/run, for example).
- Resume the job at the same wiki and chunk at which it was stopped.
- Must also maintain the page_id list so that the scraper correctly rejects additional revisions of known pages, even after resume.
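The steps above could be covered by a small checkpoint store. A minimal sketch, assuming JSON-on-disk persistence is acceptable and that the scraper can fast-forward the Enterprise API stream to a given wiki and chunk (the class name `ResumeState`, the file layout, and the field names are all illustrative, not the actual implementation):

```python
import json
import os
import tempfile


class ResumeState:
    """Hypothetical checkpoint store: persists the current wiki, the chunk
    index, and the set of page_ids already emitted, so a restarted scraper
    can skip ahead and keep rejecting extra revisions of known pages."""

    def __init__(self, path):
        self.path = path        # e.g. a state file under /var/run
        self.wiki = None
        self.chunk = 0
        self.page_ids = set()

    def load(self):
        """Restore state from disk; a missing file means a fresh start."""
        try:
            with open(self.path) as fh:
                data = json.load(fh)
        except FileNotFoundError:
            return False
        self.wiki = data["wiki"]
        self.chunk = data["chunk"]
        self.page_ids = set(data["page_ids"])
        return True

    def save(self):
        """Write state atomically (temp file + rename) so an interrupt
        mid-write never leaves a truncated checkpoint behind."""
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as fh:
            json.dump({"wiki": self.wiki,
                       "chunk": self.chunk,
                       "page_ids": sorted(self.page_ids)}, fh)
        os.replace(tmp, self.path)

    def should_process(self, page_id):
        """Reject additional revisions of pages we have already seen,
        which keeps working across a resume because page_ids persist."""
        if page_id in self.page_ids:
            return False
        self.page_ids.add(page_id)
        return True
```

The scraper would call `save()` after each completed chunk and, on startup, call `load()` and seek the input stream to `(wiki, chunk)`. One open question with this approach: the page_id set grows with the wiki, so for very large wikis a more compact representation than a JSON list may be needed.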