
Implement resumability for the scraper
Closed, Declined (Public)

Description

The scraper needs to be interruptible and should resume processing near the point where it was stopped.

Background: We previously had logic in "Checkpointer" which left lockfiles next to the output files, including offsets into the input streams, which made it possible to resume an interrupted scraper job. Once we stopped writing outputs to the filesystem and switched to Enterprise API input streams, this logic had to be abandoned.

Implementation:

  • Choose how to persist resume information. It could be stored in /var/run, for example.
  • Resume the job at the same wiki and chunk where it was stopped.
  • The page_id list must also be persisted, so that the scraper correctly rejects additional revisions of known pages even after a resume.
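A minimal sketch of what the persisted resume state could look like, assuming a Python scraper; the checkpoint path, function names, and JSON layout are all hypothetical, not an existing interface. It covers the three bullets above: a persistence location, the wiki/chunk position, and the seen page_id set.

```python
import json
import os
import tempfile

# Hypothetical location; /var/run is one option mentioned above.
CHECKPOINT_PATH = "/var/run/scraper/checkpoint.json"

def save_checkpoint(path, wiki, chunk, seen_page_ids):
    """Persist resume state: current wiki, chunk index, and the
    set of page_ids already emitted."""
    state = {
        "wiki": wiki,
        "chunk": chunk,
        "seen_page_ids": sorted(seen_page_ids),
    }
    # Write to a temp file and rename it into place, so a crash
    # mid-write never leaves a truncated checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return (wiki, chunk, seen_page_ids), or None if there is
    no checkpoint and the job should start from scratch."""
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        return None
    return state["wiki"], state["chunk"], set(state["seen_page_ids"])
```

On startup the scraper would call load_checkpoint, skip ahead to the recorded wiki and chunk, and seed its duplicate-rejection set from seen_page_ids; note the page_id set grows with the largest wikis, so a real implementation might need a more compact store than a flat JSON list.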

Related Objects

Event Timeline

awight removed awight as the assignee of this task.

On second thought, let's drop this for now. The largest wikis take a few hours of processing and can be rerun in the worst case. Development effort is better spent on not crashing than on complex machinery to recover from a crash.