
Scraper: page-summary .gz can become corrupted after crash
Closed, Resolved, Public

Description

The checkpointing is incompatible with the gzip format. Either something in the output is not finalized, or maybe we're concatenating on top of a truncated file without ending a block safely.
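A minimal sketch of the second hypothesis, assuming the crash cut off the 8-byte gzip trailer (CRC32 and uncompressed size) before a later checkpoint appended a fresh gzip member. The file names and JSON payloads are made up for illustration, and this is not the scraper's actual code path:

```shell
# Hypothetical reproduction: clean concatenation vs. appending after a truncated member.
tmp=$(mktemp -d)
printf '{"page":1}\n' | gzip -c > "$tmp/a.gz"
printf '{"page":2}\n' | gzip -c > "$tmp/b.gz"

# Two complete members back to back form a valid multi-member gzip file.
cat "$tmp/a.gz" "$tmp/b.gz" > "$tmp/ok.gz"

# Simulate a crash: drop the last 8 bytes (the trailer) of the first member,
# then append a complete second member on top of it.
size=$(wc -c < "$tmp/a.gz")
head -c $((size - 8)) "$tmp/a.gz" > "$tmp/bad.gz"
cat "$tmp/b.gz" >> "$tmp/bad.gz"

# gzip -t checks integrity without writing any decompressed output.
if gzip -t "$tmp/ok.gz" 2>/dev/null; then ok_status=valid; else ok_status=corrupt; fi
if gzip -t "$tmp/bad.gz" 2>/dev/null; then bad_status=valid; else bad_status=corrupt; fi
echo "ok.gz: $ok_status, bad.gz: $bad_status"

rm -rf "$tmp"
```

If this matches the real failure, the decompressor reads the start of the second member's header where it expects the first member's checksum, so the whole file is rejected even though the appended data is intact.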

Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/66

Event Timeline

awight renamed this task from "Scraper: invalid JSON error for many wikis" to "Scraper: page-summary .gz can become corrupted after crash". May 25 2023, 8:11 AM
awight updated the task description.

Find corrupted outputs:

(for i in reports/*.gz; do zcat "$i" > /dev/null || echo "$i"; done) > bad.txt
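A variant of the loop above, sketched as a reusable function. It uses `gzip -t`, which tests integrity without decompressing to stdout, and skips the unexpanded glob when the directory has no .gz files. The function name `find_corrupt_gz` is hypothetical:

```shell
# Print the path of every .gz file in the given directory that fails gzip's integrity test.
find_corrupt_gz() {
  dir=$1
  for f in "$dir"/*.gz; do
    # If nothing matched, the pattern is left unexpanded; skip it.
    [ -e "$f" ] || continue
    gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
  done
}
```

Usage would look like `find_corrupt_gz reports > bad.txt`.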