
Scraper: page-summary .gz can become corrupted after crash
Open, Needs Triage, Public

Description

The checkpointing appears to be incompatible with the gzip format. Either something in the output is not finalized before the crash, or on resume we are concatenating on top of a truncated file whose last block was never ended safely.
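To illustrate the second hypothesis, here is a minimal sketch (not the scraper's actual code) of what happens when a new gzip member is appended to a file whose previous member lost its trailer. The file names, the `check_gz` helper, and the 8-byte cut are assumptions made for the demo; a real crash could truncate the file at any offset.

```shell
#!/bin/sh
# Sketch only: file names and the 8-byte cut are illustrative assumptions,
# not taken from the scraper.

# Print "ok" if the whole file passes gzip's integrity test, else "corrupt".
check_gz() {
    if gzip -t "$1" 2>/dev/null; then echo ok; else echo corrupt; fi
}

demo_dir=$(mktemp -d)

# Two checkpoints, each written as a complete gzip member.
printf 'first batch\n'  | gzip > "$demo_dir/a.gz"
printf 'second batch\n' | gzip > "$demo_dir/b.gz"

# Clean resume: complete members concatenate into a valid gzip file.
cat "$demo_dir/a.gz" "$demo_dir/b.gz" > "$demo_dir/clean.gz"

# Simulated crash: drop the 8-byte trailer (CRC32 + length) of the first
# member, then append the next checkpoint on top of the truncated file.
size=$(wc -c < "$demo_dir/a.gz")
head -c "$((size - 8))" "$demo_dir/a.gz" > "$demo_dir/truncated.gz"
cat "$demo_dir/truncated.gz" "$demo_dir/b.gz" > "$demo_dir/resumed.gz"

check_gz "$demo_dir/clean.gz"
check_gz "$demo_dir/resumed.gz"
```

If this matches what the scraper does, the first check should pass and the second should fail, which would suggest truncating back to the last complete member before appending on resume.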

Event Timeline

awight renamed this task from Scraper: invalid JSON error for many wikis to Scraper: page-summary .gz can become corrupted after crash. Thu, May 25, 8:11 AM
awight updated the task description.

Find corrupted outputs:

(for i in reports/*.gz; do zcat "$i" > /dev/null || echo "$i"; done) > bad.txt