For large wikis, we generate page content output files for manageable page ranges, so that we can easily resume where we left off in case of problems, or redo bad parts. To do this, we need the corresponding stub file covering each specific range. To date we have been generating these page range stub files, kept in a temp directory, by zcatting the input stub file, which might be several GB, reading from that stream, and writing an output file for the right page range, repeating this for every page range. That can mean rereading the same large file hundreds of times for a wiki such as enwiki.
Instead, we can make a list of the page ranges in order, read the input stream once, and write all the output files one at a time, in order.
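The control flow might look roughly like the Python sketch below. This is only an illustration of the single-pass idea: the function name, arguments, and regex are assumptions, not the actual writeuptopageid interface, and XML header/footer handling is omitted.

```python
import gzip
import re

# Sketch of the single-pass split. 'ranges' must be sorted, non-overlapping
# (first_id, last_id) pairs, e.g. [(1, 100000), (100001, 200000)].
# Assumes well-formed stubs where every <page> carries a page <id>.
PAGE_ID_RE = re.compile(r"<id>(\d+)</id>")

def split_stubs_one_pass(stub_path, ranges, outfile_for_range):
    range_iter = iter(ranges)
    current = next(range_iter)                    # the range being filled now
    out = open(outfile_for_range(current), "wt")
    in_page, page_lines, page_id = False, [], None

    with gzip.open(stub_path, "rt") as stubs:
        for line in stubs:
            if "<page>" in line:
                in_page, page_lines, page_id = True, [], None
            if in_page:
                page_lines.append(line)
                if page_id is None:
                    match = PAGE_ID_RE.search(line)
                    if match:
                        # first <id> inside a page element is the page id
                        page_id = int(match.group(1))
            if in_page and "</page>" in line:
                # advance past any ranges this page id has overshot, closing
                # finished output files and opening the next one as we go
                while current and page_id > current[1]:
                    out.close()
                    current = next(range_iter, None)
                    if current:
                        out = open(outfile_for_range(current), "wt")
                if current is None:
                    break                         # ran out of requested ranges
                if page_id >= current[0]:
                    out.writelines(page_lines)
                in_page = False
    out.close()
```

writeuptopageid itself is a C tool; the point of the sketch is only that each range's file is written as the stream passes it, so the big stub file is read exactly once.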
- fix writeuptopageid to support writing multiple page-range output files in a single pass over the input
- fix the python scripts to use the new writeuptopageid, and re-use existing temp files on retries (this implies the temp filenames include the dump run datestring; see the sketch after this list)
- package and deploy writeuptopageid
- deploy the updated python scripts
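For the retry behaviour, a minimal sketch of how datestring-bearing temp filenames allow re-use; the naming scheme and helper names here are assumptions, not the scripts' actual layout.

```python
import os

def stub_temp_filename(temp_dir, wiki, date_string, first_id, last_id):
    # Hypothetical naming scheme: embedding the dump run datestring means a
    # retry of the same run can recognize files it already wrote, while
    # leftovers from an older run never match and get regenerated.
    return os.path.join(
        temp_dir, f"{wiki}-{date_string}-stub-p{first_id}p{last_id}.xml.gz")

def ranges_still_needed(temp_dir, wiki, date_string, ranges):
    # On a retry, only the page ranges whose temp stub file is missing
    # need to be produced again.
    return [(first, last) for first, last in ranges
            if not os.path.exists(
                stub_temp_filename(temp_dir, wiki, date_string, first, last))]
```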