
Be smart about creation of temp stub files for the corresponding page output content
Closed, ResolvedPublic

Description

For large wikis, we generate page content output files for manageable page ranges, so that we can easily resume where we left off in case of problems, or redo bad parts. To do this, we need the corresponding stub file covering each specific range. To date we have been generating these page range stub files, kept in a temp directory, by zcatting the input stub file, which might be several GB, reading from this stream, and writing an output file for the right page range; this is repeated for every page range. That can mean rereading the same large file hundreds of times for e.g. enwiki.

Instead, we can make an ordered list of the page ranges, read the input stream once, and write the output files one after another.
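A minimal sketch of that single pass in python (hypothetical names throughout; the production logic lives in the C tool writeuptopageid and the python dump scripts): buffer one <page> element at a time, pick up its page id, and route it to the output file for whichever range it falls in. Ranges with no matching pages simply produce no file here.

```
import gzip
import re

def split_stubs(stub_path, ranges, out_template):
    """Single pass over a gzipped stub file, writing one gzipped output
    file per (start_id, end_id) page range.  Assumes ranges are sorted
    and non-overlapping and that page ids in the stub are ascending.
    XML header/footer handling is omitted for brevity."""
    idx = 0        # index of the range we are currently filling
    out = None     # open handle for the current range's output file
    page_buf = []  # lines of the <page> element being accumulated
    page_id = None
    with gzip.open(stub_path, "rt", encoding="utf-8") as stubs:
        for line in stubs:
            if "<page>" in line:
                page_buf, page_id = [line], None
                continue
            if not page_buf:
                continue  # header/footer line outside any <page>
            page_buf.append(line)
            if page_id is None:
                match = re.search(r"<id>(\d+)</id>", line)
                if match:  # first <id> after <page> is the page id
                    page_id = int(match.group(1))
            if "</page>" in line:
                # close out any ranges this page id has already passed
                while idx < len(ranges) and page_id > ranges[idx][1]:
                    if out:
                        out.close()
                        out = None
                    idx += 1
                if idx == len(ranges):
                    break  # all requested ranges are done
                start, end = ranges[idx]
                if page_id >= start:
                    if out is None:
                        out = gzip.open(
                            out_template.format(start=start, end=end),
                            "wt", encoding="utf-8")
                    out.writelines(page_buf)
                page_buf = []
    if out:
        out.close()
```

This is called once per input stub instead of once per range, e.g. split_stubs("stubs.xml.gz", [(1, 1000), (1001, 2000)], "stub-p{start}p{end}.xml.gz").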

  • fix writeuptopageid to support this
  • fix python scripts to use the new writeuptopageid, and reuse existing temp files on retries (this implies the temp filenames must include the dump run datestring; see the sketch after this list)
  • package and deploy writeuptopageid
  • deploy the updated python scripts
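For the retry behavior in the second item, a minimal sketch of the idea (the filename pattern here is illustrative, not the scripts' actual one): because the temp filenames embed the dump run datestring, a rerun of the same dump can tell its own files apart from a previous run's and skip ranges that already exist.

```
import os

def temp_stub_path(tempdir, wiki, date, start, end):
    # illustrative pattern; embedding the run datestring means files
    # left over from a different run are never mistaken for ours
    return os.path.join(tempdir, f"{wiki}-{date}-stub-p{start}p{end}.xml.gz")

def ranges_still_needed(tempdir, wiki, date, ranges):
    """On a retry, only regenerate the page ranges whose temp stub
    file is missing; the rest are reused as-is."""
    return [(start, end) for (start, end) in ranges
            if not os.path.exists(
                temp_stub_path(tempdir, wiki, date, start, end))]
```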

Event Timeline

ArielGlenn created this task.

Change 436511 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps/mwbzutils@master] allow writeuptopageid to write multiple output files

https://gerrit.wikimedia.org/r/436511

Change 436956 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] generate temp stubs for page ranges serially from same input stub file

https://gerrit.wikimedia.org/r/436956

Here's where we are on this:

I have done a variety of timing tests to make sure that the changes to writeuptopageid do not slow down the job: with/without zcat from stdin, with/without gzip to stdout, writing only to /dev/null, on local disks, and on an NFS filesystem. All of these tests look good.

Summary: because we can now pass all args to writeuptopageid directly instead of using a pipe for stdin/stdout, cpu time on an unloaded system has dropped. However, real time has gone up by a small amount (1.5 minutes for a 3GB compressed stub file as input). That is to be expected, since other cores will no longer be used for the other parts of the pipeline. This is fine for us, as we manage parallelism ourselves in the scripts.
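The difference between the two invocation styles, roughly (the new-style flag names below are assumptions for illustration, not the tool's documented interface):

```
import shlex
import subprocess

def run_old_style(stub_gz, start, end, out_gz):
    # three processes, two pipes: zcat and gzip run on other cores,
    # so real time is lower but total cpu time is higher
    cmd = (f"zcat {shlex.quote(stub_gz)} | writeuptopageid {start} {end}"
           f" | gzip > {shlex.quote(out_gz)}")
    subprocess.run(cmd, shell=True, check=True)

def run_new_style(stub_gz, outdir, fspecs):
    # one process does its own (de)compression: less cpu overall,
    # slightly more real time on an unloaded system
    subprocess.run(
        ["writeuptopageid", "--inpath", stub_gz,
         "--odir", outdir, "--fspecs", fspecs],
        check=True)
```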

I checked to see if there is much difference in performance between -O2 and -O3 in the Makefile; the difference is negligible. This too is to be expected, since the libraries that do most of the work (e.g. zlib) remain untouched.

I may convert some of the other C utils to read/write files passed as command-line args as well.

The bz2 additions to the code are not needed for our production dump scripts, but folks processing dump XML files may find them useful, particularly the ability to write multiple output files in one run.

Change 436511 merged by ArielGlenn:
[operations/dumps/mwbzutils@master] allow writeuptopageid to write multiple output files

https://gerrit.wikimedia.org/r/436511

The package for writeuptopageid (mwbzutils) is ready, and the files are sitting in a home dir on install1002 awaiting deployment. The binary extracted from the package has been tested with both the old and new argument styles.

Change 442828 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] generate multiple temp stub files at once for larger wikis

https://gerrit.wikimedia.org/r/442828

Change 436956 merged by ArielGlenn:
[operations/dumps@master] generate temp stubs for page ranges serially from same input stub file

https://gerrit.wikimedia.org/r/436956

Change 442828 merged by ArielGlenn:
[operations/dumps@master] generate multiple temp stub files at once for larger wikis

https://gerrit.wikimedia.org/r/442828

Mentioned in SAL (#wikimedia-operations) [2018-06-30T21:32:44Z] <ariel@deploy1001> Started deploy [dumps/dumps@a1bc510]: generate temp stubs smarter, T196063

Mentioned in SAL (#wikimedia-operations) [2018-06-30T21:35:11Z] <ariel@deploy1001> Finished deploy [dumps/dumps@a1bc510]: generate temp stubs smarter, T196063 (duration: 02m 27s)

Vvjjkkii renamed this task from Be smart about creation of temp stub files for the corresponding page output content to 4wbaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii removed ArielGlenn as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: gerritbot.
ArielGlenn renamed this task from 4wbaaaaaaa to Be smart about creation of temp stub files for the corresponding page output content. Jul 1 2018, 8:05 AM
ArielGlenn claimed this task.
ArielGlenn lowered the priority of this task from High to Medium.
ArielGlenn updated the task description.
ArielGlenn added a subscriber: gerritbot.

While I was in here, I updated the scripts to generate temp stubs in parallel for larger wikis. That's deployed too and running.
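One plausible shape for that parallelism, reusing the split_stubs sketch from above (an assumed structure, not the deployed code): larger wikis already have their stubs split into multiple part files, so each worker can make its own single pass over a different input file, and the parallelism adds no rereads of any large file.

```
from concurrent.futures import ProcessPoolExecutor

def generate_all_temp_stubs(parts, out_template, max_workers=4):
    """parts: list of (stub_part_path, ranges_for_that_part) pairs.
    Each worker makes one single pass over its own input file."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(split_stubs, path, ranges, out_template)
                   for path, ranges in parts]
        for future in futures:
            future.result()  # surface any worker failure
```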

A dump run has completed since this was deployed, and it works fine. Closing.