For large wikis, we generate page content output files for manageable page ranges, so that we can easily resume where we left off in case of problems, or redo bad parts. To do this, we need the corresponding stub file covering each specific range. To date we have been generating these page range stub files, kept in a temp directory, by zcatting the input stub file, which might be several GB, reading from that stream, and writing an output file for the right page range, repeating this for every page range. That can mean rereading the same large file hundreds of times for a wiki such as enwiki.
Instead, we can make a list of the page ranges in order, read the input stream once, and write all the output files one at a time, in order.
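The control flow might look roughly like the Python sketch below. This is only an illustration of the single-pass idea: the function name, arguments, and regex are assumptions, not the actual writeuptopageid interface, and XML header/footer handling is omitted.

```python
import gzip
import re

# Sketch of the single-pass split. 'ranges' must be sorted, non-overlapping
# (first_id, last_id) pairs, e.g. [(1, 100000), (100001, 200000)].
# Assumes well-formed stubs where every <page> carries a page <id>.
PAGE_ID_RE = re.compile(r"<id>(\d+)</id>")

def split_stubs_one_pass(stub_path, ranges, outfile_for_range):
    range_iter = iter(ranges)
    current = next(range_iter)                    # the range being filled now
    out = open(outfile_for_range(current), "wt")
    in_page, page_lines, page_id = False, [], None

    with gzip.open(stub_path, "rt") as stubs:
        for line in stubs:
            if "<page>" in line:
                in_page, page_lines, page_id = True, [], None
            if in_page:
                page_lines.append(line)
                if page_id is None:
                    match = PAGE_ID_RE.search(line)
                    if match:
                        # first <id> inside a page element is the page id
                        page_id = int(match.group(1))
            if in_page and "</page>" in line:
                # advance past any ranges this page id has overshot, closing
                # finished output files and opening the next one as we go
                while current and page_id > current[1]:
                    out.close()
                    current = next(range_iter, None)
                    if current:
                        out = open(outfile_for_range(current), "wt")
                if current is None:
                    break                         # ran out of requested ranges
                if page_id >= current[0]:
                    out.writelines(page_lines)
                in_page = False
    out.close()
```

writeuptopageid itself is a C tool; the point of the sketch is only that each range's file is written as the stream passes it, so the big stub file is read exactly once.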
- fix writeuptopageid to support writing multiple page-range output files in a single pass over the input
- fix the python scripts to use the new writeuptopageid, and re-use existing temp files on retries (this implies the temp filenames include the dump run datestring; see the sketch after this list)
- package and deploy writeuptopageid
- deploy the updated python scripts
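For the retry behaviour, a minimal sketch of how datestring-bearing temp filenames allow re-use; the naming scheme and helper names here are assumptions, not the scripts' actual layout.

```python
import os

def stub_temp_filename(temp_dir, wiki, date_string, first_id, last_id):
    # Hypothetical naming scheme: embedding the dump run datestring means a
    # retry of the same run can recognize files it already wrote, while
    # leftovers from an older run never match and get regenerated.
    return os.path.join(
        temp_dir, f"{wiki}-{date_string}-stub-p{first_id}p{last_id}.xml.gz")

def ranges_still_needed(temp_dir, wiki, date_string, ranges):
    # On a retry, only the page ranges whose temp stub file is missing
    # need to be produced again.
    return [(first, last) for first, last in ranges
            if not os.path.exists(
                stub_temp_filename(temp_dir, wiki, date_string, first, last))]
```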