
wikidata weekly dumps take too long to complete
Closed, ResolvedPublic

Description

With the addition of the nt dumps using serdi, the weekly run completes on Saturday at around 9:30 pm UTC. That means there's virtually no time when these dumps are not running: no time for recovery if something has to be rerun, no time for maintenance of the host they run on (for example, kernel security updates plus the accompanying reboots), and so on. Note that this is before the lexeme dump was added.

What can we do about this?

Event Timeline

ArielGlenn triaged this task as Medium priority.Oct 9 2018, 1:15 PM
ArielGlenn created this task.
ArielGlenn added subscribers: Smalyshev, hoo.

Adding @hoo and @Smalyshev in hopes that they will have some good ideas.

Well, the dumps are big, so not sure whether it's possible to do much about it... Maybe we could reduce frequency to bi-weekly or something?

Also, the longest operation right now seems to be the recompression (gz -> bz2) of the .nt dump. It takes over 1.5 days, judging by timestamps (unfortunately, I can't tell from the timestamps how long ttl -> nt takes). I wonder if there's a way to generate the .gz and .bz2 in parallel. .bz2 files can be composed from chunks just like .gz, so maybe there's a way to do it?
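One way to sidestep the separate recompression pass entirely would be to write both compressed files during a single pass over the uncompressed stream, e.g. with tee and process substitution. A minimal sketch, assuming the dump comes out as one stream; the dump command and file names below are placeholders, not the actual production invocation:

# sketch only: write .gz and .bz2 in one pass over the dump output;
# some_dump_command and the file names are placeholders
some_dump_command \
    | tee >(gzip > wikidata-all-BETA.ttl.gz) \
    | bzip2 > wikidata-all-BETA.ttl.bz2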

Other options:

  • Re-do the performance audit of the dump generator; we last did this 2+ years ago IIRC and there may be some potential for improvement.
  • Remove (or reduce the frequency of) the .ttl dump - it duplicates the .nt one, and the latter is superior in terms of processing, though larger. .ttl is much more readable, etc., but I'm not sure how much readability matters in a 70G dump.
  • Play with parallelism/sharding/etc. - maybe there are some things there that we can tweak to make it run faster.

I've enabled the use of lbzip2 for the xml/sql dumps starting with the Oct 20th run; we could consider using it for the wikidata weeklies' recompression into bz2 files, with, say, four threads (half the number of shards). As far as I can tell its output is format-compatible with that produced by bzip2, though not byte-identical. What do folks think?

ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20181015$ date; zcat wikidata-20181015-all-BETA.ttl.gz | lbzip2 -n 4 > /mnt/dumpsdata/temp/ariel/wikidata-20181015-all-BETA.ttl.bz2; date 
Wed Oct 17 12:11:32 UTC 2018
Wed Oct 17 13:25:23 UTC 2018
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20181015$ ls -lh wikidata-20181015-all-BETA.ttl.gz /mnt/dumpsdata/temp/ariel/wikidata-20181015-all-BETA.ttl.bz2
-rw-rw-r-- 1 ariel    wikidev  37G Oct 17 13:25 /mnt/dumpsdata/temp/ariel/wikidata-20181015-all-BETA.ttl.bz2
-rw-r--r-- 1 dumpsgen dumpsgen 44G Oct 16 15:05 wikidata-20181015-all-BETA.ttl.gz

That was run on the same host and the same filesystem while other parts of the wikidata weeklies were running, so the estimate of 75 minutes to recompress 44GB (gz) down to 37GB (bz2) is likely pretty good.

> I've enabled the use of lbzip2 for the xml/sql dumps starting with the Oct 20th run; we could consider using it for the wikidata weeklies' recompression into bz2 files, with, say, four threads (half the number of shards). As far as I can tell its output is format-compatible with that produced by bzip2, though not byte-identical. What do folks think?

Have you tested importing that via php (and/or anything else that uses the libzip2 compat stuff)?

I just thought about this a bit and we might want to split the dumping process up into two steps:

  1. Generating the dump via the maintenance script (sharded) and concatenating the shards
  2. Re-Compress/ format conversion (ttl <> nt)/ …

This way we could run 1) serially (or with limited parallelism) and 2) in parallel, as those steps are each single-threaded anyway.

I think the bzip2 format is standardized, so all well-behaved tools should be interoperable. I'd try it with lbzip2 and check the recent dumps; if they work with standard tools then I think it should be fine.
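For reference, a check along those lines with the reference tools might look roughly like this (file names are placeholders for whatever the current run produced):

# integrity-test the lbzip2 output with stock bzip2 ...
bzip2 -tv wikidata-all-BETA.ttl.bz2

# ... and confirm the decompressed content matches the original .gz dump
zcat wikidata-all-BETA.ttl.gz | md5sum
bzcat wikidata-all-BETA.ttl.bz2 | md5sum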

> I just thought about this a bit and we might want to split the dumping process up into two steps:

I thought that's what is happening now? Or am I missing something?

>> I just thought about this a bit and we might want to split the dumping process up into two steps:
>
> I thought that's what is happening now? Or am I missing something?

Well, currently we invoke one bash script (via cron) that does all the work (the actual dumping in subshells, though). I'm thinking about splitting this so that a new cron might look like this: dumpAllTtl && recompressTtl & convertTtlToNt &… I hope that makes it clearer.
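Spelled out as a small wrapper rather than a cron one-liner, the intent would be roughly the following; dumpAllTtl, recompressTtl and convertTtlToNt are the placeholder names from the comment above, not real scripts:

#!/bin/bash
# sketch only: placeholder names, not the actual dump scripts
set -e

dumpAllTtl                 # phase 1: sharded dump plus concatenation of shards

recompressTtl &            # phase 2: independent single-threaded jobs,
convertTtlToNt &           # run in parallel once the ttl dump exists
wait                       # don't return until both background jobs finish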

> ...
>
> Have you tested importing that via php (and/or anything else that uses the libzip2 compat stuff)?

script:

[ariel@bigtrouble ~]$ more catbz2file.php
<?php
/*
 * Uncompress a bzip2-compressed file and write it to stdout
 */
$filename = 'compress.bzip2://' . '/mnt/dumpsdata/temp/ariel/wikidata-20181015-all-BETA.ttl.bz2';
// $filename = 'compress.bzip2://' . '/home/ariel/wmf/dumps/lbzip2/commonswiki-20180520-stub-meta-current4.xml.bz2';
$betattl = fopen($filename, "r");
while (($line = fgets($betattl, 4096)) !== false) {
    echo $line;
}
fclose($betattl);

md5sum of original (gz) file:

ariel@dumpsdata1002:/data/otherdumps/wikibase/wikidatawiki/20181015$ zcat wikidata-20181015-all-BETA.ttl.gz | md5sum
6ae514b7b889f55c787e37bdf6cc72ed  -

md5sum of lbzip2-ed file:

ariel@snapshot1008:~$  /usr/bin/php7.0 catbz2file.php  | md5sum
6ae514b7b889f55c787e37bdf6cc72ed  -

Change 474159 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] use lbzip2 for recompression of wikidata weekly json dumps

https://gerrit.wikimedia.org/r/474159

What do folks think about this for a first step? When we're happy that these are ok, we can roll out to the rdf dumps. Right now the last dumps of the weekly (lexeme) finish on Sunday so that's just not sustainable going forwards.

This looks very good. I just checked, and even the latest PHP still uses the "zlib compatibility functions", which had trouble with chunked bzip2 files before (see T118379); it uses them for both the compress.bzip2 stream wrapper and for [[https://secure.php.net/manual/en/function.bzopen.php|bzopen]].

Given this now works (which should also mean that this works in Python etc.), I think this is good to go.

The lbzip2 code doesn't produce chunked bzip2 streams (like e.g. the multistream xml pages-articles dumps). It's one stream only. I expect that is why php runs ok on it.

Change 474159 merged by ArielGlenn:
[operations/puppet@production] use lbzip2 for recompression of wikidata weekly json dumps

https://gerrit.wikimedia.org/r/474159

The first use of lbzip2 has been deployed, so that it can take effect for tomorrow's json dumps. If that goes well, I'd like to enable it for the rdf dumps during next week's run.

Change 480140 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] use lbzip2 in wikidata rdf weeklies

https://gerrit.wikimedia.org/r/480140

Change 480140 merged by ArielGlenn:
[operations/puppet@production] use lbzip2 in wikidata rdf weeklies

https://gerrit.wikimedia.org/r/480140

As you can see, I merged the change to the rdf shell script. The dump running now is the 'all' nt one, so we won't see lbzip2 in use until the next part of the cron job, the truthy nt ones. I double-checked the output from the json files, and the md5sums of the decompressed gz and bz2 files are identical (I used bzcat to decompress the bz2).

This is already an improvement; the weeklies finished late Saturday night instead of on Monday. This coming run should go faster, since lbzip2 will be used for all/nt and all/ttl as well.

ArielGlenn claimed this task.

It looks like the entire run now finishes in the wee hours of Friday morning. This is much better. I'm sure with the rapid growth of wikidata we'll have to revisit this in the future but for now... closing!