
Index formatting issue in multistream-index file
Closed, Resolved, Public

Description

It seems there is a problem when generating the multistream index file. Offsets are formatted as doubles when they get too large (e.g. in the last enwiki dump, enwiki-20190120-pages-articles-multistream-index.txt.bz2, the first bad entry is 2.14764e+09:1607683:Anacreon of Painters), which truncates the long offset value. Try a

grep "e+0" index.txt

to find more examples.
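For context, this is my reading of the mechanism rather than anything stated in the task: (m)awk stores every number as a C double, and once a value can no longer be printed as an exact integer (evidently somewhere around 2**31 here) it falls through to the OFMT format, which defaults to "%.6g". Python reproduces the formatting; the offset below is made up to round the same way as the entry above:

>>> "%.6g" % 2147639912   # hypothetical offset just past 2**31
'2.14764e+09'

Six significant digits is nowhere near enough for a byte offset into a multi-gigabyte dump file, hence the truncation.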

Event Timeline

Hmm, that will be an artifact of the new recombine code. I'll have a look.

This turns out to be an annoying limitation of (m)awk: numbers are actually doubles. I'll have to write a tiny Python script to do the index file merge.
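A minimal sketch of what such a merge might look like, assuming the recombined multistream dump is just the per-part bz2 files concatenated, so each part's offsets shift by the total size of the parts before it. The function and variable names are mine and this is not the actual patch; the point is that Python ints are arbitrary-precision, so large offsets survive intact:

import bz2
import os

def merge_indexes(part_dumps, part_indexes, out_path):
    # Byte offset of the current part within the recombined dump file.
    cumulative = 0
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for dump_path, index_path in zip(part_dumps, part_indexes):
            with bz2.open(index_path, "rt", encoding="utf-8") as index:
                for line in index:
                    # Each line is offset:pageid:title. Only the offset
                    # shifts, and titles may contain colons themselves,
                    # so split exactly once.
                    offset, rest = line.split(":", 1)
                    out.write("%d:%s" % (int(offset) + cumulative, rest))
            cumulative += os.path.getsize(dump_path)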

Change 490591 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] generate recombined multistream index file without (m)awk

https://gerrit.wikimedia.org/r/490591

Only the so-called 'big wikis' are subject to this bug, since only for them do we generate smaller index files and recombine them. Of those, the only ones big enough to have the issue are the en, wikidata, commons, de, fr, es, it, ja, and ru wikis. Once there is a verified standalone script to fix up just the index file, I'll run it, remove the sha1 and md5 sums for the bad index files, and then run a no-op on all of these wikis so that their status files are regenerated. Wikidata will need to wait until it has completed the 7z step it's currently in the middle of, since we can't regenerate status files for a wiki in the middle of a dump run.

https://gist.github.com/apergos/49c2812292c0ec26ccedc8bc9d5d69e5 is a standalone script that does what the gerrit patch does, but regenerates only the combined index file, writing it to an arbitrary location rather than the live dump directory.

I've tested this with big wikis that are not broken, and the output is identical to the current index files. I've tested it with a broken wiki (enwiki), and the output looks good.
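For anyone who wants to spot-check a regenerated index themselves, a scan along these lines (the file name is the one from the description; the checks are just my suggestion) catches both leftover scientific notation and out-of-order offsets:

import bz2

path = "enwiki-20190120-pages-articles-multistream-index.txt.bz2"
prev = 0
with bz2.open(path, "rt", encoding="utf-8") as index:
    for lineno, line in enumerate(index, 1):
        offset = line.split(":", 1)[0]
        if not offset.isdigit():
            raise ValueError("non-integer offset on line %d: %s" % (lineno, offset))
        if int(offset) < prev:
            raise ValueError("offset decreased on line %d" % lineno)
        prev = int(offset)
print("all offsets are decimal integers and non-decreasing")

Many consecutive lines share the same offset, since the pages in one bz2 stream all point at that stream's start, so non-decreasing (rather than strictly increasing) is the right invariant.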

I'll start doing manual recombines now, and when they're all done I'll move the new files into place and proceed with the rest of the cleanup.

I'll merge and deploy the gerrit patch after the current dump run completes.

New files have been generated for all of the above wikis; new sha1/md5 sums and status files are being generated for all but enwiki and wikidatawiki right now, and should be available for download sometime tomorrow.

Status and sha1/md5sum files are being regenerated for enwiki now and should be available for download later in the day. Wikidata's dump run should complete later today and then these files can be rebuilt for it as well.

Wikidatawiki's run is complete. I am now running the no-op job for it; the enwiki no-op job is still running but should complete in a few hours.

The last fixup jobs are complete, and all updated files should be available for download along with corrected checksums later today, if they are not already.

Change 490591 merged by ArielGlenn:
[operations/dumps@master] generate recombined multistream index file without (m)awk

https://gerrit.wikimedia.org/r/490591

Merged and deployed, with a commit message I forgot to update, though I double-checked everything else (of course). The change is no longer a work in progress and has been tested with large files. We'll see how everything goes on the run of the 20th.

The index files from that run look ok to me. @Nirv75 Does everything check out for you?

ArielGlenn claimed this task.

I'm going to go ahead and close this; feel free to re-open it if there is a problem.