
Wikidata JSON dump generation broken
Closed, Resolved · Public · BUG REPORT



The last weekly Wikidata JSON dump is only half the size of the previous dump:

Can you fix the issue please? At least, the corrupted files should be removed.


Related Objects

Event Timeline

This may be a consequence of the bug fixed here:

In this case I would suggest trying to run the dump again or just deleting this JSON dump and waiting for a new one.

Whichever folks prefer, just let me know.

Hello, if you are sure the bug was fixed, then it's no problem to wait for next week's dump.
But if there is a risk it isn't, then running the dump again today would avoid losing a few days before a full resolution, and would ensure that next week's dump is fine.

Also, as suggested by Envlh, the corrupted files should be removed/renamed.

Is it possible to have an update on this issue?
Could you confirm if and when the weekly dump will be generated again?
As suggested, in the meantime, I strongly recommend deleting the corrupted dumps to avoid confusion.
Thank you in advance!

I'd prefer to remove and wait for the new run, but I'd like @Smalyshev's opinion on whether the dumps are most likely fixed or not, since he was the one who handled the broken deployment at the time.

Hello all! Thanks for looking into this. Could I ask if a check might be put in place to catch this type of large dump generation error?

Yeah let's delete the broken one.

IIRC we do have checks, but the file was still big enough that it slipped through.
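A minimum-size check of the kind described above can be sketched roughly as follows; the threshold and function name here are illustrative, not the actual Puppet configuration:

```shell
# Hypothetical minimum-size gate for a finished dump file before publishing.
# The default threshold is illustrative, not the real configured value.
check_min_size() {
    local dump_file="$1"
    local min_size="${2:-53687091200}"   # 50 GiB; recent good runs were ~56 GB
    local actual_size
    actual_size=$(stat -c %s "$dump_file") || return 2
    if [ "$actual_size" -lt "$min_size" ]; then
        echo "ERROR: $dump_file is only $actual_size bytes (< $min_size); not publishing." >&2
        return 1
    fi
    echo "OK: $dump_file passes the minimum-size check ($actual_size bytes)."
}
```

The weakness the comment points out applies here too: if the broken dump is still above the configured threshold, the check passes and the file gets published anyway.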

Smalyshev triaged this task as Medium priority. Jun 26 2019, 3:56 PM

I have moved the files wikidata-20190624-all.json.gz and wikidata-20190624-all.json.bz2 to filenames that end in .not. The 'latest' links for the json bz2 and gz files are now broken; this lets people know that the links are missing instead of beguiling them into reprocessing last week's runs.

Just a heads up that mirrors still have the broken files linked to latest:

Not sure how often those update or how they would in this case.

They typically pull once a day or more frequently.

I'll leave this open until next week's run completes properly.

Thank you everyone.
In case it helps, in the broken file I noticed that after every 3.8M records there are 17 lines with just a comma and nothing else.
Example lines:
I really hope next week will not fail.
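Comma-only lines like the ones described above can be located with a small helper along these lines (the dump filename in the usage comment is illustrative):

```shell
# Print the line numbers of lines that consist of exactly one comma
# in a gzipped JSON dump. Such lines break line-by-line entity parsing.
find_comma_lines() {
    zcat "$1" | grep -n -x ','
}
# Usage (illustrative filename):
#   find_comma_lines wikidata-20190624-all.json.gz | head
```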

Change 519493 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Wikidata dumps: Update minimum expected sizes

Change 519494 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] dumpwikidatajson: Fix error code detection

I looked into why these incomplete and broken dumps were even published, found and fixed the cause(s) (see above).
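A common cause of this kind of "broken dump still published" failure in shell-based dump scripts is that a pipeline's exit status defaults to that of its last command, so a failing producer is masked by a succeeding compressor. The sketch below shows that general fix pattern with `set -o pipefail`; it is not the actual dumpwikidatajson code, and `produce_dump` is a stand-in:

```shell
# Without `set -o pipefail`, a pipeline's exit status is that of its last
# command, so a failure in the dump producer is silently masked by gzip.
set -o pipefail

produce_dump() {
    # Stand-in for the real dump producer; it fails partway through.
    echo '{"id":"Q1"}'
    return 1
}

if produce_dump | gzip > /tmp/dump.json.gz; then
    dump_status="succeeded"
else
    # With pipefail, produce_dump's failure is detected here.
    dump_status="failed; not publishing"
fi
echo "dump $dump_status"
```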

Change 519493 merged by ArielGlenn:
[operations/puppet@production] Wikidata dumps: Update minimum expected sizes

Change 519494 merged by ArielGlenn:
[operations/puppet@production] dumpwikidatajson: Fix error code detection

If we get through this run with happy json files, I'll close the task.

ArielGlenn claimed this task.
$ ls -lL /data/otherdumps/wikidata/20190701.json.gz 
-rw-r--r-- 1 dumpsgen dumpsgen 56026802223 Jul  2 21:55 /data/otherdumps/wikidata/20190701.json.gz



Sorry to reopen this bug, but it seems that the new dumps still have the .not extension:

This breaks the link on the page, and thus all projects that download the dumps with Wikidata Toolkit.

Can you fix the issue please?

@Envlh the dumps were broken again; check out this new ticket they are working on now:

Going to close this task again, since the new problem is a different bug.