
Wikidata JSON dump generation broken
Closed, Resolved | Public | BUG REPORT

Description

Hello,

The latest weekly Wikidata JSON dump is only half the size of the previous one.

Can you fix the issue please? At least, the corrupted files should be removed.

Thanks.

Event Timeline

Envlh created this task.Jun 26 2019, 6:29 AM

Adding @Smalyshev for comments.

This may be a consequence of the bug fixed here: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/518782

In this case I would suggest trying to run the dump again or just deleting this JSON dump and waiting for a new one.

Whichever folks prefer, just let me know.

Hello, if you guys are sure the bug was fixed, then it's no problem to wait for next week's dump.
But if there is a risk it isn't, then running the dump again today would avoid losing a few days before a full resolution and would ensure that next week's dump will be fine.

Also, as suggested by Envlh, the corrupted files should be removed/renamed.

Hello,
Is it possible to have an update on this issue?
Could you confirm if and when the weekly dump will be generated again?
As suggested, in the meantime, I strongly recommend deleting the corrupted dumps to avoid confusion.
Thank you in advance!

I'd prefer to remove the files and wait for the new run, but I'd like @Smalyshev's opinion on whether the dumps are most likely fixed or not, since he was the one who handled the broken deployment at the time.

Hello all! Thanks for looking into this. Could I ask if a check might be put in place to catch this type of large dump generation error?

Yeah let's delete the broken one.

IIRC we do have checks, but the file size seems to have been big enough to slip through.
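
As an illustration of how a half-size dump can pass a size check, here is a minimal Python sketch. It is not the actual check in operations/puppet: the function name, the 20 GB floor, the 28 GB "broken" size and the 0.9 ratio are all hypothetical. Only the 56026802223-byte figure comes from this task (the size of the good 20190701 gz file). The idea is that a fixed minimum alone accepts anything above the floor, while a comparison against the previous run also catches a sudden drop.

```python
def dump_size_problems(size_bytes, min_expected_bytes,
                       previous_size_bytes=None, min_ratio=0.9):
    """Return a list of problems; an empty list means the size looks plausible.

    All thresholds are assumptions for illustration, not production values.
    """
    problems = []
    # Absolute floor: catches dumps that are far too small.
    if size_bytes < min_expected_bytes:
        problems.append(
            f"size {size_bytes} is below the minimum of {min_expected_bytes}"
        )
    # Relative check: catches a dump that shrank a lot since the last run,
    # even if it is still above the absolute floor.
    if previous_size_bytes and size_bytes < previous_size_bytes * min_ratio:
        problems.append(
            f"size {size_bytes} is less than {min_ratio:.0%} of the "
            f"previous run ({previous_size_bytes})"
        )
    return problems


# Hypothetical numbers: a 20 GB floor lets a roughly half-size file through,
# but the comparison with the previous run flags it.
GOOD_SIZE = 56026802223          # size of 20190701.json.gz from this task
HALF_SIZE = 28_000_000_000       # hypothetical size of a broken dump
FLOOR = 20_000_000_000           # hypothetical minimum expected size

print(dump_size_problems(HALF_SIZE, FLOOR))             # floor only
print(dump_size_problems(HALF_SIZE, FLOOR, GOOD_SIZE))  # floor + ratio
```

A relative check has its own failure mode (it would compare against a broken previous run), so it complements a periodically updated floor rather than replacing it.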

Smalyshev triaged this task as Normal priority.Jun 26 2019, 3:56 PM

I have moved the files wikidata-20190624-all.json.gz and wikidata-20190624-all.json.bz2 to filenames that end in .not. The 'latest' links for the json bz2 and gz files are now broken; this lets people know that the links are missing instead of beguiling them into reprocessing last week's runs.

Just a heads up that mirrors still have the broken files linked to latest: http://dumps.wikimedia.your.org/wikidatawiki/entities/.

Not sure how often those update or how they would in this case.

They typically pull once a day or more frequently.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Jun 27 2019, 7:48 AM

I'll leave this open until next week's run completes properly.

Thank you everyone.
In case it helps, in the broken file I noticed that after every 3.8M records there are 17 lines with just a comma and nothing else.
Example lines:
3811891
7425067
11235875
15044264
18855513
22664531
26475955
30280713
I really hope next week will not fail.
Thanks
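
The observation above can be checked mechanically. The following Python sketch assumes the dump is laid out as one JSON array with one entity per line, so a line holding only a comma marks a spot where an entity is missing. The function names and the file path in the usage note are hypothetical, and this is not part of the dump tooling.

```python
import gzip


def comma_only_line_numbers(lines):
    """Yield 1-based numbers of lines that contain nothing but a comma."""
    for number, line in enumerate(lines, start=1):
        if line.strip() == ",":
            yield number


def scan_gzip_dump(path):
    """Stream a gzip-compressed dump and list its comma-only lines.

    The file is read line by line, so memory use stays small even for
    a dump of tens of gigabytes.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(comma_only_line_numbers(handle))


# Small in-memory example with two empty slots between Q1 and Q2.
sample = ["[\n", '{"id":"Q1"},\n', ",\n", ",\n", '{"id":"Q2"}\n', "]\n"]
print(list(comma_only_line_numbers(sample)))
```

For a real file, something like `scan_gzip_dump("wikidata-20190624-all.json.gz.not")` would return the affected line numbers; the path here is only an example.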

hoo added a subscriber: hoo.Jun 27 2019, 7:13 PM

Change 519493 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Wikidata dumps: Update minimum expected sizes

https://gerrit.wikimedia.org/r/519493

Change 519494 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] dumpwikidatajson: Fix error code detection

https://gerrit.wikimedia.org/r/519494

hoo added a comment.Jun 27 2019, 7:18 PM

I looked into why these incomplete and broken dumps were even published, found and fixed the cause(s) (see above).

Change 519493 merged by ArielGlenn:
[operations/puppet@production] Wikidata dumps: Update minimum expected sizes

https://gerrit.wikimedia.org/r/519493

Change 519494 merged by ArielGlenn:
[operations/puppet@production] dumpwikidatajson: Fix error code detection

https://gerrit.wikimedia.org/r/519494

This should(tm) never happen again.

If we get through this run with happy json files, I'll close the task.

ArielGlenn closed this task as Resolved.Jul 3 2019, 5:51 AM
ArielGlenn claimed this task.
$ ls -lL /data/otherdumps/wikidata/20190701.json.gz 
-rw-r--r-- 1 dumpsgen dumpsgen 56026802223 Jul  2 21:55 /data/otherdumps/wikidata/20190701.json.gz

Closing.

Envlh reopened this task as Open.Jul 3 2019, 4:51 PM

Hello,

Sorry to reopen this bug, but it seems that the new dumps still have the .not extension:
https://dumps.wikimedia.org/wikidatawiki/entities/20190701/

It breaks the link in the page https://dumps.wikimedia.org/other/wikidata/ and thus all projects downloading the dumps with Wikidata Toolkit.

Can you fix the issue please?

@Envlh dumps were broken again, check out this new ticket they are working on now: https://phabricator.wikimedia.org/T227207

ArielGlenn closed this task as Resolved.Jul 3 2019, 5:12 PM

Going to close this task again, since the bug is different for the new problem.

Envlh added a comment.Jul 3 2019, 8:21 PM

@TheDatum @ArielGlenn Thank you for the clarification!