
Wikidata JSON dump (bz2) no longer imports due to bad JSON format
Closed, Resolved · Public · Bug Report

Description

I have been ingesting the Wikidata JSON dump files into MongoDB using mongoimport. This has worked for a year or so, but the last two weekly dumps have failed with this error:

2021-03-05T15:35:17.320-0800 Failed: error reading separator after document #11554732: bad JSON array format - found '{' outside JSON object/array in input source
2021-03-05T15:35:17.320-0800 11553900 document(s) imported successfully. 0 document(s) failed to import.

The command I run is:
bunzip2 -dc ./wiki_job/latest-all.json.bz2 | mongoimport --host 127.0.0.1:27017 --db wikiData --collection wiki --type json --drop --numInsertionWorkers 4 --jsonArray

The affected dumps are those of March 3 and February 24 (as of 2021-03-05).
Feb 24th dump: https://dumps.wikimedia.org/wikidatawiki/entities/20210222/wikidata-20210222-all.json.bz2
March 3rd dump: https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

I am not sure what has changed in the dump file; I have tried various mongoimport parameters, but all of them exhibit the issue. The weekly dumps before February 24th are fine.
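
One quick way to see what is at the failure point (a sketch, assuming the dump's usual one-entity-per-line layout with the array opener [ on line 1, so document #11554732 should sit around line 11554733 of the decompressed stream):

# Print (truncated) lines around the reported failure point; sed quits
# immediately afterwards so the rest of the stream is not decompressed.
bunzip2 -dc ./wiki_job/latest-all.json.bz2 \
  | sed -n '11554730,11554740p;11554740q' \
  | cut -c1-300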

Event Timeline

Lydia_Pintscher moved this task from Incoming to Unconnected Stories on the Wikidata-Campsite board.
Lydia_Pintscher added a subscriber: Addshore.

I have the same issue.
I have a script to extract entities from Wikidata dumps, which I've been running successfully for years.

The last time I ran it, on the current latest-all.json.bz2 (03-Mar-2021 14:10, size 63323125695 bytes), it complained about malformed JSON:

ijson.common.IncompleteJSONError: parse error: after array element, I expect ',' or ']'
        :[]}},"lastrevid":1374358285}{"type":"item","id":"Q27","labe
                   (right here) ------^

The script runs multiple threads in parallel, so it can "crash" in some threads while continuing in others. That is how I noticed that the error happens not only at that point, but also in a couple more places throughout the JSON.

I'm currently downloading the .gz version (rather than .bz2) to try running on it (without much hope, to be honest).

The last successful extraction happened at the beginning of January, on a .bz2 of size 61247031499 bytes (I'm not able to find it on the dumps page).

I checked a JSON GZIP dump (/public/dumps/public/wikidatawiki/entities/20210301/wikidata-20210301-all.json.gz on Toolforge) and it has the same problem:

{"type":"item","id":"Q105741430","labels":{"ru":{"language":"ru","value":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430"}},"descriptions":{},"aliases":{},"claims":{},"sitelinks":{"ruwiki":{"site":"ruwiki","title":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430","badges":[]}},"lastrevid":1374357438}{"type":"item","id":"Q18",

Edit: by the way, that’s the same bad line as in /public/dumps/public/wikidatawiki/entities/20210301/wikidata-20210301-all.json.bz2

{"type":"item","id":"Q105741430","labels":{"ru":{"language":"ru","value":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430"}},"descriptions":{},"aliases":{},"claims":{},"sitelinks":{"ruwiki":{"site":"ruwiki","title":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430","badges":[]}},"lastrevid":1374357438}{"type":"item","id":"Q18",

– but not the same as in @Motagirl2’s comment (a different lastrevid in the first part and a different ID in the second part), so I’m guessing we’re looking at different dumps.
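
For anyone checking their own copy, a minimal sketch (assuming the same one-entity-per-line layout) that counts lines where two entities were glued together without the separating comma; a healthy dump should print 0:

# Count entity boundaries that are missing the ",\n" separator.
bzcat /public/dumps/public/wikidatawiki/entities/20210301/wikidata-20210301-all.json.bz2 \
  | grep -c '}{"type":'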

Hi @LucasWerkmeister,

No, I encountered the error at several points in the JSON (my script runs in parallel across different threads, each thread handling different lines from the JSON, so each thread crashes at a different point), including yours:

ijson.common.IncompleteJSONError: parse error: after array element, I expect ',' or ']'

:[]}},"lastrevid":1374357438}{"type":"item","id":"Q18","labe
           (right here) ------^

For the record, a list of all "error points" I've found:

:[]}},"lastrevid":1374357438}{"type":"item","id":"Q18","labe
:[]}},"lastrevid":1374358285}{"type":"item","id":"Q27","labe
s":{},"lastrevid":1374357261}{"type":"item","id":"Q44","labe
s":{},"lastrevid":1374357604}{"type":"item","id":"Q1","label
s":{},"lastrevid":1374359379}{"type":"item","id":"Q17","labe
s":{},"lastrevid":1374357722}{"type":"item","id":"Q15","labe
s":{},"lastrevid":1374358152}{"type":"item","id":"Q22","labe

Ah, I see. It’s very curious that all these “error points” correspond to very low item IDs…

That said, I quickly looked at the IDs of the first 1000 items in the dump, and none of the seven IDs in those “error points” were among them. So I don’t yet know whether those items are mysteriously appearing in the dump for a second time, or whether this is their first and only appearance, just without the ,\n that’s supposed to be there.
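
A rough sketch of that kind of check (assuming GNU grep, and that after the opening [ every entity line starts with {"type":...,"id":...}): list the IDs of the first 1000 entities and see whether any of the seven "error point" items show up among them.

# Take the first 1000 entity lines, pull out each entity's own ID,
# and look for the items seen at the "error points".
bzcat wikidata-20210301-all.json.bz2 \
  | head -n 1001 | tail -n +2 \
  | grep -oP '^\{"type":"[a-z]+","id":"\KQ[0-9]+' \
  | grep -xE 'Q1|Q15|Q17|Q18|Q22|Q27|Q44'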

Can you confirm that https://dumps.wikimedia.org/wikidatawiki/entities/20210215/wikidata-20210215-all.json.bz2 does not have the issue? For me it works, but I'm just checking, as a reference point.

I’ve started my script (using this tool) on the 20210215 bzip2 and gz dumps, but if those dumps are intact, it’ll probably take a bit less than half a day until that’s confirmed. (Or maybe someone else is faster ^^)

This comment was removed by Motagirl2.

I am downloading it (3+ hours), and my script will take about 5 or 6 hours. It's 7:35 pm now in my timezone, so I hope to have some news in the morning :)

Alright, the 20210215 gz dump appears to be intact. (The bzip2 one hasn’t finished processing yet.)

Great, thanks for verifying that. That matches what I see. Do we know what is causing this and when we can expect a fix? If the dump file is broken, I would imagine this is a high-priority bug.

The 20210215 bz2 works perfectly 👍

> The 20210215 bz2 works perfectly 👍

Yup, same here.

A workaround might be to insert sed 's/}{/},{/g' into the pipeline between bunzip2 and mongoimport. (Though that’ll probably at least slow down the import, since sed will run regexes against huge input lines.)
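
Spelled out against the pipeline from the description, that workaround would look roughly like this (untested sketch):

# Reinsert the missing commas on the fly before mongoimport sees the stream.
bunzip2 -dc ./wiki_job/latest-all.json.bz2 \
  | sed 's/}{/},{/g' \
  | mongoimport --host 127.0.0.1:27017 --db wikiData --collection wiki \
      --type json --drop --numInsertionWorkers 4 --jsonArray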

Change 669404 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] wikibase entity dumps: add comma at end of intermediate files

https://gerrit.wikimedia.org/r/669404

Change 669404 merged by ArielGlenn:
[operations/puppet@production] wikibase entity dumps: add comma at end of intermediate files

https://gerrit.wikimedia.org/r/669404

Will this patch be included in the next dump, or can it be applied retroactively to the last two dumps (i.e., by regenerating them)?

> Will this patch be included in the next dump, or can it be applied retroactively to the last two dumps (i.e., by regenerating them)?

This should be in time for the dump that will be produced this week. For the previous two weeks' dumps you'll need to filter the contents to add the missing commas, as mentioned by Lucas in his earlier comment.

I'll leave this open until the run is complete and folks have had time to try to use them, so probably through the coming weekend.

I wrote a small script to fix these dumps (wikibase-json-dump-double-entry-fix), and it is currently running over the invalid dump (on stat1007). Once it is done and we have re-compressed the dump in both gzip and bzip2, we will provide a fixed-up 20210301 JSON dump.
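
For illustration only (this is not hoo's actual script), the general shape of such a fix is a single streaming pass that restores the missing separators and then re-compresses the result in both published formats; a minimal bash sketch, assuming GNU sed and process substitution:

# Split glued entity lines by restoring the "},\n{" separator, then write
# gzip and bzip2 copies in one pass. This only repairs the array structure;
# it does not try to detect duplicate entries, which the real script's name
# suggests it also handles.
bzcat wikidata-20210301-all.json.bz2 \
  | sed 's/}{"type":/},\n{"type":/g' \
  | tee >(gzip > wikidata-20210301-all.fixed.json.gz) \
  | bzip2 > wikidata-20210301-all.fixed.json.bz2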

The fixed 20210301 JSON dumps (the very same content; only the structure was fixed) can now be found on dumps.wikimedia.org (and the old broken ones are in /bad).

Since @hoo validated the dump from the past week, verifying that the current dump generation process is fixed, we can now close this task. Thanks everyone!