
Wikidata JSON dump (bz2) no longer imports due to bad JSON format
Closed, Resolved · Public · Bug Report

Description

I have been ingesting the Wikidata JSON dump files into MongoDB using mongoimport. This has worked for a year or so, but the last two weekly dumps have failed with this error:

2021-03-05T15:35:17.320-0800 Failed: error reading separator after document #11554732: bad JSON array format - found '{' outside JSON object/array in input source
2021-03-05T15:35:17.320-0800 11553900 document(s) imported successfully. 0 document(s) failed to import.

The command I run is:
bunzip2 -dc ./wiki_job/latest-all.json.bz2 | mongoimport --host 127.0.0.1:27017 --db wikiData --collection wiki --type json --drop --numInsertionWorkers 4 --jsonArray

The affected dumps are those of March 3 and February 24 (as of 2021-03-05).
Feb 24th dump: https://dumps.wikimedia.org/wikidatawiki/entities/20210222/wikidata-20210222-all.json.bz2
March 3rd dump: https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

I am not sure what has changed in the dump file; I have tried various mongoimport parameters, but all of them exhibit the issue. The weekly dumps before February 24th are fine.
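
One quick way to see what is at the failure point (a sketch, assuming the dump's usual one-entity-per-line layout with the array opener [ on line 1, so document #11554732 should sit around line 11554733 of the decompressed stream):

# Print (truncated) lines around the reported failure point; sed quits
# immediately afterwards so the rest of the stream is not decompressed.
bunzip2 -dc ./wiki_job/latest-all.json.bz2 \
  | sed -n '11554730,11554740p;11554740q' \
  | cut -c1-300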

Event Timeline

Lydia_Pintscher moved this task from Incoming to Unconnected Stories on the Wikidata-Campsite board.
Lydia_Pintscher added a subscriber: Addshore.

I have the same issue.
I have a script to extract entities from Wikidata dumps, which I've been running successfully for years.

The last time I ran it, on the current latest-all.json.bz2 (03-Mar-2021 14:10, size 63323125695 bytes), it complained about malformed JSON:

ijson.common.IncompleteJSONError: parse error: after array element, I expect ',' or ']'
        :[]}},"lastrevid":1374358285}{"type":"item","id":"Q27","labe
                   (right here) ------^

The script runs multiple threads in parallel, so it can "crash" in some threads while continuing in others. That is how I noticed that the error happens not only at that point, but also in a couple more places throughout the JSON.

I'm currently downloading the .gz version (rather than .bz2) to try running on it (without much hope, to be honest).

The last successful extraction happened at the beginning of January, on a .bz2 of size 61247031499 bytes (I'm not able to find it on the dumps page).

I checked a JSON GZIP dump (/public/dumps/public/wikidatawiki/entities/20210301/wikidata-20210301-all.json.gz on Toolforge) and it has the same problem:

{"type":"item","id":"Q105741430","labels":{"ru":{"language":"ru","value":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430"}},"descriptions":{},"aliases":{},"claims":{},"sitelinks":{"ruwiki":{"site":"ruwiki","title":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430","badges":[]}},"lastrevid":1374357438}{"type":"item","id":"Q18",

Edit: by the way, that’s the same bad line as in /public/dumps/public/wikidatawiki/entities/20210301/wikidata-20210301-all.json.bz2

{"type":"item","id":"Q105741430","labels":{"ru":{"language":"ru","value":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430"}},"descriptions":{},"aliases":{},"claims":{},"sitelinks":{"ruwiki":{"site":"ruwiki","title":"\u041f\u0435\u0440\u0432\u043e\u0432\u043a\u0430","badges":[]}},"lastrevid":1374357438}{"type":"item","id":"Q18",

– but not the same as in @Motagirl2’s comment (a different lastrevid in the first part and a different ID in the second part), so I’m guessing we’re looking at different dumps.
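
For anyone checking their own copy, a minimal sketch (assuming the same one-entity-per-line layout) that counts lines where two entities were glued together without the separating comma; a healthy dump should print 0:

# Count entity boundaries that are missing the ",\n" separator.
bzcat /public/dumps/public/wikidatawiki/entities/20210301/wikidata-20210301-all.json.bz2 \
  | grep -c '}{"type":'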

Hi @LucasWerkmeister,

No, I encountered the error at several points in the JSON (my script runs in parallel across different threads, each thread handling different lines from the JSON, so each thread crashes at a different point), including yours:

ijson.common.IncompleteJSONError: parse error: after array element, I expect ',' or ']'

:[]}},"lastrevid":1374357438}{"type":"item","id":"Q18","labe
           (right here) ------^

For the record, a list of all "error points" I've found:

:[]}},"lastrevid":1374357438}{"type":"item","id":"Q18","labe
:[]}},"lastrevid":1374358285}{"type":"item","id":"Q27","labe
s":{},"lastrevid":1374357261}{"type":"item","id":"Q44","labe
s":{},"lastrevid":1374357604}{"type":"item","id":"Q1","label
s":{},"lastrevid":1374359379}{"type":"item","id":"Q17","labe
s":{},"lastrevid":1374357722}{"type":"item","id":"Q15","labe
s":{},"lastrevid":1374358152}{"type":"item","id":"Q22","labe

Ah, I see. It’s very curious that all these “error points” correspond to very low item IDs…

That said, I quickly looked at the IDs of the first 1000 items in the dump, and none of the seven IDs in those “error points” were among them. So I don’t yet know whether those items are mysteriously appearing in the dump for a second time, or whether this is their first and only appearance, just without the ,\n that’s supposed to be there.
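
A rough sketch of that kind of check (assuming GNU grep, and that after the opening [ every entity line starts with {"type":...,"id":...}): list the IDs of the first 1000 entities and see whether any of the seven "error point" items show up among them.

# Take the first 1000 entity lines, pull out each entity's own ID,
# and look for the items seen at the "error points".
bzcat wikidata-20210301-all.json.bz2 \
  | head -n 1001 | tail -n +2 \
  | grep -oP '^\{"type":"[a-z]+","id":"\KQ[0-9]+' \
  | grep -xE 'Q1|Q15|Q17|Q18|Q22|Q27|Q44'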

Can you confirm that https://dumps.wikimedia.org/wikidatawiki/entities/20210215/wikidata-20210215-all.json.bz2 does not have the issue? For me it works, but I'm just checking, as a reference point.

I’ve started my script (using this tool) on the 20210215 bzip2 and gz dumps, but if those dumps are intact, it’ll probably take a bit less than half a day until that’s confirmed. (Or maybe someone else is faster ^^)

This comment was removed by Motagirl2.

I am downloading it (3+ hours), and my script will take about 5 or 6 hours. It's 7:35 pm now in my timezone, so I hope to have some news in the morning :)

Alright, the 20210215 gz dump appears to be intact. (The bzip2 one hasn’t finished processing yet.)

Great, thanks for verifying that. That matches what I see. Do we know what is causing this and when we can expect a fix? If the dump file is broken, I would imagine this is a high-priority bug.

The 20210215 bz2 works perfectly 👍

> The 20210215 bz2 works perfectly 👍

Yup, same here.

A workaround might be to insert sed 's/}{/},{/g' into the pipeline between bunzip2 and mongoimport. (Though that’ll probably at least slow down the import, since sed will run regexes against huge input lines.)
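
Spelled out against the pipeline from the description, that workaround would look roughly like this (untested sketch):

# Reinsert the missing commas on the fly before mongoimport sees the stream.
bunzip2 -dc ./wiki_job/latest-all.json.bz2 \
  | sed 's/}{/},{/g' \
  | mongoimport --host 127.0.0.1:27017 --db wikiData --collection wiki \
      --type json --drop --numInsertionWorkers 4 --jsonArray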

Change 669404 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] wikibase entity dumps: add comma at end of intermediate files

https://gerrit.wikimedia.org/r/669404

Change 669404 merged by ArielGlenn:
[operations/puppet@production] wikibase entity dumps: add comma at end of intermediate files

https://gerrit.wikimedia.org/r/669404

Will this patch be included in the next dump, or can it be applied retroactively to the last two dumps (i.e., by regenerating them)?

> Will this patch be included in the next dump, or can it be applied retroactively to the last two dumps (i.e., by regenerating them)?

This should be in time for the dump that will be produced this week. For the previous two weeks' dumps you'll need to filter the contents to add the missing commas, as mentioned by Lucas in his earlier comment.

I'll leave this open until the run is complete and folks have had time to try to use them, so probably through the coming weekend.

I wrote a small script to fix these dumps (wikibase-json-dump-double-entry-fix), and it is currently running over the invalid dump (on stat1007). Once it is done and we have re-compressed the dump in both gzip and bzip2, we will provide a fixed-up 20210301 JSON dump.
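
For illustration only (this is not hoo's actual script), the general shape of such a fix is a single streaming pass that restores the missing separators and then re-compresses the result in both published formats; a minimal bash sketch, assuming GNU sed and process substitution:

# Split glued entity lines by restoring the "},\n{" separator, then write
# gzip and bzip2 copies in one pass. This only repairs the array structure;
# it does not try to detect duplicate entries, which the real script's name
# suggests it also handles.
bzcat wikidata-20210301-all.json.bz2 \
  | sed 's/}{"type":/},\n{"type":/g' \
  | tee >(gzip > wikidata-20210301-all.fixed.json.gz) \
  | bzip2 > wikidata-20210301-all.fixed.json.bz2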

The fixed 20210301 JSON dumps (the very same content; only the structure was fixed) can now be found on dumps.wikimedia.org (and the old broken ones are in /bad).

Since @hoo validated the dump from the past week, verifying that the current dump generation process is fixed, we can now close this task. Thanks everyone!