If you compare the output of https://commons.wikimedia.org/wiki/Special:EntityData/M76.json with the same entity in the Wikimedia Commons structured data (entities) dump, you will notice that some fields are missing. The most important one for me is the "title" field, which tells you which file the entity belongs to. Without it, it is hard to determine what the entity is about (you can infer it from the entity ID, because the number in it matches the page_id, but that requires additional data to resolve).
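To illustrate the difference, here is a minimal sketch; the two records below are abbreviated, hypothetical samples modeled on the shapes of the API output and the dump output, not real responses:

```shell
# Abbreviated, hypothetical records modeled on the two output shapes.
api_record='{"id":"M76","title":"File:Example.jpg","pageid":76,"lastrevid":1}'
dump_record='{"id":"M76","lastrevid":1}'

# The API record carries the page title; the dump record does not.
echo "$api_record"  | grep -c '"title"'   # → 1
echo "$dump_record" | grep -c '"title"'   # → 0
```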
I added it to wikimedia-hackathon-2022. I think it would be a nice thing to fix as part of it.
@Mitar: Just to avoid misunderstandings, do you plan to work on this at the Hackathon? :)
I would be interested in doing that, but I will probably need a helping hand. I have a programming background, but zero understanding of where and how this could be fixed. My understanding is that the hackathon would be suitable for this? Do I have to make a session? How do I find other people who might be able to help me?
@Mitar: Ah, great! No session needed in my understanding. See https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2022/Participants and https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2022/How_to for finding people - thanks!
Just had a chat with @Mitar on IRC about a possible approach to this and they will write something up here now! :)
So the plan is:
- Take addPageInfoToRecord from repo/includes/Api/ResultBuilder.php (https://github.com/wikimedia/Wikibase/blob/44b2d731c507d40472cf6f1392bc378166e2a45f/repo/includes/Api/ResultBuilder.php#L354-L362) and move it out to repo/includes.
- Then have both the API and the dump generation in generateDumpForEntityId call into it to add those fields to the dump entity as well. Currently generateDumpForEntityId already does $data['lastrevid'] = $revision->getRevisionId();, but reusing that function will also add title and a few other fields (ns, modified, pageid), which I think is great for parity between the API and dumps.
- It was suggested to me that this should be behind a flag/setting. I am not sure that is really needed, but I will then add it as an opt-out setting?
Change 793934 had a related patch set uploaded (by Mitar; author: Mitar):
[mediawiki/extensions/Wikibase@master] Make sure both API and dump include same page metadata fields
I made a first pass. Feedback welcome.
I ran tests using composer run-script test, but I feel like they have not really run. I had a bug in JsonDumpGeneratorTest and no error was reported. I am also not sure if getMockEntityTitleStoreLookup has been correctly implemented.
I made another pass, adding a configuration option to not include page metadata (the dump is then without title and the other page metadata).
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/793934 is ready for a review, it has both opt-in configuration option and a test.
Change 793934 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Make sure both API and dump include same page metadata fields.
So the fix to the dump script has been merged into the Wikibase extension. It is gated behind a CLI switch. What is the process for getting this turned on for dumps from Wikimedia Commons (and ideally also for Wikidata)?
After the code change rolls out with the deployment train next week, you could submit a Gerrit change to Puppet to add the flag (dumpwikibasejson.sh, found via codesearch), and add it to a Puppet request window. Should be safe enough to try it out for one week’s dumps – if it doesn’t work as expected, or degrades performance by too much, it can be reverted again.
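A hypothetical sketch of what that Puppet change might look like; the surrounding variables and the exact invocation inside dumpwikibasejson.sh are assumptions for illustration, and only the --page-metadata flag itself comes from the merged Wikibase change:

```shell
# Hypothetical sketch, not the real Puppet-managed script: the only
# change is threading the new --page-metadata flag into the dump command.
php "$multiversionscript" extensions/Wikibase/repo/maintenance/dumpJson.php \
    --wiki "$projectName" \
    --page-metadata \
    --shard "$shard" --sharding-factor "$shards" \
    2>> "$errorLog" | gzip -9 > "$tempFile"
```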
I can also try out the change in production (create a tiny partial dump just to see what the JSON looks like), once the code change has rolled out (next Thursday or Friday, probably). Feel free to remind me if I forget ^^
Change 802921 had a related patch set uploaded (by Mitar; author: Mitar):
[operations/puppet@production] Add page metadata to Wikibase JSON dumps
Done. Added it to June 7 puppet request window. Please review/advise if I did something wrong.
Seems to work in production:
lucaswerkmeister-wmde@mwdebug1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --limit 1 --snippet 2>/dev/null | jq . | tail
      "badges": []
    },
    "hewikiquote": {
      "site": "hewikiquote",
      "title": "אפריקה",
      "badges": []
    }
  },
  "lastrevid": 1652816527
}
lucaswerkmeister-wmde@mwdebug1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --limit 1 --snippet --page-metadata 2>/dev/null | jq . | tail
      "title": "אפריקה",
      "badges": []
    }
  },
  "pageid": 111,
  "ns": 0,
  "title": "Q15",
  "lastrevid": 1652816527,
  "modified": "2022-06-02T12:56:04Z"
}
Just flagging up that this was originally about the Commons data dumps, but I think this Puppet change covers both Wikidata and Commons.
@Lucas_Werkmeister_WMDE I guess Lydia should give this a stamp of approval.
I wonder if it would increase the compressed dump size much? (probably not)
🆗
I wonder if it would increase the compressed dump size much? (probably not)
Yeah that is also my only worry. But I fear we'll have to deal with subsetting sooner or later anyway now.
I grabbed a set of 500 random entity IDs from the API (P29604) and ran a dump for them; if I did my maths right, the dumps would grow by ca. 0.5% before compression:
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | wc -c
8160550
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | wc -c
8200010
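The uncompressed growth figure can be reproduced from the two byte counts above:

```shell
# Percentage growth from the two wc -c byte counts above.
before=8160550
after=8200010
awk -v a="$before" -v b="$after" 'BEGIN { printf "%.2f%%\n", (b - a) / a * 100 }'
# → 0.48%
```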
Or by ca. 2% after compression:
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | gzip -9 | wc -c
736735
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | gzip -9 | wc -c
751205
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | bzip2 | wc -c
521147
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | bzip2 | wc -c
532060
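The compressed growth follows from the same arithmetic on the gzip and bzip2 byte counts above:

```shell
# Growth of the compressed sizes from the byte counts above.
awk 'BEGIN {
    printf "gzip:  %.1f%%\n", (751205 - 736735) / 736735 * 100
    printf "bzip2: %.1f%%\n", (532060 - 521147) / 521147 * 100
}'
# → gzip:  2.0%
# → bzip2: 2.1%
```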
I tried it with a larger set of 10000 entity IDs (P29605) and got fairly similar results – 0.5% before compression, 1.9% after gzip, 1.7% after bzip2. For comparison, between 2022-04-25 to 2022-06-06 (these are the least and most recent dumps we still store), the dumps grew by 1.04% (wikidata-20220425-all.json.gz: 111585957308 bytes; wikidata-20220606-all.json.gz: 112743720511 bytes). So this would be a jump ahead, in terms of dump size, of roughly three months, so to speak. (Or: when we deploy this patch, the next dump will immediately reach a size that we otherwise wouldn’t expect for roughly three months.)
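The "roughly three months" figure follows from the baseline growth rate quoted above (about 1.04% over the six weeks between the two stored dumps):

```shell
# A one-time ~2% jump at ~1.04% baseline growth per six weeks
# corresponds to roughly this many weeks of normal growth.
awk 'BEGIN { printf "%.0f weeks\n", 2.0 / 1.04 * 6 }'
# → 12 weeks (about three months)
```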
That’s more than I would’ve expected, but tolerable and on the whole still worth it, I think. (As Lydia says, the need for subsetting already looms on the horizon with or without this change.)
Subsetting is about providing several dumps, each containing a subset of the data, to make them smaller and easier to work with.
@Mitar your CR is all approved, please ping me on IRC (jbond) when you are around and I can merge.
Awesome. I will try to do so when you are online, but feel free also to just merge it without me. I do not know if I can be of much help being around anyway. :-)
Change 802921 merged by Jbond:
[operations/puppet@production] Add page metadata to Wikibase JSON dumps
So for the next dump which will run, this will now be included? Or is there some deployment which is still necessary?
I checked commons-20220620-mediainfo.json.bz2 and it contains the title field (alongside the other fields that are present in the API).
And the dumps size has indeed increased somewhat:
| date | gz | bz2 |
|---|---|---|
| 20220530 | 112573430297 | 74122394696 |
| 20220606 | 112743720511 | 74236824815 |
| 20220613 | 113026643896 | 74443521341 |
| 20220620 | 115432876924 | 75966141297 |
That’s 2.1% for gz and 2.0% for bz2 compared to the last dump without these fields (so the roughly 2% growth comes from the new fields and the week's regular data growth combined).
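Those percentages can be checked directly against the last two rows of the table:

```shell
# 20220613 → 20220620 growth, per the table above.
awk 'BEGIN {
    printf "gz:  %.1f%%\n", (115432876924 - 113026643896) / 113026643896 * 100
    printf "bz2: %.1f%%\n", (75966141297 - 74443521341) / 74443521341 * 100
}'
# → gz:  2.1%
# → bz2: 2.0%
```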