Set up generation of JSON dumps for Wikimedia Commons
Closed, Resolved, Public

Description

The wmf.wikidata_entity table in the Data Lake is driven by a JSON dump of Wikidata. To create a similar table for structured data on Commons, we'd need a JSON dump of Commons. T221917 set up RDF dumps, but not a JSON dump.

The parent task contains some known use cases that would benefit from having this data available.

Event Timeline

This would only partly be me; once the code is available in the extension to produce output of the right format, I'd be able to adjust existing manifests in puppet and make the dumps run in production, much like the Wikibase rdf/json dumps.

The Wikibase JSON dump script seems to work just fine for this, locally at least:

mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --entity-type=mediainfo

Is that adequate for you @ArielGlenn ?

That should be fine if you know that the output is good. $someone needs to look at the existing rdf scripts at https://github.com/wikimedia/puppet/tree/production/modules/snapshot/files/cron/wikibase (and https://github.com/wikimedia/puppet/tree/production/modules/snapshot/files/cron ) and do the same thing for the json one: create a shared functions script for json dumps, then make a general dumpwikibasejson.sh script that dumps commons or wikidata output depending on its args. If $someone remains null for long enough, I'll take care of it :-) It will be at least two weeks before I can get to this, though.

$someone seems to still be null 😄 Do you have time for this @ArielGlenn ?

I'll put it in my queue, without cookie-licking it however. It would be best if someone working with structured data on commons could spend some time, so that I'm not the only one with knowledge about these scripts.

@Cparle could be your apprentice 😃

Change 629121 had a related patch set uploaded (by Cparle; owner: Cparle):
[operations/puppet@production] Generation of json dumps for wikimedia commons

https://gerrit.wikimedia.org/r/629121

Hey @ArielGlenn this has been in our code review column for a long while now, you reckon you'll get to it soon?

It's been on my todo list every day for the same length of time. But yes, soon. In particular, I want to get this tested on the new buster host.

Change 629121 merged by ArielGlenn:
[operations/puppet@production] Generation of json dumps for wikimedia commons

https://gerrit.wikimedia.org/r/629121

Well, the json ones are being generated with the name "all" instead of "mediainfo" like the rdf ones, which is my error. I have a patch in to correct this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/665989/ . I'll need to fix up the names and links by hand once they are complete, on both the dumpsdata servers and the labstore boxes. Meh. After that, though, I expect we'll be good to go.

I have renamed everything and it's all rsynced out to the public-facing servers. You should be able to see the contents at https://dumps.wikimedia.org/other/wikibase/commonswiki/20210222/ @Cparle Care to have a look at the contents?
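For anyone checking the contents, here is a minimal Python sketch of reading such a dump. It assumes the files follow the usual Wikibase JSON dump layout (one JSON array, with each entity on its own line and a trailing comma after every entity but the last); that layout is an assumption here and has not been verified against the commonswiki files. The function name `iter_entities` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikibase-style JSON dump.

    Assumes the dump is a JSON array written with one entity per line,
    so the whole file never has to be loaded into memory at once.
    """
    for line in lines:
        # Drop surrounding whitespace and the comma that separates entities.
        line = line.strip().rstrip(",")
        # Skip blank lines and the array's opening and closing brackets.
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


# Small in-memory sample standing in for a real dump file.
sample = [
    "[\n",
    '{"type": "mediainfo", "id": "M12345"},\n',
    '{"type": "mediainfo", "id": "M72261258"}\n',
    "]\n",
]
entities = list(iter_entities(sample))
```

For a compressed dump, the same function can be fed an open text-mode file handle, for example one returned by `bz2.open(path, "rt", encoding="utf-8")` or `gzip.open(path, "rt", encoding="utf-8")`, where `path` is wherever the file was downloaded to.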

Thanks SO much @ArielGlenn, I am also downloading those on our stats machine and will check them once they are in!

@ArielGlenn thanks a lot again for this!

One question: would it be possible to store the "title" field in the JSON Blob as well?

For example for this MediaInfo entity: https://commons.wikimedia.org/wiki/Special:EntityData/M72261258.json we would have, on top of the fields currently stored in the dumps, an additional field "title": "File:Catedral de San Juan, Breslavia, Polonia, 2017-12-20, DD 09-11 HDR.jpg"

This would make it much easier to then join this info with other image properties in the mediawiki database.

Many thanks!

This is a question for Cormac or whoever maintains the extensions/Wikibase/repo/maintenance/dumpJson.php script, but as I understand it, a format change to the commons json-format entity dumps would mean a change to the wikidata entity dumps as well, since these are both wikibase dumps under the hood.

@ArielGlenn thanks for clarifying this. I chatted with @Cormac on Slack and he explained how to get the image page_id from the current entity data: it can be derived from the "id" field. For example, a mediainfo slot with an id of M12345 corresponds to the page with page_id 12345. Thanks both!
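The mapping described above can be sketched in Python as follows. This is only an illustration of the rule "strip the leading M and read the rest as the page id"; the helper name `mediainfo_id_to_page_id` is hypothetical, and rejecting anything that is not "M" followed by a positive integer is an assumption about what counts as a valid id.

```python
import re

# "M" followed by a positive integer with no leading zeros, e.g. "M12345".
_MEDIAINFO_ID = re.compile(r"M([1-9][0-9]*)")


def mediainfo_id_to_page_id(entity_id: str) -> int:
    """Return the page id encoded in a MediaInfo entity id."""
    match = _MEDIAINFO_ID.fullmatch(entity_id)
    if match is None:
        raise ValueError(f"not a MediaInfo entity id: {entity_id!r}")
    return int(match.group(1))


page_id = mediainfo_id_to_page_id("M72261258")
```

The resulting integer can then be used as the join key against the page id in the mediawiki database, which covers the use case that the "title" field was requested for.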

Is there anything left to be done for this task? @Cparle got anything on your todo list for you or me? If not, we could close this out :-)

Nothing left to do afaik. Hooray! I'll close the ticket