
Create RDF dump of structured data on Commons
Open, Normal, Public

Description

We should set up a regular dump of all structured data entities on Commons, akin to dumps of Wikidata entities we have now.


Event Timeline

Smalyshev triaged this task as Normal priority. May 1 2019, 8:08 PM

Change 516441 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add option --ignore-missing to dumper

https://gerrit.wikimedia.org/r/516441

Change 516444 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Set up dumps for mediainfo RDF generation

https://gerrit.wikimedia.org/r/516444

A couple thoughts on the above patchset:

Can we get away with a 'dumpwikibaserdf.sh' and some well-chosen variables for both wikidata and commons (or at worst two much much shorter scripts) instead of a whole new dumpcommonsrdf.sh?

Additionally, we already have media, mediatitles, mediacounts and imageinfo being served out of other/. Mediainfo is the first name I would have chosen too but maybe we can find something else that will keep future dump maintainers from swearing at us.

Last comment is that I really dislike the links under xmldatadumps/public/{wikidata,commonswiki}. That directory should be only for xml/sql dumps and we should be moving away from having links to other stuff there, not adding new ones. We need a plan...

I'm trying to think of these dumps in the context of other projects possibly using structured data at some point, and how we can facilitate new dumps in the future.

Can we get away with a 'dumpwikibaserdf.sh' and some well-chosen variables for both wikidata and commons (or at worst two much much shorter scripts) instead of a whole new dumpcommonsrdf.sh?

I thought about it, and it might be possible, but it would make the script even more complex and even less readable, as it would be pretty much variables on top of variables on top of variables. That's why I didn't go that route - it's too easy to make an error while piling those variables on top of each other, and too hard to find one afterwards... I understand copy-pasting is usually bad, but I wonder if we want to tolerate it for the sake of readability. If you think it's a must, I can redo it using a common script, but that would probably make the common part harder to understand and verify.

Mediainfo is the first name I would have chosen too but maybe we can find something else that will keep future dump maintainers from swearing at us.

This is a good time for proposals, while we have not committed to anything yet. That said, given that it'd be in other/wikibase/commonswiki, I don't think it can be confused with anything else.
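For illustration, the output would sit next to the existing wikidata entity dumps; the dated directory and file names below are just my guess at the pattern, nothing is decided yet:

other/wikibase/wikidatawiki/20190603/wikidata-20190603-all-BETA.ttl.gz
other/wikibase/commonswiki/20190603/commons-20190603-mediainfo.ttl.gz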

Last comment is that I really dislike the links under xmldatadumps/public/{wikidata,commonswiki}. That directory should be only for xml/sql dumps and we should be moving away from having links to other stuff there, not adding new ones. We need a plan...

I just did what the wikidata dump is already doing. I don't have a strong opinion on which way is better, but I think it should be symmetric for both RDF dumps - either both have the links, or neither does. So you are welcome to propose any model you prefer; I have no preference here except for consistency.
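For context, my understanding (this is my reading of the current setup, not something decided in this task) is that the link in question is just a symlink in the public xml/sql tree pointing back into other/, roughly:

# illustrative only - the actual link name and target may differ
ln -s ../other/wikibase/wikidatawiki xmldatadumps/public/wikidatawiki/entities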

ArielGlenn added a subscriber: hoo. Jun 12 2019, 7:08 AM

I'm going to have a look at the code duplication a bit over the next few days and see if I can come up with a counterproposal patch. If not, then we'll just go ahead; I totally understand where you're coming from.

As far as the links go, I'd like to drag @hoo back into this conversation, since those links under xmldatadumps/public/blah are so-called legacy links, and legacy means there should be a plan to get rid of them. So let's get a plan! :-)

Change 516441 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add option --ignore-missing to dumper

https://gerrit.wikimedia.org/r/516441

Change 517670 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] refactor wikidata entity dumps into wikibase + wikidata specific bits

https://gerrit.wikimedia.org/r/517670

While I"m sure the above is quite broken in a variety of ways, this is the sort of thing I had in mind, being able to drop in one file with just values specific to commons (or whatever wikibase thing might come our way later), and change the project name only, getting everything else 'for free'. If we wind up wanting json dumps for commons/future projects, it should not be hard to do something similar for them. I personally would like to see json dumps happen btw; is that on the road map?

No idea about JSON dumps, but I don't see any reason not to. I don't need them, but since it's Wikibase, it makes sense to have them too.

Should we have a ticket for misc cleanup like 'get rid of the BETA in filenames' and 'get rid of the legacy directory stuff' for the json dumps?

MB-one removed a subscriber: MB-one. Jun 19 2019, 9:15 AM

The gerrit change is ready for me to test now, probably by playing with it in beta a whole bunch.

I've added my reviews and updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/516444 to base on refactoring patch.

Addshore moved this task from incoming to in progress on the Wikidata board. Jun 21 2019, 11:25 PM

I plan to deploy the refactoring patchset Sunday in between runs (possibly today if the json dump and the others all finish up at a reasonable hour). I ran wikidata dumps in beta with and without the changeset (with altered values for shard, minfilesize and batchsize) and the new dumps look fine.

Running into a new problem testing on beta.

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 2 --batch-size 2000 --format ttl  --no-cache --dbgroupdefault dump --part-id 5-0 --first-page-id 1 --last-page-id 16000   
Dumping entities of type item, property, lexeme, form, sense
Dumping shard 1/2
Wikimedia\Rdbms\DBConnectionError from line 1392 of /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: No working replica DB server: Unknown error
#0 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(467): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(897): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(false, Array, 'wikidatawiki')
#2 /srv/mediawiki/php-master/includes/site/DBSiteStore.php(78): Wikimedia\Rdbms\LoadBalancer->getConnection(-1)
#3 /srv/mediawiki/php-master/includes/site/DBSiteStore.php(65): DBSiteStore->loadSites()
#4 /srv/mediawiki/php-master/includes/site/CachingSiteStore.php(111): DBSiteStore->getSites()
#5 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpRdf.php(187): CachingSiteStore->getSites()
#6 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/DumpEntities.php(200): Wikibase\DumpRdf->createDumper(Resource id #716)
#7 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpRdf.php(138): Wikibase\DumpEntities->execute()
#8 /srv/mediawiki/php-master/maintenance/doMaintenance.php(99): Wikibase\DumpRdf->execute()
#9 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpRdf.php(202): require_once('/srv/mediawiki/...')
#10 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#11 {main}

That's new broken behavior. Same for json dumps:

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --shard 1 --sharding-factor 2 --batch-size 500 --snippet 2 --entity-type item --entity-type property --no-cache  --first-page-id 1
Dumping entities of type item, property
Dumping shard 1/2
Wikimedia\Rdbms\DBConnectionError from line 1392 of /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: No working replica DB server: Unknown error
#0 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(467): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(897): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(false, Array, 'wikidatawiki')
#2 /srv/mediawiki/php-master/includes/GlobalFunctions.php(2568): Wikimedia\Rdbms\LoadBalancer->getConnection(-1, Array, 'wikidatawiki')
#3 /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Store/Sql/SqlEntityIdPager.php(104): wfGetDB(-1)
#4 /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php(257): Wikibase\Repo\Store\Sql\SqlEntityIdPager->fetchIds(500)
#5 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/DumpEntities.php(221): Wikibase\Dumpers\DumpGenerator->generateDump(Object(Wikibase\Repo\Store\Sql\SqlEntityIdPager))
#6 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpJson.php(100): Wikibase\DumpEntities->execute()
#7 /srv/mediawiki/php-master/maintenance/doMaintenance.php(99): Wikibase\DumpJson->execute()
#8 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpJson.php(126): require_once('/srv/mediawiki/...')
#9 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#10 {main}

But I can request a replica and apparently get one:

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php maintenance/getReplicaServer.php --wiki wikidatawiki
deployment-db06

I must be missing something, but I don't see it.

hoo added a comment. Jul 10 2019, 5:17 PM

Apparently this is related to the --dbgroupdefault (doesn't seem to depend on the value):

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki --dbgroupdefault dump
> echo get_class(wfGetDB( DB_REPLICA ));
Caught exception Wikimedia\Rdbms\DBConnectionError: Cannot access the database: No working replica DB server: Unknown error
…
dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki
> echo get_class(wfGetDB( DB_REPLICA ));
Wikimedia\Rdbms\DatabaseMysqli

This is also always going to hit the Wikibase dumpers, as they set the dbgroupdefault even if it's not given via CLI.

Ah I didn't even think about the param being set in the script. I had removed it during testing to see if that helped, and... nada.

Change 521920 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/core@master] LoadBalancer::getConnectionIndex: Also fallback to generic group

https://gerrit.wikimedia.org/r/521920

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Jul 11 2019, 4:52 AM

Change 521920 abandoned by Hoo man:
LoadBalancer::getConnectionIndex: Also fallback to generic group

Reason:
Better solved through https://gerrit.wikimedia.org/r/c/mediawiki/core/+/523412

https://gerrit.wikimedia.org/r/521920

The refactor patchset now checks out with all the wikidata dumps including json. I'd like to deploy it this weekend, giving plenty of time to make sure it's ok, test the structured data patchset, and then be able to deploy that separately.

Smalyshev claimed this task. Aug 1 2019, 6:53 PM

I sincerely apologize: this weekend the heat baked my brain and I did nothing related to computers at all. And Friday evening I was out. I'll set a notification to remind me this coming Friday earlier in the day, so that this gets done.

Change 517670 merged by ArielGlenn:
[operations/puppet@production] refactor wikidata entity dumps into wikibase + wikidata specific bits

https://gerrit.wikimedia.org/r/517670

I think T222497 should be resolved before this goes live. I can test it in deployment-prep before then, but I don't want to do production tests until there is some sort of batching.

I checked on this week's run (which is with the refactor patch deployed) and didn't see anything amiss so far.

I've already added a fix for one of the issues in T222497, but it doesn't fix everything. I think it would still be interesting to test what happens in production - maybe not a full dump but just a partial one, to estimate what we're dealing with and how bad it is (a possible invocation is sketched below). Maybe because we don't have too many mediainfo records yet and they're small, we could still be fine?

Also, the patch itself doesn't actually turn the cron on; it just puts the files there. We'd need to flip the "files only" switch to actually produce a working cron job.
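The partial test could reuse the flags we already pass for the wikidata runs; something along these lines (the page-id range below is arbitrary, and I'm assuming the mediainfo entity type can be selected with --entity-type the same way the json dumper selects items and properties):

/usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --shard 0 --sharding-factor 8 --batch-size 500 --format ttl --no-cache --dbgroupdefault dump --entity-type mediainfo --first-page-id 1 --last-page-id 100000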

I'm not thinking about the amount of time it takes, but rather the load on the database servers. Reasonably sized batched queries will be better, as I've already seen with stub dumps and slot retrieval.

I tried to manually dump the mediainfo entries over the weekend; it took 376 minutes for 4 shards (a lot, but less than I expected) and produced 1724656 items. It does not seem to put significant load on the DB so far, but it gives about 20 items/second, which seems too slow. If we ever get all files having items, that'd take 4 days to process over 8 shards, probably more since DB access will get slower; right now it's not too slow because only 2% of files have items, so there are not too many DB queries.
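Just to show where that rate comes from (assuming the four shards ran in parallel for the full 376 minutes):

# back-of-the-envelope check of the numbers above
items=1724656
minutes=376
shards=4
echo "$(( items / (minutes * 60) )) items/s across all shards"        # ~76
echo "$(( items / (minutes * 60 * shards) )) items/s per shard"       # ~19, i.e. the ~20/s quoted above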

Smalyshev removed Smalyshev as the assignee of this task. Wed, Sep 4, 5:52 AM