
Create RDF dump of structured data on Commons
Open, MediumPublic


We should set up a regular dump of all structured data entities on Commons, akin to dumps of Wikidata entities we have now.


Event Timeline


No idea about JSON dumps, but I don't see any reason not to. I don't need them, but since it's Wikibase, it makes sense to have them too.

Should we have a ticket for misc cleanup like 'get rid of the BETA in filenames' and 'get rid of the "legacy directory" stuff' for the json dumps?

MB-one removed a subscriber: MB-one.Jun 19 2019, 9:15 AM

The gerrit change is ready for me to test now, probably by playing with it in beta a whole bunch.

I've added my reviews and updated to base on refactoring patch.

Addshore moved this task from incoming to in progress on the Wikidata board.Jun 21 2019, 11:25 PM

I plan to deploy the refactoring patchset Sunday in between runs (possibly today if the json dump and the others all finish up at a reasonable hour). I ran wikidata dumps in beta with and without the changeset (with altered values for shard, minfilesize and batchsize) and the new dumps look fine.

Running into a new problem testing on beta.

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 2 --batch-size 2000 --format ttl  --no-cache --dbgroupdefault dump --part-id 5-0 --first-page-id 1 --last-page-id 16000   
Dumping entities of type item, property, lexeme, form, sense
Dumping shard 1/2
Wikimedia\Rdbms\DBConnectionError from line 1392 of /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: No working replica DB server: Unknown error
#0 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(467): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(897): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(false, Array, 'wikidatawiki')
#2 /srv/mediawiki/php-master/includes/site/DBSiteStore.php(78): Wikimedia\Rdbms\LoadBalancer->getConnection(-1)
#3 /srv/mediawiki/php-master/includes/site/DBSiteStore.php(65): DBSiteStore->loadSites()
#4 /srv/mediawiki/php-master/includes/site/CachingSiteStore.php(111): DBSiteStore->getSites()
#5 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpRdf.php(187): CachingSiteStore->getSites()
#6 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/DumpEntities.php(200): Wikibase\DumpRdf->createDumper(Resource id #716)
#7 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpRdf.php(138): Wikibase\DumpEntities->execute()
#8 /srv/mediawiki/php-master/maintenance/doMaintenance.php(99): Wikibase\DumpRdf->execute()
#9 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpRdf.php(202): require_once('/srv/mediawiki/...')
#10 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#11 {main}

That's new broken behavior. Same for json dumps:

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --shard 1 --sharding-factor 2 --batch-size 500 --snippet 2 --entity-type item --entity-type property --no-cache  --first-page-id 1
Dumping entities of type item, property
Dumping shard 1/2
Wikimedia\Rdbms\DBConnectionError from line 1392 of /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: No working replica DB server: Unknown error
#0 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(467): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-master/includes/libs/rdbms/loadbalancer/LoadBalancer.php(897): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(false, Array, 'wikidatawiki')
#2 /srv/mediawiki/php-master/includes/GlobalFunctions.php(2568): Wikimedia\Rdbms\LoadBalancer->getConnection(-1, Array, 'wikidatawiki')
#3 /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Store/Sql/SqlEntityIdPager.php(104): wfGetDB(-1)
#4 /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php(257): Wikibase\Repo\Store\Sql\SqlEntityIdPager->fetchIds(500)
#5 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/DumpEntities.php(221): Wikibase\Dumpers\DumpGenerator->generateDump(Object(Wikibase\Repo\Store\Sql\SqlEntityIdPager))
#6 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpJson.php(100): Wikibase\DumpEntities->execute()
#7 /srv/mediawiki/php-master/maintenance/doMaintenance.php(99): Wikibase\DumpJson->execute()
#8 /srv/mediawiki/php-master/extensions/Wikibase/repo/maintenance/dumpJson.php(126): require_once('/srv/mediawiki/...')
#9 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#10 {main}

But I can request a replica and apparently get one:

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php maintenance/getReplicaServer.php --wiki wikidatawiki

Must be missing something but I don't see it.

hoo added a comment.Jul 10 2019, 5:17 PM

Apparently this is related to the --dbgroupdefault (doesn't seem to depend on the value):

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki --dbgroupdefault dump
> echo get_class(wfGetDB( DB_REPLICA ));
Caught exception Wikimedia\Rdbms\DBConnectionError: Cannot access the database: No working replica DB server: Unknown error
dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php eval.php --wiki wikidatawiki
> echo get_class(wfGetDB( DB_REPLICA ));

This is also going to always hit the Wikibase dumpers as they set the dbgroupdefault even if not given via CLI.

Ah I didn't even think about the param being set in the script. I had removed it during testing to see if that helped, and... nada.

Change 521920 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/core@master] LoadBalancer::getConnectionIndex: Also fallback to generic group
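
A toy model of the fallback named in that patch title may help readers follow the failure. It is not MediaWiki code: the function, the group key and the server name are hypothetical, and it assumes the beta cluster simply has no replica tagged for the requested group, which the thread does not confirm.

```python
# Toy model only: names and data layout are hypothetical, not MediaWiki's API.
GENERIC_GROUP = ""  # stand-in for the default, untagged replica pool


def pick_replica(servers_by_group, group, fallback_to_generic):
    """Return a server for `group`, optionally falling back to the generic pool."""
    candidates = servers_by_group.get(group, [])
    if not candidates and fallback_to_generic:
        candidates = servers_by_group.get(GENERIC_GROUP, [])
    if not candidates:
        raise LookupError("No working replica DB server")
    return candidates[0]


# Assumed beta-like layout: replicas exist, but none is tagged "dump".
beta = {GENERIC_GROUP: ["db-replica-1"]}

try:
    pick_replica(beta, "dump", fallback_to_generic=False)
    strict_result = "ok"
except LookupError as err:
    strict_result = str(err)

lenient_result = pick_replica(beta, "dump", fallback_to_generic=True)

print("without fallback:", strict_result)
print("with fallback:", lenient_result)
```

Under that assumption, a strict lookup fails for any group name that has no servers, which would match the observation that the error does not depend on the value passed to --dbgroupdefault.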

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Jul 11 2019, 4:52 AM

Change 521920 abandoned by Hoo man:
LoadBalancer::getConnectionIndex: Also fallback to generic group

Better solved through /523412

The refactor patchset now checks out with all the wikidata dumps including json. I'd like to deploy it this weekend, giving plenty of time to make sure it's ok, test the structured data patchset, and then be able to deploy that separately.

Smalyshev claimed this task.Aug 1 2019, 6:53 PM

I sincerely apologize: this weekend the heat baked my brain and I did nothing related to computers at all. And Friday evening I was out. I'll set a notification to remind me this coming Friday earlier in the day, so that this gets done.

Change 517670 merged by ArielGlenn:
[operations/puppet@production] refactor wikidata entity dumps into wikibase + wikidata specific bits

I think T222497 should be resolved before this goes live. I can test it in deployment-prep before then, but I don't want to do production tests until there is some sort of batching.

I checked on this week's run (which is with the refactor patch deployed) and didn't see anything amiss so far.

I've already added a fix for one of the issues in T222497, but it doesn't fix everything. I think it would still be interesting to test what happens in production, maybe not a full dump but just a partial one, to estimate what we're dealing with and how bad it is. Maybe, since we don't yet have too many mediainfo records and they're small, we could still be fine?

Also, the patch itself doesn't actually turn the cron on, it just puts the files there. We'd need to flip the "files only" switch to actually produce the working cron.

I'm not thinking about the amount of time it takes, but rather the load on the database servers. Reasonable sized batched queries will be better, as I've seen already with stub dumps and slot retrieval.

I tried to manually dump the mediainfo entries over the weekend; it took 376 minutes for 4 shards (a lot, but less than I expected) and produced 1724656 items. It does not seem to produce significant load on the DB so far, but it gives about 20 items/second, which seems too slow. If we ever get to all files having items, that would take 4 days to process over 8 shards, and probably more since DB access will get slower; right now it is not too slow because only 2% of files have items, so there are not too many DB queries.
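
The arithmetic behind those figures can be rechecked with a short sketch. The item count, run time and shard counts come from the comment above; the total file count is an assumption (roughly 55 million), since the thread only gives the approximate 2% share, and the sketch assumes the 4 shards ran in parallel for the whole 376 minutes.

```python
# Figures taken from the comment above.
ITEMS_DUMPED = 1_724_656
RUN_MINUTES = 376
SHARDS_USED = 4

# Assumption, not stated in the thread: total number of files on Commons.
ASSUMED_TOTAL_FILES = 55_000_000
PLANNED_SHARDS = 8


def items_per_second(items, minutes):
    """Average throughput over the whole run."""
    return items / (minutes * 60)


def days_for_full_dump(total_items, per_shard_rate, shards):
    """Days needed if every shard keeps the measured per-shard rate."""
    return total_items / (per_shard_rate * shards) / 86_400


total_rate = items_per_second(ITEMS_DUMPED, RUN_MINUTES)
per_shard_rate = total_rate / SHARDS_USED
estimate_days = days_for_full_dump(ASSUMED_TOTAL_FILES, per_shard_rate, PLANNED_SHARDS)

print(f"all shards: {total_rate:.1f} items/s")
print(f"per shard:  {per_shard_rate:.1f} items/s")
print(f"full dump:  {estimate_days:.1f} days over {PLANNED_SHARDS} shards")
```

Read this way, the "about 20 items/second" appears to be a per-shard rate, and the 4-day estimate holds only if the per-shard rate does not degrade as more files gain items.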

Smalyshev removed Smalyshev as the assignee of this task.Sep 4 2019, 5:52 AM
Abit added a subscriber: Abit.Jan 8 2020, 7:57 PM

@ArielGlenn What is this blocked or stalled on? Looks like several of the subtasks have been closed, but not all.

@Abit: I need to get my last question on T241149 answered; if these errors only go to stderr then I can at least run a test dump, but if they go to logstash that's 50 million log entries as the task description says, which would be pretty unacceptable. @Cparle has said he could have a look at that in particular, but really anyone who knows that code can have a look.

Because I've gotten a nice run in beta with the --ignore-missing flag, I'm trying a test run on snapshot1008 in a screen session:

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 50000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing 2>>/var/lib/dumpsgen/mediainfo-log.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz

If the output looks good, I'll put it somewhere for WQS testing and move forward with making these weekly runs with the appropriate number of parallel processes.

A batchsize of 50k turned out to be too large. Same with 5k. I'm now running with a batchsize of 500, which will surely be too small, but at least I am getting output. I'll check on it tomorrow and see how it's doing.

ArielGlenn added a comment.EditedJan 13 2020, 10:06 AM

This morning the job was terminated by the oom killer:

[4288057.417443] Out of memory: Kill process 117265 (php) score 868 or sacrifice child
[4288057.425084] Killed process 117265 (php) total-vm:58241128kB, anon-rss:56901636kB, file-rss:0kB, shmem-rss:0kB

It produced a file of size 380M with 2224612 entities in it before being shot. One of the last entries in it is the page File:Gerrardina_foliosa_1.jpg with page id 78 846 520 and mediainfo entity (Depicts) added on Jan 10th, 2020. Also the gz output file is not truncated, so perhaps it is complete. But the max page id currently is 85 865 111, so everything created after the above page (after May 2019) is likely missing. @Abit Should I move the output file somewhere for folks to test with, or would it not be helpful?
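
For scale, the kernel figures above can be converted to GiB with a small sketch. The log line is copied from this comment; the regular expression and helper name are illustrative only.

```python
import re

# Kernel log line copied from the comment above.
OOM_LINE = (
    "[4288057.425084] Killed process 117265 (php) total-vm:58241128kB, "
    "anon-rss:56901636kB, file-rss:0kB, shmem-rss:0kB"
)

KIB_PER_GIB = 1024 * 1024


def parse_oom_line(line):
    """Extract the pid plus total-vm and anon-rss (converted to GiB)."""
    match = re.search(
        r"Killed process (\d+) \((\S+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB", line
    )
    if match is None:
        raise ValueError("not an OOM kill line")
    pid, name, total_vm_kib, anon_rss_kib = match.groups()
    return {
        "pid": int(pid),
        "name": name,
        "total_vm_gib": int(total_vm_kib) / KIB_PER_GIB,
        "anon_rss_gib": int(anon_rss_kib) / KIB_PER_GIB,
    }


info = parse_oom_line(OOM_LINE)
print(f"{info['name']} pid {info['pid']}: "
      f"anon-rss {info['anon_rss_gib']:.1f} GiB, total-vm {info['total_vm_gib']:.1f} GiB")
```

So the single php process was holding roughly 54 GiB of resident memory when it was killed.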

Note to self that a run of

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 250 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 0 --sharding-factor 1  2>/var/lib/dumpsgen/mediainfo-log-small.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard-small.gz

worked fine. Going to run one with a sharding factor of 4 and a batch size 4 times larger, to see how that is.


php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 1000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 1 --sharding-factor 4  2>/var/lib/dumpsgen/mediainfo-log-small-shard.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-of-4-small.gz

and it also ran fine.


php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 500 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 78846320 --last-page-id 79046320 --shard 0 --sharding-factor 1  2>/var/lib/dumpsgen/mediainfo-log-small-shard-oom.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-small-oom.gz

which should cover the page range where we had the oom; it ran to completion fine. I guess that there is some small memory leak that must accumulate over batches, which is what did us in earlier. As long as we limit runs to some reasonable number of pages each time, we should be fine.
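
Limiting each run to a bounded page range could look roughly like the sketch below. The script path and flags are the ones used in the test commands above and the max page id is the one quoted earlier; the 200000-page window, the helper names, and the choice to mirror the `1` to `200001` pattern are assumptions, and whether `--last-page-id` is inclusive is not established in this thread.

```python
MAX_PAGE_ID = 85_865_111   # max page id quoted earlier in this thread
PAGES_PER_RUN = 200_000    # assumed window size, same as the small test runs


def page_ranges(max_page_id, pages_per_run, first_page_id=1):
    """Yield (first, last) pairs following the 1 -> 200001 pattern of the tests."""
    start = first_page_id
    while start <= max_page_id:
        yield start, start + pages_per_run
        start += pages_per_run


def dump_args(first, last, batch_size=500):
    """Build the argument list for one bounded dumpRdf.php run (flags as above)."""
    return [
        "php", "/srv/mediawiki/multiversion/MWScript.php",
        "extensions/Wikibase/repo/maintenance/dumpRdf.php",
        "--wiki", "commonswiki",
        "--batch-size", str(batch_size),
        "--format", "nt",
        "--flavor", "full-dump",
        "--entity-type", "mediainfo",
        "--no-cache",
        "--dbgroupdefault", "dump",
        "--ignore-missing",
        "--first-page-id", str(first),
        "--last-page-id", str(last),
        "--shard", "0",
        "--sharding-factor", "1",
    ]


ranges = list(page_ranges(MAX_PAGE_ID, PAGES_PER_RUN))
print(len(ranges), "runs; first:", ranges[0], "last:", ranges[-1])
print(" ".join(dump_args(*ranges[0])))
```

Each run would start a fresh php process, so whatever accumulates across batches is released between windows; the per-window output files would then need to be concatenated.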

Change 516444 merged by ArielGlenn:
[operations/puppet@production] Set up dumps for mediainfo RDF generation

I plan to try running

/usr/local/bin/ commons full nt

on Thursday morning and see how long it takes with the 8 shards that are currently configured. @Abit is the nt format the one needed for WDQS testing?

@Abit is the nt format the one needed for WDQS testing?

I cannot answer this. @EBernhardson, any idea?

@EBernhardson or some other person from Discovery would know for certain.
Quickly looking at, commands there mention ttl dumps though. Note that I don't really know what I am talking about, just trying to be useful.

I found a ticket that mentions use of ttl files so I'll run

/usr/local/bin/ commons full ttl

and keep an eye on it. Running on snapshot1008 in a screen session. Here we go!

ArielGlenn added a comment.EditedJan 16 2020, 2:13 PM

In there are two ttl files, gz and bz2 compressed. Please have a look!

The bash script producing them complained that

/usr/local/bin/ line 224: setDcatConfig: command not found

and I see that in there is a comment

# TODO: add DCAT info

so folks might want to look at that.

@dcausse is going to check over the ttl dump and let me know if it looks ok; if so then I'll flip the switch for generation weekly and make sure there's cleanup too.

Some unexpected (?) triples popping up that @dcausse is looking into, so the dumps will not be turned on in cron until we have the thumbs up on that. See T243292

If it turns out the data is all ok, we can move forward.

Cparle added a comment.EditedFeb 12 2020, 4:43 PM

@ArielGlenn the structured data team keep coming up as blockers on this ticket in scrum-of-scrums, but we're not blocking you, are we?

If not then maybe I'll remove the tags relevant to us

@Cparle, No blocks on your side, the ball is now in @dcausse 's court. :-)

Hi, just checking in: any progress on investigating the 'extra' dumps content?

@ArielGlenn no not yet, this is still blocked on T243292 which requires some investigation to determine which component (dump or the wdqs transformation process) is wrong.

I see that we're no longer blocked. Does this mean that we're good to go for weekly runs?

Gehel assigned this task to ArielGlenn.May 26 2020, 3:43 PM

We've loaded the dump on a test server and it looks fine. We can start scheduling the dumps regularly.

Change 598787 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] enable dumps of structured data from commons

Change 598787 merged by ArielGlenn:
[operations/puppet@production] enable dumps of structured data from commons

Change 599052 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] fix up cron name of commons structured data dumps

Change 599052 merged by ArielGlenn:
[operations/puppet@production] fix up cron name of commons structured data dumps

@ArielGlenn we plan to make a subtle change to the dump (prefixes), this won't be technically a breaking change but could cause some confusion if users start to assume the presence of some prefixes. Would it be possible to pause the publication of the dumps while we change this? Sorry for the late notice.

@dcausse what's your time frame?

Just a note on the current problem:
the prefixes defined in ttl dumps are identical to the ones used by wikidata e.g.:

@prefix wdt: <> .

This is perfectly valid but might cause some confusion, because when using the commons query service we will likely change these prefixes so that federation with wdqs is more obvious.
I had a look at the code base but haven't found an easy way to override this.
@Lucas_Werkmeister_WMDE do you know if it'd be possible to change the prefixes for the local repository such that for commons would not be wdt?

@dcausse it is possible to customize the prefix. once merged will enable this. The chain of patches is a bit stalled. I will remove some dust off it and hopefully we'll get this in soon

dcausse added a subscriber: CBogen.EditedMay 27 2020, 7:11 PM

Looks like it was decided not to use wikidata specific prefixes for MediaInfo exports but to use a more specific sdc prefix for these (see: T222995).
The code still looks to be hardcoded with wikidata specific prefixes.
It does not look to me like we could make this happen quickly.
I created T253798 to track this work. Since it seems that some refactoring will have to happen (initially we thought it might just be a config change) I wonder if making the dumps available should be blocked by T253798 or go ahead and make them available with a short notice explaining that prefixes might change in the future with a link to that same ticket.
@CBogen I'll leave that decision to you.

Please disregard the message above, this is actually possible with a config change.

@WMDE-leszek oops, sorry, I replied before reading your comment and was reading an old code base... if this is just a config change it can hopefully be merged soon. Thanks!

Change 599856 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Revert "enable dumps of structured data from commons"

Change 599856 merged by Gehel:
[operations/puppet@production] Revert "enable dumps of structured data from commons"

Change 601162 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] sdc dumps: add placeholder entry in dcat setup to avoid syntax errors

Change 601162 merged by ArielGlenn:
[operations/puppet@production] sdc dumps: add placeholder entry in dcat setup to avoid syntax errors

Change 609114 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] snapshots: enable dumps of structured data from commons /609114

Change 609114 merged by Gehel:
[operations/puppet@production] snapshots: enable dumps of structured data from commons /609114