We should set up a regular dump of all structured data entities on Commons, akin to dumps of Wikidata entities we have now.
Mentioned In:
- T259067: Set up generation of JSON dumps for Wikimedia Commons
- T258834: Create a Commons equivalent of the wikidata_entity table in the Data Lake
- T255601: MediaInfo UI breaks when repoDatabase set to "testcommonswiki" instead of false
- T251496: Validate and fix TTL dumps of SDoC
- T243270: Test commons RDF dumps on sdcquery.wmflabs.org
- T231952: [REQUEST] SDC metrics
- T220525: MCR: Import all slots from XML dumps
- T206884: Provide appropriate dumps of Commons including the structured data

Mentioned Here:
- T222995: Decide which prefixes to use for MediaInfo RDF
- T253798: Commons RDF dump should use specific prefixes not the ones used by wikidata
- T243292: Fix the munger to support commons RDF dump
- T241149: rdfDump.php generates error messages when dumping for pages without mediainfo items
- T222497: dumpRDF for MediaInfo entities loads each page individually
The refactor patchset now checks out with all the Wikidata dumps, including JSON. I'd like to deploy it this weekend, leaving plenty of time to make sure it's OK, test the structured data patchset, and then deploy that separately.
I sincerely apologize: this weekend the heat baked my brain and I did nothing related to computers at all. And Friday evening I was out. I'll set a notification to remind me this coming Friday earlier in the day, so that this gets done.
I think T222497 should be resolved before this goes live. I can test it in deployment-prep before then, but I don't want to do production tests until there is some sort of batching.
I checked on this week's run (which is with the refactor patch deployed) and didn't see anything amiss so far.
I've added a fix for one of the issues in T222497 already, but it doesn't fix everything. I think it would still be interesting to test what happens in production — maybe not a full dump but a partial one, to estimate what we're dealing with and how bad it is. Since we don't yet have that many mediainfo records, and they're small, maybe we're still fine?
Also, the patch itself doesn't actually turn the cron on, it just puts the files there. We'd need to flip the "files only" switch to actually produce the working cron.
I'm not thinking about the amount of time it takes, but rather the load on the database servers. Reasonably sized batched queries will be better, as I've already seen with stub dumps and slot retrieval.
I tried to manually dump the mediainfo entries over the weekend; it took 376 minutes for 4 shards (a lot, but less than I expected) and produced 1724656 items. It does not seem to produce significant load on the DB so far, but it gives about 20 items/second, which seems too slow. If we ever get all files having items, that'd take 4 days to process over 8 shards, probably more since DB access will get slower; right now queries are not too slow because only 2% of files have items, so there aren't many DB queries.
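The arithmetic behind those estimates can be sanity-checked like this (the ~65M total file count is my assumption, not a measured figure):

```shell
# Measured: 1724656 items dumped in 376 minutes across 4 shards.
items=1724656
secs=$((376 * 60))
echo "overall rate: $((items / secs)) items/s"   # ~76/s total, i.e. ~19/s per shard

# If every Commons file had an item (~65M files, assumed),
# at ~20 items/s per shard across 8 shards:
echo "full dump: ~$((65000000 / (20 * 8) / 86400)) days"
```

This reproduces both the ~20 items/second per shard and the ~4-day full-dump estimate above.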
@Abit: I need to get my last question on T241149 answered; if these errors only go to stderr then I can at least run a test dump, but if they go to logstash that's 50 million log entries as the task description says, which would be pretty unacceptable. @Cparle has said he could have a look at that in particular, but really anyone who knows that code can have a look.
Because I've gotten a nice run in beta with the --ignore-missing flag, I'm trying a test run on snapshot1008 in a screen session:
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 50000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing 2>>/var/lib/dumpsgen/mediainfo-log.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz
If the output looks good, I'll put it somewhere for WQS testing and move forward with making these weekly runs with the appropriate number of parallel processes.
A batch size of 50k turned out to be too large, as did 5k. I'm now running with a batch size of 500, which will surely be too small, but at least I am getting output. I'll check on it tomorrow and see how it's doing.
This morning the job was terminated by the oom killer:
[4288057.417443] Out of memory: Kill process 117265 (php) score 868 or sacrifice child
[4288057.425084] Killed process 117265 (php) total-vm:58241128kB, anon-rss:56901636kB, file-rss:0kB, shmem-rss:0kB
It produced a file of size 380M with 2224612 entities in it before being shot. One of the last entries in it is the page File:Gerrardina_foliosa_1.jpg, with page id 78846520 and a mediainfo entity (Depicts) added on Jan 10th, 2020. The gz output file is not truncated, so perhaps it is complete. But the max page id currently is 85865111, so everything created after the above page (after May 2019) is likely missing. @Abit Should I move the output file somewhere for folks to test with, or would it not be helpful?
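For scale, the size of the untouched portion of the page-id space can be computed directly (page ids, not files with items, so this overstates the missing entities):

```shell
# Page ids between the last dumped page and the current maximum:
echo $((85865111 - 78846520))              # 7018591 page ids not covered
# As a share of the whole page-id space:
echo "$(( (85865111 - 78846520) * 100 / 85865111 ))%"   # ~8%
```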
Note to self that a run of
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 250 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 0 --sharding-factor 1 2>/var/lib/dumpsgen/mediainfo-log-small.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard-small.gz
worked fine. Going to run one with a sharding factor of 4 and a batch size 4 times larger, to see how that goes.
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 1000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 1 --sharding-factor 4 2>/var/lib/dumpsgen/mediainfo-log-small-shard.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-of-4-small.gz
and it also ran fine.
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 500 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing --first-page-id 78846320 --last-page-id 79046320 --shard 0 --sharding-factor 1 2>/var/lib/dumpsgen/mediainfo-log-small-shard-oom.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-small-oom.gz
which should cover the page range where we had the OOM; it ran to completion fine. I guess there is some small memory leak that accumulates over batches, which is what did us in earlier. As long as we limit each run to some reasonable number of pages, we should be fine.
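Given that, one way to structure the job would be a wrapper that walks the page-id space in bounded ranges, so no single dumpRdf.php process lives long enough for the leak to matter. This is only a sketch: MAX_ID, STEP, and the paths are illustrative, and a real job would also pass --shard / --sharding-factor for parallelism:

```shell
#!/bin/bash
# Sketch: dump mediainfo RDF in bounded page-id ranges to cap per-process memory.
MAX_ID=85865111    # illustrative; would be looked up at run time
STEP=200000        # pages per process; small enough to avoid the OOM seen above
start=1
while [ "$start" -le "$MAX_ID" ]; do
    end=$((start + STEP - 1))
    php /srv/mediawiki/multiversion/MWScript.php \
        extensions/Wikibase/repo/maintenance/dumpRdf.php \
        --wiki commonswiki --batch-size 500 --format nt --flavor full-dump \
        --entity-type mediainfo --no-cache --dbgroupdefault dump \
        --ignore-missing --first-page-id "$start" --last-page-id "$end" \
        2>>/var/lib/dumpsgen/mediainfo-log.txt
    start=$((end + 1))
done | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt.gz
```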
@EBernhardson or some other person from Discovery would know for certain.
Quickly looking at https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_reload_procedure, the commands there mention ttl dumps, though. Note that I don't really know what I'm talking about, just trying to be useful.
I found a ticket that mentions use of ttl files so I'll run
/usr/local/bin/dumpwikibaserdf.sh commons full ttl
and keep an eye on it. Running on snapshot1008 in a screen session. Here we go!
In https://dumps.wikimedia.org/other/wikibase/commonswiki/ there are two ttl files, gz and bz2 compressed. Please have a look!
The bash script producing them complained that
/usr/local/bin/dumpwikibaserdf.sh: line 224: setDcatConfig: command not found
and I see that in commonsrdf_functions.sh there is a comment
# TODO: add DCAT info
so folks might want to look at that.
@ArielGlenn we plan to make a subtle change to the dump (prefixes); technically this won't be a breaking change, but it could cause some confusion if users start to assume the presence of certain prefixes. Would it be possible to pause publication of the dumps while we change this? Sorry for the late notice.
Just a note on the current problem:
the prefixes defined in ttl dumps are identical to the ones used by wikidata e.g.:
@prefix wdt: <http://commons.wikimedia.org/prop/direct/> .
This is perfectly valid but might cause confusion, because for the Commons query service we will likely change these prefixes so that federation with WDQS is more obvious.
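To illustrate the concern: today's dump binds a Wikidata-style prefix label to a Commons URI, while a Commons query service would want a distinct label so that wdt: can keep meaning Wikidata's namespace in federated queries. The sdc-style name below is a placeholder; the actual prefixes are whatever T222995 settles on:

```
# current dump: Wikidata-style label bound to a Commons URI
@prefix wdt: <http://commons.wikimedia.org/prop/direct/> .

# proposed direction: a Commons-specific label (exact name per T222995),
# leaving wdt: free to refer to Wikidata in federated queries
@prefix sdct: <http://commons.wikimedia.org/prop/direct/> .
```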
I had a look at the code base but haven't found an easy way to override this.
@Lucas_Werkmeister_WMDE do you know if it'd be possible to change the prefixes for the local repository such that for commons http://commons.wikimedia.org/prop/direct/ would not be wdt?
@dcausse it is possible to customize the prefixes. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/569260, once merged, will enable this. The chain of patches is a bit stalled; I'll dust it off and hopefully we'll get this in soon.
Looks like it was decided not to use Wikidata-specific prefixes for MediaInfo exports but to use more specific sdc ones instead (see: T222995). The code still looks to be hardcoded with the Wikidata-specific prefixes, so it does not look to me like we could make this happen quickly.
I created T253798 to track this work. Since it seems that some refactoring will have to happen (initially we thought it might be just a config change), I wonder whether making the dumps available should be blocked on T253798, or whether we should go ahead and publish them with a short notice, linking to that ticket, explaining that the prefixes might change in the future.
@CBogen I'll leave that decision to you.
Please disregard the message above, this is actually possible with a config change.
Change 609114 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] snapshots: enable dumps of structured data from commons