Even after batching the Title-from-EntityId lookups for mediainfo entities to reduce the number of queries during dumps (see T222497), the MediaInfo code still runs Title::newFromID() twice for each entity being dumped; for Items it runs only once. This needs to be fixed.
|mediawiki/extensions/WikibaseMediaInfo||master||+75 -44||Prevent duplicate calls to Title:newFromID during dumps|
Doing some initial testing on beta, as the dumpsgen user from snapshot01.
For one output file with one shard:
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --shard 1 --sharding-factor 2 --batch-size 70000 --format ttl --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --part-id '1-2' 1 70000 2>> /var/lib/dumpsgen/mediainfo-dumps-log-ttl.txt | gzip > /mnt/dumpsdata/xmldatadumps/temp/mediainfo-dumps-test-ttl.gz
php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --shard 1 --sharding-factor 2 --batch-size 70000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --part-id '1-2' 1 70000 2>> /var/lib/dumpsgen/mediainfo-dumps-log-nt.txt | gzip > /mnt/dumpsdata/xmldatadumps/temp/mediainfo-dumps-test-nt.gz
Haven't checked query execution yet, but I did notice one thing: for pages (in namespace 6) with no mediainfo slot, an error message of the following form is logged:
"[failed-to-dump]: Failed to dump M70620 (Entity not found: M70620)"
Given that most pages on Commons don't have mediainfo slots, this is going to be 50 million log entries. Can we silence this specific entry somehow, maybe via a command-line flag? If I should open a separate ticket for this issue, please let me know.
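Until the script itself can suppress these entries, one possible stopgap is to filter them out of the log on the shell side. The following is only a rough sketch: the function names are made up for illustration, and the pattern assumes every such entry looks like the M70620 example above. It is not a substitute for a proper flag in dumpRdf.php.

```shell
#!/bin/bash
# Assumed shape of the noisy entries, based on the single example above:
#   [failed-to-dump]: Failed to dump M<id> (Entity not found: M<id>)
NOISE_PATTERN='Failed to dump M[0-9]+ \(Entity not found: M[0-9]+\)'

# Hypothetical helper: read a dump log on stdin and print every line
# except the "Entity not found" entries. Other errors are kept.
filter_missing_entity_noise() {
    grep -v -E "$NOISE_PATTERN" || true   # grep exits 1 when nothing is printed
}

# Hypothetical helper: count the "Entity not found" entries in a log on stdin.
count_missing_entity_noise() {
    grep -c -E "$NOISE_PATTERN" || true   # grep exits 1 when the count is 0
}
```

For example, `count_missing_entity_noise < /var/lib/dumpsgen/mediainfo-dumps-log-ttl.txt` would show how much of the test log is this one message, and `filter_missing_entity_noise` on the same file would leave only the entries worth reading.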
I went ahead and opened a new ticket, see T241149
In the meantime, I checked query output for a 5-page run. While there's some scary stuff in there, it's all per batch, so I'm gonna grit my teeth and ignore it for now; with a small enough batch size it won't kill us. Remind me to hit up the DBAs later though! The rest checks out, so I think we can consider this done. Wanna have the honor of closing, @Cparle?