dumpRdf for mediainfo entities loads data from db more often than it needs to
Closed, Resolved (Public)

Description

After batching Title-from-EntityId lookups for mediainfo entities to reduce the number of queries during dumps (see T222497), it turns out that the MediaInfo code still runs Title::newFromID() twice for each entity being dumped, whereas it only gets run once for Items. This needs to be fixed.

Event Timeline

Change 556753 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] Work in progress

https://gerrit.wikimedia.org/r/556753

Change 556753 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Prevent duplicate calls to Title:newFromID during dumps

https://gerrit.wikimedia.org/r/556753

This should hit production this week. Ariel, please let Cormac know directly if it doesn't work 😄

Sure will. I plan to test in beta early this week, before it gets out to all the wikis.

Doing some initial testing on beta, as the dumpsgen user from snapshot01.

For one output file with one shard:

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --shard 1 --sharding-factor 2 --batch-size 70000 --format ttl --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --part-id '1-2' 1 70000  2>> /var/lib/dumpsgen/mediainfo-dumps-log-ttl.txt | gzip > /mnt/dumpsdata/xmldatadumps/temp/mediainfo-dumps-test-ttl.gz

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --shard 1 --sharding-factor 2 --batch-size 70000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump --part-id '1-2' 1 70000  2>> /var/lib/dumpsgen/mediainfo-dumps-log-nt.txt | gzip > /mnt/dumpsdata/xmldatadumps/temp/mediainfo-dumps-test-nt.gz

Haven't checked query execution yet, but I did notice one thing: for pages (in namespace 6) with no mediainfo slot, an error message is logged of the form:
"[failed-to-dump]: Failed to dump M70620 (Entity not found: M70620)"

Given that most pages on commons don't have mediainfo slots, this is going to be 50 million log entries. Can we silence the specific entry somehow, maybe via a command line flag? If I should open another ticket for this issue separately, please let me know.
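Until there is a way to silence these entries at the source, one stopgap is to filter the log after the fact. The following is only a rough sketch, not the command line flag asked about above: it builds a small sample log in a temporary file (the third line is a made-up placeholder, not real dumpRdf.php output), counts the "Entity not found" entries, and keeps everything else so that other failures stay visible.

```shell
#!/bin/sh
# Sample log for illustration only. The first two lines follow the message
# format quoted above; the third line is a hypothetical placeholder.
log=$(mktemp)
cat > "$log" <<'EOF'
[failed-to-dump]: Failed to dump M70620 (Entity not found: M70620)
[failed-to-dump]: Failed to dump M70621 (Entity not found: M70621)
placeholder: unrelated log line
EOF

# Count the entries that are expected for pages without a mediainfo slot.
missing_count=$(grep -c 'Entity not found' "$log")

# Keep every other line so that unexpected failures are not hidden.
other_errors=$(grep -v 'Entity not found' "$log")

echo "entity-not-found entries: $missing_count"
echo "remaining log lines:"
echo "$other_errors"

rm -f "$log"
```

On a real run the same two grep calls could be pointed at the stderr log written by the commands above (for example /var/lib/dumpsgen/mediainfo-dumps-log-ttl.txt). This only trims the log for reading; it does not stop the entries from being written.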

I went ahead and opened a new ticket, see T241149

In the meantime, I checked query output for a 5-page run. While there's some scary stuff in there, it's all per batch, so I'm gonna grit my teeth and ignore it for now; with a small enough batch size it won't kill us. Remind me to hit up the DBAs later though! The rest checks out, so I think we can consider this done. Wanna have the honor of closing, @Cparle?