dumpRDF for MediaInfo entities loads each page individually
Closed, Resolved, Public

Description

When dumping the entities to RDF, we try to load the data in batches, so that we do not call the database for each entity separately. However, this does not work for MediaInfo, since it has this:

	public function getTitleForId( EntityId $id ) {
		return Title::newFromID( $id->getNumericId() );
	}

This produces a database fetch for every entity ID. The backtrace leading to this is:

frame #0: Title::newFromID(id=13) at /var/www/w/includes/Title.php:471
frame #1: Wikibase\MediaInfo\Content\MediaInfoHandler->getTitleForId(id=Wikibase\MediaInfo\DataModel\MediaInfoId) at /var/www/w/extensions/WikibaseMediaInfo/src/Content/MediaInfoHandler.php:187
frame #2: Wikibase\Repo\Content\EntityContentFactory->getTitleForId(id=Wikibase\MediaInfo\DataModel\MediaInfoId) at /var/www/w/extensions/Wikibase/repo/includes/Content/EntityContentFactory.php:149
 => array_map(callback=array(2), array(9)) (internal function)
frame #3: Wikibase\Dumpers\RdfDumpGenerator->preBatchDump(entities=array(9)) at /var/www/w/extensions/Wikibase/repo/includes/Dumpers/RdfDumpGenerator.php:118
frame #4: Wikibase\Dumpers\DumpGenerator->dumpEntities(entityIds=array(9), dumpCount=unknown type: 10) at /var/www/w/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php:295
frame #5: Wikibase\Dumpers\DumpGenerator->generateDump(idPager=Wikibase\Repo\Store\Sql\SqlEntityIdPager) at /var/www/w/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php:262

We probably need some kind of batching strategy to deal with this. It's different from other entities because other entities can create a Title from an EntityId without DB access. Ironically, the code that uses the Titles - PageProps::getProperties - actually needs page IDs, so it immediately converts the Titles back to IDs. Moreover, I'm not even sure the page properties that we dump even exist for NS_FILE+MediaInfo entities. So we're wasting a lot of DB queries for nothing.
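The cost difference can be shown with a self-contained sketch (plain PHP with a hypothetical FakePageStore stub standing in for the database, not the real MediaWiki classes): a per-ID lookup, which is effectively what Title::newFromID does here, issues one query per entity, while a batched lookup resolves the whole batch in a single query.

```php
<?php
// Hypothetical stub standing in for the database layer; it only counts
// how many queries are issued, returning fake page rows.
class FakePageStore {
	public $queries = 0;

	// One query per call: what calling Title::newFromID per entity costs.
	public function getRow( int $pageId ): array {
		$this->queries++;
		return [ 'page_id' => $pageId, 'page_title' => "Example$pageId.jpg" ];
	}

	// One query for the whole batch: what a batched lookup should cost.
	public function getRows( array $pageIds ): array {
		$this->queries++;
		return array_map(
			fn ( $id ) => [ 'page_id' => $id, 'page_title' => "Example$id.jpg" ],
			$pageIds
		);
	}
}

$ids = range( 1, 9 ); // same batch size as in the backtrace above

$store = new FakePageStore();
foreach ( $ids as $id ) {
	$store->getRow( $id ); // per-entity lookup
}
echo "per-id queries: {$store->queries}\n";   // per-id queries: 9

$store = new FakePageStore();
$store->getRows( $ids ); // batched lookup
echo "batched queries: {$store->queries}\n";  // batched queries: 1
```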

Event Timeline

Smalyshev triaged this task as Medium priority.May 3 2019, 11:42 PM
Smalyshev created this task.

Some batching could be added to RdfDumpGenerator::preBatchDump rather than just doing the array_map there with getTitleForId.

It looks like the ns and title are needed for this LinkBatch bit in PageProps; why they are needed there is another question.
LinkBatch itself would actually be able to do its query (currently in LinkBatch::doQuery) without ns and title, instead using just the page ID.
T208776 would generally help this situation.

No idea if NS_FILE+MediaInfo pages have page props that we actually dump either. If they do not, then we could probably implement something to skip this step for entities that we know will not have page props that we want to dump.

I think we shouldn't wait for T208776 to fix this. Skipping the whole page props step may work, though I am not sure how to implement the condition yet. Or maybe we need to change these APIs so they don't require more than they need - i.e. if a function needs only the page ID, it should receive the page ID, not require the whole Title.

We could probably set $pagePropertiesRdf to empty to skip page props, but this one may actually also be relevant for Commons:

'wb-claims' => [ 'name' => 'statements', 'type' => 'integer' ],

Change 522236 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] When fetching data for pager, add it to the link cache

https://gerrit.wikimedia.org/r/522236

The biggest issue here is that we need a generic solution for both Item and MediaInfo, when:

  • Item has an EntityId derived from the title, but we do not know the page ID (though we could get it from LinkCache). So having an EntityId, we know the title, but not the page ID.
  • MediaInfo has an EntityId derived from the page ID, but we do not know the title.
  • PageProps has an API based on Title[].
  • Even if we modify the PageProps API to work on page IDs, the entity handler has no API that returns a page ID from an EntityId, only one that looks up the whole Title by EntityId.
  • We actually load the whole page information in SqlEntityIdPager - but while LinkCache can cache title => page info, there seems to be no cache that is keyed by page ID. So again, for MediaInfo, where we only have the page ID, we're in trouble.

Change 522236 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] When fetching data for pager, add it to the link cache

https://gerrit.wikimedia.org/r/522236

The link cache is for the title -> id lookup. Seems to me like you need the opposite here. The batch interface would be Title::newFromIDs (note the plural), but that does no caching.
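Assuming the batch interface behaves like Title::newFromIDs (one query for many IDs, no caching), a thin wrapper keyed by page ID could supply the missing cache. This is a standalone sketch: StubTitle is a hypothetical stand-in reproducing only the shape of the plural batch interface, and TitleByIdCache is an invented name, not an existing MediaWiki class.

```php
<?php
// Hypothetical stub standing in for MediaWiki's Title; newFromIDs() mimics
// the real batch interface's shape (one query for many IDs) with fake data.
class StubTitle {
	public static $queries = 0;

	private function __construct( public int $id, public string $text ) {
	}

	/** @param int[] $ids @return StubTitle[] */
	public static function newFromIDs( array $ids ): array {
		self::$queries++; // one query, regardless of batch size
		return array_map( fn ( $id ) => new self( $id, "Example$id.jpg" ), $ids );
	}
}

// Invented cache keyed by page ID, since LinkCache only maps title => id.
class TitleByIdCache {
	private array $cache = [];

	/** @param int[] $ids @return StubTitle[] keyed by page ID */
	public function getTitlesForIds( array $ids ): array {
		$missing = array_diff( $ids, array_keys( $this->cache ) );
		if ( $missing ) {
			foreach ( StubTitle::newFromIDs( array_values( $missing ) ) as $title ) {
				$this->cache[$title->id] = $title;
			}
		}
		return array_intersect_key( $this->cache, array_flip( $ids ) );
	}
}

$cache = new TitleByIdCache();
$cache->getTitlesForIds( [ 1, 2, 3 ] ); // one batched query for all three
$cache->getTitlesForIds( [ 2, 3, 4 ] ); // only id 4 is actually fetched
echo StubTitle::$queries . "\n"; // 2
```

Overlapping batches then cost one query per set of cache misses instead of one per requested ID.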

Is there a concrete ask for the CPT team here?

Change 548733 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] Add getTitlesForIds() method to handler

https://gerrit.wikimedia.org/r/548733

Change 548734 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/Wikibase@master] Batch entity title lookups

https://gerrit.wikimedia.org/r/548734

Change 548733 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Add getTitlesForIds() method to handler

https://gerrit.wikimedia.org/r/548733

Change 548734 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Batch entity title lookups

https://gerrit.wikimedia.org/r/548734

Merged, so resolving.