Page MenuHomePhabricator

dumpRdf does not work with MediaInfo entities
Closed, ResolvedPublic

Description

dumpRdf uses SqlEntityIdPager to fetch IDs by namespace linked to the entity type, and then does this:

$position = (int)$row->page_id;
$entityIds[] = $this->entityIdParser->parse( $row->page_title );

This of course can not work for MediaInfo type entities since the title is not the ID in this case. I think we need a more advanced procedure there, since now page with same title can have more than one slot, and depending on the requested entity type(s) one or more of these slots may be entity IDs we're looking for.

Details

Related Gerrit Patches:
mediawiki/extensions/WikibaseMediaInfo : masterAdd handler allowing to entityIdLookup file pages
mediawiki/extensions/Wikibase : masterMake RDF dump use EntityIdLookup
mediawiki/extensions/Wikibase : masterEnable EntityByLinkedTitleLookup resolve local links

Event Timeline

Smalyshev created this task.May 1 2019, 7:43 PM
Smalyshev removed Smalyshev as the assignee of this task.May 1 2019, 7:57 PM
Smalyshev triaged this task as Normal priority.May 1 2019, 8:09 PM
Smalyshev added a comment.EditedMay 1 2019, 8:17 PM

Looks like mediainfo type is mapped to namespace 6 (NS_FILE), but the real namespace is 144, where we get entities with right titles. Maybe config problem?

Smalyshev edited subscribers, added: matthiasmullie; removed: MB-one.

MediaInfo sets namespace as:

	public static function onWikibaseRepoEntityNamespaces( &$entityNamespacesSetting ) {
		// Tell Wikibase where to put our entity content.
		$entityNamespacesSetting[ MediaInfo::ENTITY_TYPE ] = NS_FILE . '/' . MediaInfo::ENTITY_TYPE;
	}

so this probably is where NS_FILE is coming from. But I am not sure how to translate from this to namespace 144.

daniel added a comment.May 1 2019, 8:50 PM

NS 6 is correct, MediaInfo entities are stored in the File namespace on commons.

To map from Titles to IDs, use and EntityIdLookup.

(I'm a bit out of the loop, but unless something changed radically in the last 6 months or so, what I wrote above should still be true)

Yup, "the real namespace" is "6/mediainfo".

EntityIdLookup has:

	public function getEntityIdForTitle( Title $title );

but I am not sure how it's supposed to work since it does not have slot ID or anything like that as a parameter... And the main implementation does:

		$contentModel = $title->getContentModel();

		try {
			$handler = $this->getEntityHandlerForContentModel( $contentModel );
			return $handler->getIdForTitle( $title );

which would probably not work since the content model for File pages is wikitext.

Looking further into this, we can check if the page has the slot with this:

  1. Create WikiPage from page ID/title
  2. $revisionRecord = $page->getRevisionRecord()
  3. $mediaInfoSlot = $revisionRecord ->getSlot( MediaInfo::ENTITY_TYPE )

But it is still unclear what one would use to obtain MediaInfoId. The scheme is that MediaInfoId is derived from page ID, but I am not sure if there's a generic handler that implements it, or it's all ad-hoc.

Additional complication is that SqlEntityIdPager is supposed to work with multiple entity Ids in batches, and if we fetch from namespace like File where some pages may not have slots that we're looking for then both composing the query to get both non-slot and slot-based IDs, and the batching of those, may become a bit complicated...

I think EntityByLinkedTitleLookup can be used for this, with slight modification that enables SiteLinkTable to produce links for local Wikidata without going to the database. Also, SiteLinkTable for some reason only produces item IDs. Maybe that should be changed too?

Smalyshev added a subscriber: WMDE-leszek.
Cparle added a comment.May 2 2019, 2:44 PM

@Smalyshev you can get the MediaInfo id from the page id using an entity id composer:

$entityId = WikibaseRepo::getDefaultInstance()->getEntityIdComposer()->composeEntityId(
			'',
			MediaInfo::ENTITY_TYPE,
			$pageId
		)->getSerialization();

@Cparle Yes, I know, but this only works for MediaInfo. dumpRdf code does not know that particular page contains a MediaInfo slot (or that such thing as MediaInfo exists at all). That's the whole problem.

daniel added a comment.May 2 2019, 5:34 PM

There is MediaInfoIdLookup which does what you need, I think. But the question remains - how would dumpRdf know to use it?

I think the answer will be that MediaInfo will need to hook into the EntityIdLookup service in a way that allows this to work. Or into EntityByLinkedTitleLookup. The two are pretty similar, though EntityIdLookup is intended to be based on the page that *stores* the entity, while EntityByLinkedTitleLookup is intended to be based on the title of the page that is "connected" to the entity (normally via sitelinks). File pages are both, really: the entity is stored there, and it's also the page for which the entity is the default for data access. So ideally, both EntityIdLookup and EntityByLinkedTitleLookup would be configured to do the right thing for MediaInfo.

But the question remains - how would dumpRdf know to use it?

Exactly. The question is not how to get MediaInfoId from page id - it's easy. The question is how dumpRdf would even know mediainfo is there on the page and know that it needs to use special code for that specific purpose?

will need to hook into the EntityIdLookup service

But EntityIdLookup does not know anything about slots. How would it know to look into 'mediainfo' slot on the page? Should we add a method to this API? Or a parameter?

BTW I am not sure how MediaInfoIdLookup is supposed to work:

		if ( !$title->inNamespace( $this->mediaInfoNamespace ) ) {
			return null;
		}

		try {
			return new MediaInfoId( $title->getText() );
		} catch ( InvalidArgumentException $ex ) {
			return null;
		}

The file pages are in NS_FILE namespace and generating MediaInfoId from their title is sure to fail. And there's no separate namespace for MediaInfo entities anymore, as I understand. So what this code actually is supposed to do? @daniel, @WMDE-leszek - is this relevant anymore?

daniel added a comment.May 2 2019, 9:49 PM

BTW I am not sure how MediaInfoIdLookup is supposed to work:

Oh right, that implementation was only for the standalone MediaInfo. Sorry about that.

The file pages are in NS_FILE namespace and generating MediaInfoId from their title is sure to fail.

Actually, the MediaInfoId can be generated from the Title object. It's based on the page ID. No access to the slot is needed.

The implication is however that there can only be one entity per page, in any slot. That's fine for now, but a bit annoying and arbitrary. One might expect that multiple entities could reside in different slots. Not to speak of "sub-entities".

Actually, the MediaInfoId can be generated from the Title object. It's based on the page ID. No access to the slot is needed.

For generating ID, sure. But we don't know if the particular file page has the slot, and if it does not, we shouldn't generate the ID, probably. Also, as you noticed, what if there's ever more than one?

I could hack together temporary implementation based on EntityByLinkedTitleLookup, which would rely on MediaInfo hook and assume only MediaInfo uses slots, and does it in a very particular way it does now, but we probably need to look into more comprehensive solution as to how to enumerate such things.

Change 507900 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Enable EntityByLinkedTitleLookup resolve local links

https://gerrit.wikimedia.org/r/507900

Smalyshev added a comment.EditedMay 3 2019, 12:59 AM

With the patch above, it should be possible to use EntityByLinkedTitleLookup in the rdf loop and this should also produce mediainfo IDs.

But there's another problem - EntityByLinkedTitleLookup only accepts title, and MediaInfo needs page ID. So it would generate database lookup for each item, which would kill performance. Looks like we need something like EntityIdLookup API, where we can work with Title, but somehow enable MediaInfo to override it.

Change 507910 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Make RDF dump use EntityIdLookup

https://gerrit.wikimedia.org/r/507910

Change 507912 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/WikibaseMediaInfo@master] Add handler allowing to entityIdLookup file pages

https://gerrit.wikimedia.org/r/507912

I tried to generate dump with two patches above and it seems to work sanely except for two things:

  1. Lot of complaints for non-existing entities (since not all NS_FILEs have mediainfo slots). Easy way to fix it is to shut up warnings. Harder but more correct way is to add query filter which would check for existance of required slot.
  1. T222497: dumpRDF for MediaInfo entities loads each page individually

Change 507910 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Make RDF dump use EntityIdLookup

https://gerrit.wikimedia.org/r/507910

Change 507912 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Add handler allowing to entityIdLookup file pages

https://gerrit.wikimedia.org/r/507912

Smalyshev moved this task from In review to Done on the User-Smalyshev board.Jun 8 2019, 3:00 AM
Addshore moved this task from incoming to in progress on the Wikidata board.Jun 21 2019, 11:25 PM
Smalyshev moved this task from Done to Doing on the User-Smalyshev board.Jun 26 2019, 6:04 AM
Smalyshev moved this task from Doing to Done on the User-Smalyshev board.Jun 26 2019, 6:15 PM
Smalyshev closed this task as Resolved.Jun 26 2019, 6:18 PM