[M] Add 'custommatch' params to commons config for searching media files using wikidata ids
Closed, Resolved (Public)

Description

NOTE: T296814 must be done before this

User story

As a user, I want to be able to get an image suggestion (with a confidence score) for a particular image id.
As a developer, I want to be able to store suggested image data with confidence scores.
To make this possible I need to
a) have appropriate data in the commonswiki search index (T296814 and related tasks)
b) configure commons media search to make use of the new data (this ticket)


Once the weighted_tags field of the commons search index is populated with data from wikidata, we need to enable searching for images via wikidata ids. The following will need to be added to the commons config:

$wgMediaInfoCustomMatchFeature = [
	'depicts_or_linked_from' => [
		'fields' => [
			'statement_keywords' => [
				[ 'prefix' => 'P180=', 'boost' => 0.06800689749434177 ], // depicts
				[ 'prefix' => 'P6243=', 'boost' => 0.0001 ], // digital representation of (arbitrary small value)
			],
			'weighted_tags' => [
				[ 'prefix' => 'image.linked.from.wikidata.p18/', 'boost' => 0.9862993091599952 ],
				[ 'prefix' => 'image.linked.from.wikidata.p373/', 'boost' => 7.190793838918551 ],
				[ 'prefix' => 'image.linked.from.wikidata.sitelink/', 'boost' => 5.031161363459293 ],
			],
		],
		// logistic function
		'functionScore' => [
			'scriptCode' => '100 / ( 1 + exp( -1 * ( _score + intercept ) ) )',
			'params' => [ 'intercept' => -1.3459675572537635 ],
		]
	],
];

When this is in place, users will be able to search for images of cats, for example, using custommatch:depicts_or_linked_from=Q146
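As a rough illustration of the logistic functionScore in the config above, here is a minimal Python sketch. The boosts and intercept are copied from that config. The function names `confidence` and `raw_score` are made up for this example, and treating `_score` as the plain sum of the boosts of the matching prefixes is an assumption, not something this task confirms.

```python
import math

# Intercept copied from the 'params' entry in the config above.
INTERCEPT = -1.3459675572537635

# Boosts copied from the same config, keyed by prefix.
BOOSTS = {
    "P180=": 0.06800689749434177,
    "P6243=": 0.0001,
    "image.linked.from.wikidata.p18/": 0.9862993091599952,
    "image.linked.from.wikidata.p373/": 7.190793838918551,
    "image.linked.from.wikidata.sitelink/": 5.031161363459293,
}


def confidence(score, intercept=INTERCEPT):
    """Logistic function from 'scriptCode': maps a raw score into (0, 100)."""
    return 100 / (1 + math.exp(-1 * (score + intercept)))


def raw_score(matched_prefixes):
    """ASSUMPTION: the raw _score is the sum of the boosts of the matching prefixes."""
    return sum(BOOSTS[prefix] for prefix in matched_prefixes)


if __name__ == "__main__":
    examples = (
        ["P180="],
        ["image.linked.from.wikidata.p18/"],
        ["P180=", "image.linked.from.wikidata.p373/"],
    )
    for matched in examples:
        print(matched, round(confidence(raw_score(matched)), 2))
```

Under that assumption, a file that matches only on a low-boost prefix such as P180= stays well below 50, while a match on a high-boost weighted tag pushes the result towards 100.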

Event Timeline

Cparle updated the task description.
CBogen renamed this task from "Add 'custommatch' params to commons config for searching media files using wikidata ids" to "[M] Add 'custommatch' params to commons config for searching media files using wikidata ids". Jan 26 2022, 5:32 PM

We've changed our approach to calculating confidence scores, and are now estimating them before storing image suggestions. This ticket is therefore no longer necessary for image suggestions, as we don't have another use case for getting images for a particular Q-id.

It might, however, prove useful for some other kind of image search in the future, so I'm not closing it.

Updated boosts from Inspiration Week 2022:

$wgMediaInfoCustomMatchFeature = [
	'depicts_or_linked_from' => [
		'fields' => [
			'statement_keywords' => [
				[ 'prefix' => 'P180=', 'boost' => 0.05032411940299784 ], // depicts
				[ 'prefix' => 'P6243=', 'boost' => 0.0001 ], // digital representation of (arbitrary small value)
			],
			'weighted_tags' => [
				[ 'prefix' => 'image.linked.from.wikidata.p18/', 'boost' => 1.984794590275781 ],
				[ 'prefix' => 'image.linked.from.wikidata.p373/', 'boost' => 5.739424158364518 ],
				[ 'prefix' => 'image.linked.from.wikipedia.lead_image/', 'boost' => 3.74983393205065 ],
			],
		],
		// logistic function
		'functionScore' => [
			'scriptCode' => '100 / ( 1 + exp( -1 * ( _score + intercept ) ) )',
			'params' => [ 'intercept' => -0.29925433614892966 ],
		]
	],
];
Cparle claimed this task.

Deployed and in production

@Cparle Can this feature be documented on-wiki somewhere? I thought it would be useful for a project I am working on, but I'm not sure I fully understand the design.

I understand what I am looking at here: https://commons.wikimedia.org/w/index.php?search=haswbstatement%3AP180%3DQ5296&title=Special:MediaSearch&go=Go&type=image

But I do not understand what the custommatch operator is doing here:
https://commons.wikimedia.org/w/index.php?search=custommatch%3Adepicts_or_linked_from%3DQ5296&title=Special:MediaSearch&go=Go&type=image

There are many images there for which Q5296 is not the value of a P180/P6243 statement, and which are not linked from the Wikidata item, as far as I can tell. For example this one:

https://commons.wikimedia.org/wiki/File:Al-Ahzab_Battle_map-2.svg

I am trying to understand if this is a bug, or just bad data.

First, some background ...

custommatch:depicts_or_linked_from=Qxxx constructs a query that will return any article:

  • with P180=Qxxx in the statement_keywords field in the article's elasticsearch document
  • with P6243=Qxxx in the statement_keywords field in the article's elasticsearch document
  • with image.linked.from.wikidata.p18/Qxxx in the weighted_tags field in the article's elasticsearch document
  • with image.linked.from.wikidata.p373/Qxxx in the weighted_tags field in the article's elasticsearch document
  • with image.linked.from.wikipedia.lead_image/Qxxx in the weighted_tags field in the article's elasticsearch document
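The list above can be sketched in Python as an expansion of one Wikidata id into field/term pairs. The prefixes are copied from the updated config in this task. `expand_terms` and `matches` are hypothetical helpers, and the plain dict standing in for an elasticsearch document, along with the exact string comparison, is a simplification (real index entries may carry extra data that this sketch ignores).

```python
# Prefixes copied from the updated config in this task, grouped by index field.
FIELDS = {
    "statement_keywords": ["P180=", "P6243="],
    "weighted_tags": [
        "image.linked.from.wikidata.p18/",
        "image.linked.from.wikidata.p373/",
        "image.linked.from.wikipedia.lead_image/",
    ],
}


def expand_terms(item_id):
    """Return the (field, term) pairs that a document may match for one Wikidata id."""
    return [
        (field, prefix + item_id)
        for field, prefixes in FIELDS.items()
        for prefix in prefixes
    ]


def matches(document, item_id):
    """Hypothetical helper: True if any expanded term appears in the document.

    `document` is a plain dict of field name -> list of strings, standing in
    for the elasticsearch document of a file page.
    """
    return any(
        term in document.get(field, [])
        for field, term in expand_terms(item_id)
    )


if __name__ == "__main__":
    for field, term in expand_terms("Q146"):
        print(field, term)
```

Because any single term is enough for a match, a file can be returned purely on a weighted_tags entry even when it has no depicts statement for the item.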

The statement_keywords fields are populated whenever a user edits a File page, so if I add "depicts: cat" to an image on commons, P180=Q146 gets added to the search index.

The weighted_tags fields get updated once a week by a scheduled script, and are based on snapshots of wikidata and the wikipedias.


Looking behind the scenes at your search, it seems to be matching on the weighted_tags field that links an image from the articles it's used as a lead image on. However, when I look at those articles the data doesn't seem to match up, so I think we have a bug. I've raised T317138.