
Create 'connectedtowikidataid' feature for mediasearch
Closed, Resolved (Public)

Description

We have a bunch of new data in an experimental search index for commons (see T286562)

The first attempt to improve search using the new data didn't work (see https://phabricator.wikimedia.org/T286565#7523908)

The new data works in a very similar way to the image matching algorithm, so I think we ought to be able to do better

Let's create a new search feature, connectedtowikidataid, so that you can search directly for images that are related to a Wikidata item, either via one of the new fields or via statements, e.g. by searching for connectedtowikidataid:Q144
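To make the proposed keyword concrete, here is a minimal sketch of how such a search could be issued against the standard MediaWiki search API on Commons. It only builds the request URL. It assumes the keyword is enabled on the target wiki under the name used in this task, which is not guaranteed; the helper name is made up for illustration.

```python
from urllib.parse import urlencode

# Standard MediaWiki API endpoint for Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_connected_search_url(wikidata_id: str, limit: int = 10) -> str:
    """Build a search URL for files connected to a Wikidata item.

    Hypothetical helper: assumes the 'connectedtowikidataid' keyword
    from this task is available on the wiki being queried.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"connectedtowikidataid:{wikidata_id}",
        "srnamespace": 6,  # File namespace
        "srlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


url = build_connected_search_url("Q144")
print(url)
```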

Once it's done, run the analysis script on it to see how it performs and, if possible, compare that with how the image matching algorithm (IMA) itself performs (calling the IMA via https://image-suggestion-api.wmcloud.org/image-suggestions/v0/<product>/<language>/pages/<searchTerm>?source=ima)
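As a rough illustration of the IMA call, the URL template above can be filled in like this. The function name and the example values for product and language are assumptions, not taken from the task, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import quote

# URL template from the task description, with placeholders filled by name.
IMA_TEMPLATE = (
    "https://image-suggestion-api.wmcloud.org/image-suggestions/v0/"
    "{product}/{language}/pages/{search_term}?source=ima"
)


def build_ima_url(product: str, language: str, search_term: str) -> str:
    """Fill in the IMA URL template (hypothetical helper)."""
    return IMA_TEMPLATE.format(
        product=quote(product, safe=""),
        language=quote(language, safe=""),
        search_term=quote(search_term, safe=""),
    )


# "wikipedia" and "en" are assumed example values for the placeholders.
ima_url = build_ima_url("wikipedia", "en", "Q144")
print(ima_url)
```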

Event Timeline

Cparle renamed this task from Create 'connectedtowikidataid' feature for search to Create 'connectedtowikidataid' feature for mediasearch. Nov 23 2021, 4:20 PM

I tuned the boosts using logistic regression in the usual way (having first added all the labeled IMA data from the image-recommendation-test project)

The tuned params are as follows:

$wgMediaInfoConnectedToWikidataIdFeature = [
	'statement_keywords' => [
		'P180=' => 0.06800689749434177, // depicts
		'P6243=' => 0.0001, // digital representation of (arbitrary small value)
	],
	'weighted_tags' => [
		'image.linked.from.wikidata.p18/' => 1000 * 0.9862993091599952,
		'image.linked.from.wikidata.p373/' => 7.190793838918551,
		'image.linked.from.wikidata.sitelinks/' => 5.031161363459293,
	],
	'_logisticFunction' => [
		'enabled' => true,
		'intercept' => -1.3459675572537635,
	],
];
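For readers unfamiliar with boosts tuned by logistic regression, the sketch below shows the textbook way such coefficients and an intercept combine into a score between 0 and 1. It is not taken from the extension: the exact formula used by the search feature is not shown in this task, and the per-field raw scores passed in are hypothetical inputs.

```python
import math

# Values copied from the config above.
INTERCEPT = -1.3459675572537635
BOOSTS = {
    "P180=": 0.06800689749434177,
    "P6243=": 0.0001,
    "image.linked.from.wikidata.p18/": 1000 * 0.9862993091599952,
    "image.linked.from.wikidata.p373/": 7.190793838918551,
    "image.linked.from.wikidata.sitelinks/": 5.031161363459293,
}


def logistic_score(field_scores: dict) -> float:
    """Standard logistic regression prediction.

    field_scores maps a field prefix to a non-negative raw match score
    (hypothetical inputs; the real per-field scores come from the search
    backend). Fields missing from the dict contribute nothing.
    """
    linear = INTERCEPT + sum(BOOSTS[f] * s for f, s in field_scores.items())
    return 1.0 / (1.0 + math.exp(-linear))


no_match = logistic_score({})
depicts_only = logistic_score({"P180=": 10.0})
p18_match = logistic_score({"image.linked.from.wikidata.p18/": 0.001})
print(no_match, depicts_only, p18_match)
```

With no matching field the score is determined by the intercept alone, and every matching field can only raise it because all the boosts are positive.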

Results of analysis, using only labeled data that we gathered from the IMA so that IMA results are comparable with mediasearch results

IMA

F1 Score: 0.6865671641791
Precision@1: 1
Precision@3: 0.99459459459459
Precision@10: 0.99459459459459
Precision@25: 0.99459459459459
Precision@50: 0.99459459459459
Precision@100: 0.99459459459459
Recall: 0.52421652421652
Average precision: 0.52404712404712

Mediasearch with connectedtowikidataid

F1 Score: 0.74586466165414
Precision@1: 1
Precision@3: 1
Precision@10: 1
Precision@25: 0.9957264957265
Precision@50: 0.99588477366255
Precision@100: 0.9919028340081
Recall: 0.59759036144578
Average precision: 0.59737071687852

So the upshot of all this is

  • connectedtowikidataid has *very slightly better* precision than the IMA on its own
  • connectedtowikidataid has *substantially better* recall than the IMA on its own
  • we should be able to drastically improve the precision of mediasearch by incorporating this
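As a sanity check on the figures above, the sketch below recomputes F1 as the harmonic mean of precision and recall and works out the relative recall gain. Using Precision@100 as a stand-in for overall precision is an assumption: the analysis script's own definitions are not shown in this task, so the match is only expected to be approximate.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures copied from the results above.
ima = {
    "precision_at_100": 0.99459459459459,
    "recall": 0.52421652421652,
    "f1": 0.6865671641791,
}
mediasearch = {
    "precision_at_100": 0.9919028340081,
    "recall": 0.59759036144578,
    "f1": 0.74586466165414,
}

for name, m in (("IMA", ima), ("connectedtowikidataid", mediasearch)):
    # Assumption: Precision@100 approximates the overall precision.
    derived = f1(m["precision_at_100"], m["recall"])
    print(f"{name}: reported F1={m['f1']:.5f}, derived F1={derived:.5f}")

# Relative recall improvement of connectedtowikidataid over the IMA.
recall_gain = (mediasearch["recall"] - ima["recall"]) / ima["recall"]
print(f"relative recall gain: {recall_gain:.1%}")
```

The derived values land close to the reported F1 scores, which suggests the difference in F1 between the two approaches comes almost entirely from recall.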

WIP patch below. I forgot to put the bug number in the commit message, so it hasn't shown up here automatically

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/740861/1

Here's a comparison of calling the IMA with Qxxx against searching via connectedtowikidataid for Qxxx

The comparison uses only labeled results tagged with IMA. The IMA only contains suggestions for unillustrated pages, so to get an accurate comparison we need to use only pages we know the IMA can return a result for

IMA

F1 Score: 0.6865671641791
Precision@1: 1
Precision@3: 0.99459459459459
Precision@10: 0.99459459459459
Precision@25: 0.99459459459459
Precision@50: 0.99459459459459
Precision@100: 0.99459459459459
Recall: 0.52421652421652
Average precision: 0.52404712404712

connectedtowikidataid

F1 Score: 0.74586466165414
Precision@1: 1
Precision@3: 1
Precision@10: 1
Precision@25: 0.9957264957265
Precision@50: 0.99588477366255
Precision@100: 0.9919028340081
Recall: 0.59759036144578
Average precision: 0.59737071687852

So searching via connectedtowikidataid seems to be at least as good as using the IMA directly

Change 740861 had a related patch set uploaded (by Matthias Mullie; author: Cparle):

[mediawiki/extensions/WikibaseMediaInfo@master] 'custommatch' feature

https://gerrit.wikimedia.org/r/740861

Change 740861 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] 'custommatch' feature

https://gerrit.wikimedia.org/r/740861

This actually can't be QA'd until T296814 and T298684 are done, so I'm resolving this and will QA it as part of T298684