Calculate image suggestions confidence score without using elasticsearch
Closed, Resolved · Public

Assigned To

	Cparle

Authored By

	Cparle
	Feb 14 2022, 4:39 PM

Description

For the current iteration of image suggestions we have a tuned search profile in which the Elasticsearch score reflects the likelihood that an image is a good match, and we had anticipated using that score as a confidence score.

While generating the image suggestions data, we gather data from Wikidata and save it in HDFS so that the search pipeline can pick it up and import it into the commonswiki search index. This data is essential for calculating the confidence score. However, we can't get a confidence score until the data is in the index, so we can't finish generating the suggestions data until we're sure the data has been imported.

To work around that, this ticket is to calculate the confidence score before the data is available in Elasticsearch. Only 1 of the 4 signals used to calculate the score is BM25-based, so it should be possible.

Related Objects

Status	Assigned	Task
Resolved	CBogen	T299781 [EPIC] Image suggestions backend
Resolved	mfossati	T296814 [EPIC] Article-level image suggestions data pipeline
Resolved	Cparle	T299789 [XL] Store a list of unillustrated articles with suggested images in hdfs
Resolved	Cparle	T301687 Calculate image suggestions confidence score without using elasticsearch

Event Timeline

Cparle created this task. · Feb 14 2022, 4:39 PM

Restricted Application added a subscriber: Aklapper. · Feb 14 2022, 4:39 PM

Cparle added a parent task: T299789: [XL] Store a list of unillustrated articles with suggested images in hdfs. · Feb 14 2022, 4:50 PM

CBogen assigned this task to Cparle. · Feb 14 2022, 5:37 PM

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog.

CBogen moved this task from Incoming to Doing on the Structured-Data-Backlog (Current Work) board.

After running queries on the labeled data, it turns out that the most reliable confidence score is simply based on the source of the match:

source of match	proportion of good images
P18	0.9787234043
lead image	0.8839907193
commons category	0.8734693878
depicts	0.7577433628
no match	0.3863076923
overall	0.4854083314

So, rounding down to the nearest 10% for safety, we're going to say:

If we match an image based on P18, the confidence score is 90%.
If we match an image based on lead image or commons category, the confidence score is 80%.
If we match an image based on depicts, the confidence score is 70%.
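
As a rough illustration of the rule above, here is a minimal Python sketch that derives the scores from the measured proportions by rounding down to the nearest 10% and then looks a score up by match source. The names `GOOD_IMAGE_PROPORTION`, `round_down_to_ten_percent` and `confidence_score` are hypothetical and are not taken from the actual pipeline code; returning `None` for an unlisted source is also an assumption, since the ticket assigns no score to that case.

```python
import math
from typing import Optional

# Proportion of good images per match source, copied from the table above.
GOOD_IMAGE_PROPORTION = {
    "P18": 0.9787234043,
    "lead image": 0.8839907193,
    "commons category": 0.8734693878,
    "depicts": 0.7577433628,
}


def round_down_to_ten_percent(proportion: float) -> int:
    """Round a proportion in [0, 1] down to a multiple of 10 percent."""
    return math.floor(proportion * 10) * 10


# Hypothetical lookup table: match source -> confidence score in percent.
CONFIDENCE_BY_SOURCE = {
    source: round_down_to_ten_percent(proportion)
    for source, proportion in GOOD_IMAGE_PROPORTION.items()
}


def confidence_score(source: str) -> Optional[int]:
    """Return the confidence score (percent) for a match source.

    Assumption: sources not listed in the table get None, because the
    ticket does not define a score for them.
    """
    return CONFIDENCE_BY_SOURCE.get(source)
```

For example, `confidence_score("lead image")` and `confidence_score("commons category")` both land in the same 80% bucket, which matches the grouping in the rule above.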

See here for the full results and analysis: https://docs.google.com/spreadsheets/d/1ZByYvEnwJyK4GwQ7fgreiJhobtueNqb8UgNxRU1t0Y4/edit#gid=1248021027

Cparle closed this task as Resolved. · Feb 23 2022, 6:45 PM
