[EPIC] Experiment with incorporating Image Matching Algorithm data into commons search index
Closed, Resolved · Public

Assigned To

Authored By

	Cparle
	May 27 2021, 8:02 PM

Description

The PET would like to have one data source for the image-suggestion-api, rather than the two we have at the moment. Our options are to incorporate the search signal data into the output from research's data pipeline, or to incorporate research's data into the search indices on commons.

This is a proposal for an experiment to investigate whether the latter will give us acceptable results for image recommendations.

At the moment, the Image Matching Algorithm returns the following as suggested images for an article:

the image for the wikidata item corresponding to the article
images from the commons category associated with the wikidata item corresponding to the article
lead images from articles in other wikis that correspond to the same wikidata item

I propose that we inject the above three pieces of data as property-value pairs into the weighted_tags field:

image.linked.from.wikidata.P18 can store any wikidata item for which this image is the primary image
image.linked.from.wikidata.P373 can store the wikidata item for any commons category that the image belongs to
image.linked.from.wikidata.sitelink can store wikidata item ids for any wiki page this image is included on

For example, this image of the Mona Lisa would gain the following elements in its weighted_tags array:

image.linked.from.wikidata.P18/Q12418
image.linked.from.wikidata.P18/Q13369744
image.linked.from.wikidata.sitelink/Q12418
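
As a rough illustration of the tag format proposed above, the following Python sketch assembles the `weighted_tags` entries for one image from the three data sources. The function name and its inputs are hypothetical and only show the `prefix/value` shape; this is not the code that populates the index.

```python
from typing import Iterable, List

# Common prefix of the three proposed tag families (from the proposal above).
PREFIX = "image.linked.from.wikidata"


def build_weighted_tags(
    p18_items: Iterable[str],
    p373_items: Iterable[str],
    sitelink_items: Iterable[str],
) -> List[str]:
    """Build 'prefix/value' strings for one image (hypothetical helper).

    p18_items:      wikidata items for which the image is the primary image
    p373_items:     wikidata items for commons categories containing the image
    sitelink_items: wikidata items of wiki pages that include the image
    """
    tags: List[str] = []
    sources = (("P18", p18_items), ("P373", p373_items), ("sitelink", sitelink_items))
    for kind, items in sources:
        for qid in items:
            tag = f"{PREFIX}.{kind}/{qid}"
            if tag not in tags:  # skip duplicates, keep first-seen order
                tags.append(tag)
    return tags


# Inputs chosen to reproduce the Mona Lisa example from the description.
mona_lisa_tags = build_weighted_tags(
    p18_items=["Q12418", "Q13369744"],
    p373_items=[],
    sitelink_items=["Q12418"],
)
print("\n".join(mona_lisa_tags))
```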

The experiment:

Create a new search index (on relforge) based on a commons dump, and enhanced with the data described above (T286562)
Incorporate searching for the new data into Search/ASTQueryBuilder/WordsQueryNodeHandler.php or its successor, using a new search profile (T286563)
Tune the new search profile based on a training dataset extracted from our labeled data
Calculate a balanced accuracy score for the labeled data we already have from the existing image-recommendations API via image-recommendation-test
Run equivalent searches on the new search index, and calculate the balanced accuracy score for that
Iterate on tuning if necessary
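
The search step in the list above might look roughly like the following sketch, which builds an Elasticsearch-style request body that matches the proposed tags for a given wikidata item. The real search profile lives in PHP and its tuned weights are not given in this task, so the function name and the boost values here are purely illustrative assumptions.

```python
from typing import Dict, Optional

# Hypothetical per-tag boosts; the tuned values are not stated in this task.
DEFAULT_BOOSTS: Dict[str, float] = {"P18": 3.0, "P373": 1.0, "sitelink": 2.0}


def build_tag_query(qid: str, boosts: Optional[Dict[str, float]] = None) -> dict:
    """Return a bool/should query over the weighted_tags field (sketch only)."""
    boosts = DEFAULT_BOOSTS if boosts is None else boosts
    should = [
        {
            "term": {
                "weighted_tags": {
                    "value": f"image.linked.from.wikidata.{kind}/{qid}",
                    "boost": boost,
                }
            }
        }
        for kind, boost in boosts.items()
    ]
    # An image matches if it carries at least one of the tags for this item.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


query = build_tag_query("Q12418")
```

Tuning the profile would then amount to adjusting the boosts against the training dataset and re-running the evaluation.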

Expected outcome:

If the balanced accuracy score using the new search index is better than, or at least no worse than, the score from the existing image-recommendations API, then we can make a case for injecting the new data into the production commons indices (probably using events), and using the new search profile in production.
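
For reference, balanced accuracy is the mean of the true positive rate and the true negative rate. The sketch below shows the comparison under the assumption that each labeled (article, image) pair carries a boolean human rating and that a "prediction" means the system suggested that image; the function names and the sample data are made up for illustration.

```python
from typing import Sequence


def balanced_accuracy(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def new_index_is_acceptable(baseline: float, candidate: float) -> bool:
    """'Better or at least no worse' than the existing API."""
    return candidate >= baseline


# Made-up sample: four labeled pairs, rated good/bad by humans.
labels = [True, True, False, False]
api_predictions = [True, False, False, False]    # existing API (illustrative)
search_predictions = [True, True, True, False]   # new search index (illustrative)

baseline_score = balanced_accuracy(labels, api_predictions)
candidate_score = balanced_accuracy(labels, search_predictions)
acceptable = new_index_is_acceptable(baseline_score, candidate_score)
```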

Related Objects

Status	Assigned	Task
Resolved	Cparle	T283869 [EPIC] Experiment with incorporating Image Matching Algorithm data into commons search index
Resolved	Cparle	T286562 [XL] Create new index on relforge incorporating Image Matching Algorithm data
Resolved	Cparle	T286563 [M] Create new search profile for commons that uses weighted_tags
Resolved	Cparle	T286565 [L] Compare accuracy of MediaSearch using the new weighted_fields data to data returned from Image Matching Algorithm

Event Timeline

Cparle created this task. (May 27 2021, 8:02 PM)

Restricted Application added a subscriber: Aklapper. (May 27 2021, 8:02 PM)

Cparle updated the task description. (May 27 2021, 8:18 PM)

dcausse subscribed. (Jun 8 2021, 2:57 PM)

Cparle updated the task description. (Jun 8 2021, 4:09 PM)

Cparle updated the task description.

EBernhardson subscribed. (Jun 8 2021, 5:34 PM)

Cparle renamed this task from "Proposal to incorporate Image Matching Algorithm data into commons search index" to "Experiment with incorporating Image Matching Algorithm data into commons search index". (Jun 11 2021, 10:58 AM)

Cparle updated the task description.

Cparle updated the task description. (Jun 11 2021, 4:54 PM)

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog. (Jun 14 2021, 4:49 PM)

CBogen moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board. (Jun 14 2021, 5:04 PM)

CBogen changed the subtype of this task from "Spike" to "Task". (Jun 30 2021, 4:41 PM)

Cparle updated the task description. (Jul 13 2021, 2:02 PM)

Cparle updated the task description. (Jul 13 2021, 2:27 PM)

CBogen renamed this task from "Experiment with incorporating Image Matching Algorithm data into commons search index" to "[EPIC] Experiment with incorporating Image Matching Algorithm data into commons search index". (Jul 14 2021, 4:08 PM)

CBogen moved this task from Ready for Estimation to Epics on the Structured-Data-Backlog (Current Work) board.

CBogen added a project: Epic.

Miriam mentioned this in T287583: Research Support for Image Suggestion Algorithm Deployment. (Jul 28 2021, 2:00 PM)

dcausse updated the task description. (Sep 2 2021, 3:58 PM)

CBogen mentioned this in T283865: [XL] Estimate coverage of image suggestions at different confidence levels. (Oct 4 2021, 5:00 PM)

Cparle updated the task description. (Oct 8 2021, 4:45 PM)

Cparle closed subtask T286562: [XL] Create new index on relforge incorporating Image Matching Algorithm data as Resolved. (Oct 8 2021, 4:54 PM)

@dcausse and @EBernhardson: as discussed at our meeting earlier today, please let us know when you've been able to investigate whether there's enough space for this bigger index. Please post the results in this ticket. Thanks!

Cparle closed subtask T286565: [L] Compare accuracy of MediaSearch using the new weighted_fields data to data returned from Image Matching Algorithm as Resolved. (Dec 21 2021, 12:51 PM)

Cparle closed subtask T286563: [M] Create new search profile for commons that uses weighted_tags as Resolved. (Jan 7 2022, 5:05 PM)

mfossati mentioned this in T299343: Requesting access to analytics clients for mfossati. (Jan 17 2022, 12:34 PM)

mfossati subscribed. (Jan 17 2022, 1:56 PM)

CBogen added a subtask: T296814: [EPIC] Article-level image suggestions data pipeline. (Jan 18 2022, 3:23 PM)

@CBogen the "experiment" part of this is done, and we've moved onto production-izing it. Should we close this epic? Do we need another one to cover the production-ization?

In T283869#7631684, @Cparle wrote:

@CBogen the "experiment" part of this is done, and we've moved onto production-izing it. Should we close this epic? Do we need another one to cover the production-ization?

@Cparle I think a new epic about productionizing this work would be great, do you mind creating one? Then we can move the production-related tickets over from this epic to that one and close this out.

Cparle mentioned this in T299781: [EPIC] Image suggestions backend. (Jan 21 2022, 5:15 PM)

Cparle removed a subtask: T296814: [EPIC] Article-level image suggestions data pipeline. (Jan 21 2022, 6:12 PM)

Epic to productionize this work: T299781

The experimental stage is complete, so closing this epic.
