
[L] Gather labeled data relevant to synonyms
Open, Needs TriagePublic

Description

Adding synonyms to MediaSearch (see T258053 [1]) has greatly improved recall (i.e. the number of relevant results that get returned) for non-English languages. For example, searching in Irish for "ialtóg" (bat) without synonyms gives 331 results, while searching with synonyms gives ~3800 results.

Our existing labeled data shows a minor bump in search performance when synonyms search is included, but because the labeled data is mostly for English search terms it's unlikely to capture the big difference synonyms make to non-English searches. We'd like to capture the improved recall in our labeled data - maybe a few thousand query/image/rating non-English datapoints. The simplest way to do this (a small URL-building sketch follows the list) is:

  • do a search with https://commons.wikimedia.org/w/index.php?search=YOUR_SEARCH_TERM&ns6=1&uselang=YOUR_LANGUAGE&mediasearch_synonyms
  • copy/paste the urls of some good/bad matches (ignore indifferent ones, they're not very useful) into https://media-search-signal-test.toolforge.org/bulk.html
  • tag your data with synonyms
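
A minimal Python sketch (not part of the original instructions) for building the synonym-enabled search URL from the first step above, one per term/language pair. The empty value for the mediasearch_synonyms flag and the second example term are assumptions; "ialtóg" comes from the task description.

```python
# Builds the search URL from the first step above; "mediasearch_synonyms" is a
# bare flag in the original URL, so an empty value is used here (assumption).
from urllib.parse import urlencode

BASE = "https://commons.wikimedia.org/w/index.php"

def synonym_search_url(term: str, lang: str) -> str:
    params = {"search": term, "ns6": 1, "uselang": lang, "mediasearch_synonyms": ""}
    return f"{BASE}?{urlencode(params)}"

# "ialtóg" (Irish for bat) is the example from the task description;
# "chauve-souris" (French for bat) is a hypothetical second term.
for term, lang in [("ialtóg", "ga"), ("chauve-souris", "fr")]:
    print(synonym_search_url(term, lang))
```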

Not sure how to decide which search terms to use:


[1] There's currently a problem with response times when synonyms are enabled (see T293106), but let's ignore that for the purposes of this ticket.

Event Timeline

Cparle updated the task description.
CBogen renamed this task from Gather labeled data relevant to synonyms to [L] Gather labeled data relevant to synonyms. Nov 17 2021, 5:37 PM

@Miriam and I brainstormed on the method and design of this task. Here are our general thoughts in order of importance:

  1. binary relevance judgments should be elicited, i.e., relevant / not relevant
  2. we need at least 3 judgments per (query, image) pair. This is the minimum requirement to compute agreement, thus making a robust ground truth
  3. the task interface should display a grid of images and the contributor should click on relevant ones
  4. the real-world click-through rate per (query, image) pair could be computed from the Commons search logs, based on a time span, a number of users per query, and a number of clicks per image. For instance, given the query Joey Ramone, this image was clicked 42 times by 77 users in 3 months (a sketch of this computation follows the list).
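
A minimal pandas sketch of point 4, assuming a per-impression log DataFrame with hypothetical columns query, image, user_id, clicked (0/1) and ts; the actual Commons log schema will differ.

```python
# Sketch only: column names are assumptions, not the real Commons log schema.
import pandas as pd

def ctr_per_pair(logs: pd.DataFrame, days: int = 90) -> pd.DataFrame:
    cutoff = logs["ts"].max() - pd.Timedelta(days=days)   # restrict to a time span
    recent = logs[logs["ts"] >= cutoff]
    grouped = recent.groupby(["query", "image"]).agg(
        clicks=("clicked", "sum"),        # e.g. 42 clicks on the image for the query
        users=("user_id", "nunique"),     # e.g. 77 distinct users shown this result
        impressions=("clicked", "size"),  # rows = times the image was shown
    )
    grouped["ctr"] = grouped["clicks"] / grouped["impressions"]
    return grouped.reset_index()
```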

And here's what we propose:

  1. INPUT = production Commons search logs
  2. sample queries based on traffic to get a mix of popular and rare ones
  3. for each query, sample the top K results to get a mix of likely positive and negative samples (a sampling sketch for this and the previous step follows below)
  4. for each query, display the sampled results in a grid interface
  5. let contributors click on the relevant ones
  6. ensure 3 judgments per (query, image) pair

We can adapt the currently available task interface at https://media-search-signal-test.toolforge.org/ (note that I just hacked it to make it work).
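
A rough sketch of steps 2 and 3 above, not the tool's actual code: traffic-weighted plus uniform query sampling, then the top K results per query as grid candidates. query_counts and the search callable are hypothetical stand-ins for the production logs and the Commons search backend.

```python
# Sketch only: query_counts and search() are hypothetical stand-ins.
import random

def sample_queries(query_counts: dict[str, int], n_popular: int, n_rare: int) -> list[str]:
    queries = list(query_counts)
    weights = [query_counts[q] for q in queries]
    popular = random.choices(queries, weights=weights, k=n_popular)  # traffic-weighted
    rare = random.sample(queries, k=min(n_rare, len(queries)))       # uniform, so rare queries appear
    return list(dict.fromkeys(popular + rare))                       # dedupe, keep order

def build_grid_candidates(queries: list[str], search, k: int = 20) -> dict[str, list[str]]:
    # Top-K results mix likely positives (top ranks) with likely negatives (tail of K).
    return {q: search(q, k) for q in queries}
```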

This is great @mfossati and @Miriam. I have 2 questions:

First - atm we have only 1 judgement per query/image pair. I realise that there's a trade-off between having a large-enough representative sample of images and making sure the ground truth is robust. From the work we've done in T280368 it looks like the sampled data we have currently is representative enough, so maybe it's worthwhile improving the robustness of our ground truth ... but this ticket is about gathering labeled data specifically to capture the effect of including synonyms on non-English searches. The number of judgements we can expect to get is limited, so are you guys sure that we wouldn't be better off sticking with 1 judgement per pair just to make sure our sample of synonym-relevant data is large enough?

Second - I'm not sure I understand exactly how you're proposing to get the data from search logs. Are there search logs with this info on superset perhaps?

... oh, and one other. With grid view where the user clicks on relevant images, are we going to assume that if an image is not clicked on it's not relevant?

this ticket is about gathering labeled data specifically to capture the effect of including synonyms on non-English searches. The number of judgements we can expect to get is limited, so are you guys sure that we wouldn't be better off sticking with 1 judgement per pair just to make sure our sample of synonym-relevant data is large enough?

I totally understand your concern; we should probably start by collecting 1 judgment and cover a fair volume of data. Once we have an idea of the cost, we can then estimate scaling up to more judgments.

Second - I'm not sure I understand exactly how you're proposing to get the data from search logs. Are there search logs with this info on superset perhaps?

No idea, just assuming Commons search logs live somewhere. Do you have any pointers or relevant people we can talk to?

... oh, and one other. With grid view where the user clicks on relevant images, are we going to assume that if an image is not clicked on it's not relevant?

Exactly. We should provide crystal-clear instructions, by the way.

No idea, just assuming Commons search logs live somewhere. Do you have any pointers or relevant people we can talk to?

@EBernhardson you might be able to help out here

No idea, just assuming Commons search logs live somewhere. Do you have any pointers or relevant people we can talk to?

@EBernhardson you might be able to help out here

Depends what you need. If you need the queries that were sent, we generally have those along with the results that were returned unsampled going back ~90 days in event.mediawiki_cirrussearch_request. Getting that kind of information from backend queries is quite tedious though: you need to devise ways to separate the requests you are interested in from everything else. We instrumented Special:Search and the skin autocomplete some years ago to more directly collect information about the interfaces we are interested in, but as far as I'm aware none of that was ported to Special:MediaSearch.
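
As an illustration of the digging involved, here is a hedged PySpark sketch against that table. Only the table name (and the http field mentioned later in the thread) come from this discussion; the other field names (dt, params, database) and the year/month partition columns are assumptions that would need checking against the actual event schema.

```python
# Hedged sketch: field names other than the table itself are guesses at the
# event schema and must be verified before running.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons-query-sample").getOrCreate()

candidate_requests = spark.sql("""
    SELECT dt, http, params                 -- assumed fields
    FROM event.mediawiki_cirrussearch_request
    WHERE database = 'commonswiki'          -- assumed: wiki the request hit
      AND year = 2022 AND month = 2         -- assumed Hive partition columns
""")

# Separating Special:MediaSearch traffic from bots and other API callers is the
# tedious, use-case-specific part described above.
candidate_requests.show(20, truncate=False)
```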

Thanks for the early feedback, @EBernhardson.

If you need the queries that were sent

Exactly.

we generally have those along with the results that were returned unsampled going back ~90 days in event.mediawiki_cirrussearch_request.

Nice, I've just had a quick look at it.

Getting that kind of information from backend queries is quite tedious though: you need to devise ways to separate the requests you are interested in from everything else.

Do you have any specific advice on this, apart from digging into the http column?

Thanks for the early feedback, @EBernhardson.

Getting that kind of information from backend queries is quite tedious though: you need to devise ways to separate the requests you are interested in from everything else.

Do you have any specific advice on this, apart from digging into the http column?

The problem is that the backend doesn't know where anything comes from, and various bots and UIs use generally the same API calls. However you separate things, it needs to be specific to the use case at hand, evaluating in what ways the desired requests look different from everything else. Sometimes specific UIs can be targeted through the http url if they have a distinctive query pattern, but that's not always the case. In the past this was a highly iterative process: evaluating the events that come out and figuring out if they're representative. Typically this dataset also needs to be filtered for bot activity; in the past only perhaps 25% of full-text search requests came from the UI. We've often used rough heuristics like sampling one request per IP address, or filtering IPs with > n requests per day. This can still be ineffective with some modern bots running from a wide range of cloud IP addresses; it all depends. A slightly more intensive, but reasonably effective, method is to also filter IPs that rarely visit index.php (but this requires parsing webrequests, and filters out mobile apps).
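
A minimal pandas sketch of the two rough heuristics mentioned above, assuming a request-log DataFrame with hypothetical columns ip (client IP), query, and ts (timestamp); the threshold is a placeholder, and neither heuristic catches bots spread across many cloud IPs.

```python
# Sketch only: the column names (ip, ts) and the threshold are assumptions,
# not the actual Commons log schema.
import pandas as pd

def one_request_per_ip(df: pd.DataFrame) -> pd.DataFrame:
    """Keep a single randomly chosen request per client IP."""
    return df.groupby("ip").sample(n=1)

def drop_heavy_ips(df: pd.DataFrame, max_per_day: int = 100) -> pd.DataFrame:
    """Drop all traffic from IPs exceeding max_per_day requests on any day."""
    daily = df.groupby(["ip", df["ts"].dt.date]).size()
    heavy = {ip for ip, _day in daily[daily > max_per_day].index}
    return df[~df["ip"].isin(heavy)]
```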

mfossati changed the task status from Open to In Progress. Feb 22 2022, 9:12 AM

From a completely different perspective, we may leverage the Wikidata statement ranking system as readily available ground truth. This would ideally enable automatic collection of labeled data.
For instance, if some P18 (image) statement is ranked as deprecated, that would provide a negative sample, and vice versa for preferred statements.

Caveat: this SPARQL query shows some items having deprecated P18 statements: https://w.wiki/4oTf. If we look at them with the image search use case in mind, it's not evident why they were deprecated.
Counting all of them times out on the public SPARQL endpoint (https://w.wiki/4oTg), so we may dive into the data lake again to compute the total.
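
To make the idea concrete, here is a small sketch pulling candidate negative samples (items whose P18 image statement is deprecated) from the public SPARQL endpoint; the query is written along the same lines as the linked one, the user-agent string is a placeholder, and the LIMIT keeps it to an illustrative batch since, as noted, full counts time out.

```python
# Sketch: candidate negative samples from deprecated P18 (image) statements.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item ?image WHERE {
  ?item p:P18 ?stmt .
  ?stmt wikibase:rank wikibase:DeprecatedRank ;
        ps:P18 ?image .
}
LIMIT 100
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="labeled-data-sketch/0.1 (T293878)")  # placeholder agent
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    # Each deprecated (item, image) pair is a candidate negative sample.
    print(row["item"]["value"], row["image"]["value"])
```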

As a side note, here’s the list of Wikidata properties holding Commons media: https://www.wikidata.org/wiki/Special:ListProperties/commonsMedia.
Caveat: it looks like there’s no specific data type for images, just media in general.

Another source of ground truth might be images that were added then reverted within e.g. a day?

Implementation pointers:

Recap:

  • labeled data gathering task design implemented as per T293878#7635985
  • ensuring 3 judgments per (query, image) pair is expensive. The task currently supports 1, and we can scale up after a first round
  • query terms are collected from production Commons logs, following suggestions in T293878#7665381
  • a best effort to filter NSFW terms is in place, although we should expect to still see some
  • we will start eliciting judgments internally, but we envision expanding to a broader audience
NOTE: once labeled, we believe the dataset can be a useful resource to evaluate image search engines in general. It would be great to release it under an open access license, along with a report to be submitted to relevant research venues. This can have a positive impact on researchers and practitioners.

mfossati moved this task from Doing to Blocked on the Structured-Data-Backlog (Current Work) board.

  • Marking as resolved: the implementation is done
  • moving to blocked until the labeling task reaches a satisfactory amount of judgments

Marking as resolved seems to make this task invisible in the workboard, so switching back.

Update on this ticket - looking at the data I'm not sure that what we've gathered is capturing the effect of the synonyms patch, and I think we might need to curate it more carefully.

Probably ought to pause further work on this ticket until we have a chance to consider this more closely

Update on this ticket - looking at the data I'm not sure that what we've gathered is capturing the effect of the synonyms patch, and I think we might need to curate it more carefully.

Probably ought to pause further work on this ticket until we have a chance to consider this more closely

@Cparle can you say more about why it's not capturing the effect and what curation you think we need?

Marco's sampled the search terms from the logs based on a mixture of popularity and randomness, but just looking at the sampled search terms for French, for example, very few of them match up with Wikidata labels and therefore won't have any synonyms ... and seeing that the point of this exercise is to capture the effect of the synonyms patch, we've probably been barking up the wrong tree.

A better approach might be to pick our search terms from Wikidata labels in the language we're interested in. Probably the Wikidata items ought to be connected to real wiki articles so we don't end up searching for very obscure things like An07g10070 (a gene), and we'll probably need some other criteria too, like skipping items with the same label in all languages or items that are instances of a Wikimedia category or a scholarly article, etc.
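
A hedged sketch of that term selection, again via the public SPARQL endpoint: items with a label in the target language (French here), at least one sitelink (a proxy for "connected to real wiki articles"), and not instances of Wikimedia category (Q4167836) or scholarly article (Q13442814). An unconstrained query like this may well time out publicly, so in practice it would run against a dump or the data lake; further filters (e.g. dropping labels identical across languages) would be layered on top.

```python
# Sketch only: the language code, LIMIT, and excluded classes are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

TERM_QUERY = """
SELECT ?item ?label WHERE {
  ?item rdfs:label ?label ;
        wikibase:sitelinks ?links .
  FILTER(LANG(?label) = "fr")
  FILTER(?links > 0)                                 # connected to real wiki articles
  FILTER NOT EXISTS { ?item wdt:P31 wd:Q4167836 }    # Wikimedia category
  FILTER NOT EXISTS { ?item wdt:P31 wd:Q13442814 }   # scholarly article
}
LIMIT 200
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="labeled-data-sketch/0.1 (T293878)")  # placeholder agent
sparql.setQuery(TERM_QUERY)
sparql.setReturnFormat(JSON)
terms = [row["label"]["value"]
         for row in sparql.query().convert()["results"]["bindings"]]
print(terms[:20])
```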

As a result of the conversation with @Miriam (see T293878#7635985 and T293878#7825473), we agreed to raise the bar and opt for a general-purpose task design: the main goal was to collect a real-world dataset that would enable the evaluation of multilingual image search engines in general.
We believe this can have a significant impact on the broad research community, so we ended up with a much more ambitious task.

While I agree that the current dataset is not really suited to comparing a system with synonyms VS a system without, I still think it can serve for Commons search evaluation with the synonyms feature included.
The key reason is that it's a random sample of Commons production logs, i.e., real queries made by real users.

For the sake of this task, I agree that we should fine-tune the data collection part, at the cost of less randomness and more evaluation bias.
This shouldn't require much effort: the task is there, we just have to swap out the data behind it. An easy step would be to look into the full dataset again (not the current sample displayed by the task, which is only a small portion), then come up with a more deliberate slice rather than a random one.

Still, my personal feeling is that we should target the overall effectiveness of the Commons search system for users, rather than focusing on possible recall changes due to the activation of a feature.

Still, my personal feeling is that we should target the overall effectiveness of the Commons search system for users, rather than focusing on possible recall changes due to the activation of a feature.

We already have a dataset for that though, with a queue of images to be classified and an existing classification interface - obvs more data would be good, but this ticket is really more about the synonyms stuff