
[L] Gather labeled data relevant to synonyms
Open, Needs TriagePublic

Description

Adding synonyms to MediaSearch (see T258053 [1]) has greatly improved recall (i.e. the number of relevant results that get returned) for non-English languages. For example, searching in Irish for "ialtóg" (bat) without synonyms gives 331 results, while searching with synonyms gives ~3800 results.

Our existing labeled data shows a minor bump in search performance when synonyms search is included, but because the labeled data is mostly for English search terms it's unlikely to capture the big difference synonyms make to non-English searches. We'd like to capture the improved recall in our labeled data - maybe a few thousand query/image/rating non-English datapoints. The simplest way to do this (a small URL-building sketch follows the list) is:

  • do a search with https://commons.wikimedia.org/w/index.php?search=YOUR_SEARCH_TERM&ns6=1&uselang=YOUR_LANGUAGE&mediasearch_synonyms
  • copy/paste the urls of some good/bad matches (ignore indifferent ones, they're not very useful) into https://media-search-signal-test.toolforge.org/bulk.html
  • tag your data with synonyms
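
A minimal Python sketch (not part of the original instructions) for building the synonym-enabled search URL from the first step above, one per term/language pair. The empty value for the mediasearch_synonyms flag and the second example term are assumptions; "ialtóg" comes from the task description.

```python
# Builds the search URL from the first step above; "mediasearch_synonyms" is a
# bare flag in the original URL, so an empty value is used here (assumption).
from urllib.parse import urlencode

BASE = "https://commons.wikimedia.org/w/index.php"

def synonym_search_url(term: str, lang: str) -> str:
    params = {"search": term, "ns6": 1, "uselang": lang, "mediasearch_synonyms": ""}
    return f"{BASE}?{urlencode(params)}"

# "ialtóg" (Irish for bat) is the example from the task description;
# "chauve-souris" (French for bat) is a hypothetical second term.
for term, lang in [("ialtóg", "ga"), ("chauve-souris", "fr")]:
    print(synonym_search_url(term, lang))
```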

Not sure how to decide which search terms to use:


[1] There's currently a problem with response times when synonyms are enabled (see T293106), but let's ignore that for the purposes of this ticket.

Event Timeline

Cparle updated the task description.
CBogen renamed this task from Gather labeled data relevant to synonyms to [L] Gather labeled data relevant to synonyms. Nov 17 2021, 5:37 PM

@Miriam and I brainstormed on the method and design of this task. Here are our general thoughts in order of importance:

  1. binary relevance judgments should be elicited, i.e., relevant / not relevant
  2. we need at least 3 judgments per (query, image) pair. This is the minimum requirement to compute agreement, thus making a robust ground truth
  3. the task interface should display a grid of images and the contributor should click on relevant ones
  4. the real-world click-through rate per (query, image) pair could be computed from the Commons search logs, based on a time span, a number of users per query, and a number of clicks per image. For instance, given the query Joey Ramone, this image was clicked 42 times by 77 users in 3 months (a sketch of this computation follows the list).
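
A minimal pandas sketch of point 4, assuming a per-impression log DataFrame with hypothetical columns query, image, user_id, clicked (0/1) and ts; the actual Commons log schema will differ.

```python
# Sketch only: column names are assumptions, not the real Commons log schema.
import pandas as pd

def ctr_per_pair(logs: pd.DataFrame, days: int = 90) -> pd.DataFrame:
    cutoff = logs["ts"].max() - pd.Timedelta(days=days)   # restrict to a time span
    recent = logs[logs["ts"] >= cutoff]
    grouped = recent.groupby(["query", "image"]).agg(
        clicks=("clicked", "sum"),        # e.g. 42 clicks on the image for the query
        users=("user_id", "nunique"),     # e.g. 77 distinct users shown this result
        impressions=("clicked", "size"),  # rows = times the image was shown
    )
    grouped["ctr"] = grouped["clicks"] / grouped["impressions"]
    return grouped.reset_index()
```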

And here's what we propose:

  1. INPUT = production Commons search logs
  2. sample queries based on traffic to get a mix of popular and rare ones
  3. for each query, sample the top K results to get a mix of likely positive and negative samples (a sampling sketch for this and the previous step follows below)
  4. for each query, display the sampled results in a grid interface
  5. let contributors click on the relevant ones
  6. ensure 3 judgments per (query, image) pair

We can adapt the currently available task interface at https://media-search-signal-test.toolforge.org/ (note that I just hacked it to make it work).
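
A rough sketch of steps 2 and 3 above, not the tool's actual code: traffic-weighted plus uniform query sampling, then the top K results per query as grid candidates. query_counts and the search callable are hypothetical stand-ins for the production logs and the Commons search backend.

```python
# Sketch only: query_counts and search() are hypothetical stand-ins.
import random

def sample_queries(query_counts: dict[str, int], n_popular: int, n_rare: int) -> list[str]:
    queries = list(query_counts)
    weights = [query_counts[q] for q in queries]
    popular = random.choices(queries, weights=weights, k=n_popular)  # traffic-weighted
    rare = random.sample(queries, k=min(n_rare, len(queries)))       # uniform, so rare queries appear
    return list(dict.fromkeys(popular + rare))                       # dedupe, keep order

def build_grid_candidates(queries: list[str], search, k: int = 20) -> dict[str, list[str]]:
    # Top-K results mix likely positives (top ranks) with likely negatives (tail of K).
    return {q: search(q, k) for q in queries}
```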

This is great @mfossati and @Miriam. I have 2 questions:

First - atm we have only 1 judgement per query/image pair. I realise that there's a trade-off between having a large-enough representative sample of images and making sure the ground truth is robust. From the work we've done in T280368 it looks like the sampled data we have currently is representative enough, so maybe it's worthwhile improving the robustness of our ground truth ... but this ticket is about gathering labeled data specifically to capture the effect of including synonyms on non-English searches. The number of judgements we can expect to get is limited, so are you guys sure that we wouldn't be better off sticking with 1 judgement per pair just to make sure our sample of synonym-relevant data is large enough?

Second - I'm not sure I understand exactly how you're proposing to get the data from search logs. Are there search logs with this info on superset perhaps?

... oh, and one other. With grid view where the user clicks on relevant images, are we going to assume that if an image is not clicked on it's not relevant?

this ticket is about gathering labeled data specifically to capture the effect of including synonyms on non-English searches. The number of judgements we can expect to get is limited, so are you guys sure that we wouldn't be better off sticking with 1 judgement per pair just to make sure our sample of synonym-relevant data is large enough?

I totally understand your concern; we should probably start by collecting 1 judgment and cover a fair volume of data. Once we have an idea of the cost, we can then estimate scaling up to more judgments.

Second - I'm not sure I understand exactly how you're proposing to get the data from search logs. Are there search logs with this info on superset perhaps?

No idea, just assuming Commons search logs live somewhere. Do you have any pointers or relevant people we can talk to?

... oh, and one other. With grid view where the user clicks on relevant images, are we going to assume that if an image is not clicked on it's not relevant?

Exactly. We should provide crystal-clear instructions, by the way.

No idea, just assuming Commons search logs live somewhere. Do you have any pointers or relevant people we can talk to?

@EBernhardson you might be able to help out here

No idea, just assuming Commons search logs live somewhere. Do you have any pointers or relevant people we can talk to?

@EBernhardson you might be able to help out here

Depends what you need. If you need the queries that were sent, we generally have those along with the results that were returned unsampled going back ~90 days in event.mediawiki_cirrussearch_request. Getting that kind of information from backend queries is quite tedious though: you need to devise ways to separate the requests you are interested in from everything else. We instrumented Special:Search and the skin autocomplete some years ago to more directly collect information about the interfaces we are interested in, but as far as I'm aware none of that was ported to Special:MediaSearch.
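
As an illustration of the digging involved, here is a hedged PySpark sketch against that table. Only the table name (and the http field mentioned later in the thread) come from this discussion; the other field names (dt, params, database) and the year/month partition columns are assumptions that would need checking against the actual event schema.

```python
# Hedged sketch: field names other than the table itself are guesses at the
# event schema and must be verified before running.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons-query-sample").getOrCreate()

candidate_requests = spark.sql("""
    SELECT dt, http, params                 -- assumed fields
    FROM event.mediawiki_cirrussearch_request
    WHERE database = 'commonswiki'          -- assumed: wiki the request hit
      AND year = 2022 AND month = 2         -- assumed Hive partition columns
""")

# Separating Special:MediaSearch traffic from bots and other API callers is the
# tedious, use-case-specific part described above.
candidate_requests.show(20, truncate=False)
```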

Thanks for the early feedback, @EBernhardson.

If you need the queries that were sent

Exactly.

we generally have those along with the results that were returned unsampled going back ~90 days in event.mediawiki_cirrussearch_request.

Nice, I've just had a quick look at it.

Getting that kind of information from backend queries is quite tedious though: you need to devise ways to separate the requests you are interested in from everything else.

Do you have any specific advice on this, apart from digging into the http column?

Thanks for the early feedback, @EBernhardson.

Getting that kind of information from backend queries is quite tedious though: you need to devise ways to separate the requests you are interested in from everything else.

Do you have any specific advice on this, apart from digging into the http column?

The problem is that the backend doesn't know where anything comes from, and various bots and UIs use generally the same API calls. However you separate things, it needs to be specific to the use case at hand, evaluating in what ways the desired requests look different from everything else. Sometimes specific UIs can be targeted through the http url if they have a distinctive query pattern, but that's not always the case. In the past this was a highly iterative process: evaluating the events that come out and figuring out if they're representative. Typically this dataset also needs to be filtered for bot activity; in the past only perhaps 25% of full-text search requests came from the UI. We've often used rough heuristics like sampling one request per IP address, or filtering IPs with > n requests per day. This can still be ineffective with some modern bots running from a wide range of cloud IP addresses; it all depends. A slightly more intensive, but reasonably effective, method is to also filter IPs that rarely visit index.php (but this requires parsing webrequests, and filters out mobile apps).
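
A minimal pandas sketch of the two rough heuristics mentioned above, assuming a request-log DataFrame with hypothetical columns ip (client IP), query, and ts (timestamp); the threshold is a placeholder, and neither heuristic catches bots spread across many cloud IPs.

```python
# Sketch only: the column names (ip, ts) and the threshold are assumptions,
# not the actual Commons log schema.
import pandas as pd

def one_request_per_ip(df: pd.DataFrame) -> pd.DataFrame:
    """Keep a single randomly chosen request per client IP."""
    return df.groupby("ip").sample(n=1)

def drop_heavy_ips(df: pd.DataFrame, max_per_day: int = 100) -> pd.DataFrame:
    """Drop all traffic from IPs exceeding max_per_day requests on any day."""
    daily = df.groupby(["ip", df["ts"].dt.date]).size()
    heavy = {ip for ip, _day in daily[daily > max_per_day].index}
    return df[~df["ip"].isin(heavy)]
```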

mfossati changed the task status from Open to In Progress. Feb 22 2022, 9:12 AM

From a completely different perspective, we may leverage the Wikidata statement ranking system as readily available ground truth. This would ideally enable automatic collection of labeled data.
For instance, if some P18 (image) statement is ranked as deprecated, that would provide a negative sample, and vice versa for preferred statements.

Caveat: this SPARQL query shows some items having deprecated P18 statements: https://w.wiki/4oTf. If we look at them with the image search use case in mind, it's not evident why they were deprecated.
Counting all of them times out on the public SPARQL endpoint (https://w.wiki/4oTg), so we may dive into the data lake again to compute the total.
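
To make the idea concrete, here is a small sketch pulling candidate negative samples (items whose P18 image statement is deprecated) from the public SPARQL endpoint; the query is written along the same lines as the linked one, the user-agent string is a placeholder, and the LIMIT keeps it to an illustrative batch since, as noted, full counts time out.

```python
# Sketch: candidate negative samples from deprecated P18 (image) statements.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item ?image WHERE {
  ?item p:P18 ?stmt .
  ?stmt wikibase:rank wikibase:DeprecatedRank ;
        ps:P18 ?image .
}
LIMIT 100
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="labeled-data-sketch/0.1 (T293878)")  # placeholder agent
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    # Each deprecated (item, image) pair is a candidate negative sample.
    print(row["item"]["value"], row["image"]["value"])
```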

As a side note, here’s the list of Wikidata properties holding Commons media: https://www.wikidata.org/wiki/Special:ListProperties/commonsMedia.
Caveat: it looks like there’s no specific data type for images, just media in general.

Another source of ground truth might be images that were added then reverted within e.g. a day?

Implementation pointers:

Recap:

  • labeled data gathering task design implemented as per T293878#7635985
  • ensuring 3 judgments per (query, image) pair is expensive. The task currently supports 1, and we can scale up after a first round
  • query terms are collected from production Commons logs, following suggestions in T293878#7665381
  • a best effort to filter NSFW terms is in place, although we should expect to still see some
  • we will start eliciting judgments internally, but we envision expanding to a broader audience
NOTE: once labeled, we believe the dataset can be a useful resource to evaluate image search engines in general. It would be great to release it under an open access license, along with a report to be submitted to relevant research venues. This can have a positive impact on researchers and practitioners.

mfossati moved this task from Doing to Blocked on the Structured-Data-Backlog (Current Work) board.

  • Marking as resolved: the implementation is done
  • moving to blocked until the labeling task reaches a satisfactory amount of judgments

Marking as resolved seems to make this task invisible in the workboard, so switching back.

Update on this ticket - looking at the data I'm not sure that what we've gathered is capturing the effect of the synonyms patch, and I think we might need to curate it more carefully.

Probably ought to pause further work on this ticket until we have a chance to consider this more closely

Update on this ticket - looking at the data I'm not sure that what we've gathered is capturing the effect of the synonyms patch, and I think we might need to curate it more carefully.

Probably ought to pause further work on this ticket until we have a chance to consider this more closely

@Cparle can you say more about why it's not capturing the effect and what curation you think we need?

Marco's sampled the search terms from the logs based on a mixture of popularity and randomness, but just looking at the sampled search terms for French, for example, very few of them match up with Wikidata labels and therefore won't have any synonyms ... and seeing that the point of this exercise is to capture the effect of the synonyms patch, we've probably been barking up the wrong tree.

A better approach might be to pick our search terms from Wikidata labels in the language we're interested in. Probably the Wikidata items ought to be connected to real wiki articles so we don't end up searching for very obscure things like An07g10070 (a gene), and we'll probably need some other criteria too, like skipping items with the same label in all languages or items that are instances of a Wikimedia category or a scholarly article, etc.
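
A hedged sketch of that term selection, again via the public SPARQL endpoint: items with a label in the target language (French here), at least one sitelink (a proxy for "connected to real wiki articles"), and not instances of Wikimedia category (Q4167836) or scholarly article (Q13442814). An unconstrained query like this may well time out publicly, so in practice it would run against a dump or the data lake; further filters (e.g. dropping labels identical across languages) would be layered on top.

```python
# Sketch only: the language code, LIMIT, and excluded classes are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

TERM_QUERY = """
SELECT ?item ?label WHERE {
  ?item rdfs:label ?label ;
        wikibase:sitelinks ?links .
  FILTER(LANG(?label) = "fr")
  FILTER(?links > 0)                                 # connected to real wiki articles
  FILTER NOT EXISTS { ?item wdt:P31 wd:Q4167836 }    # Wikimedia category
  FILTER NOT EXISTS { ?item wdt:P31 wd:Q13442814 }   # scholarly article
}
LIMIT 200
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="labeled-data-sketch/0.1 (T293878)")  # placeholder agent
sparql.setQuery(TERM_QUERY)
sparql.setReturnFormat(JSON)
terms = [row["label"]["value"]
         for row in sparql.query().convert()["results"]["bindings"]]
print(terms[:20])
```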

As a result of the conversation with @Miriam (see T293878#7635985 and T293878#7825473), we agreed to raise the bar and opt for a general-purpose task design: the main goal was to collect a real-world dataset that would enable the evaluation of multilingual image search engines in general.
We believe this can have a significant impact on the broad research community, so we ended up with a much more ambitious task.

While I agree that the current dataset is not really suited to comparing a system with synonyms VS a system without, I still think it can serve for Commons search evaluation with the synonyms feature included.
The key reason is that it's a random sample of Commons production logs, i.e., real queries made by real users.

For the sake of this task, I agree that we should fine-tune the data collection part, at the cost of less randomness and more evaluation bias.
This shouldn't require much effort: the task is there, we just have to swap out the data behind it. An easy step would be to look into the full dataset again (not the current sample displayed by the task, which is only a small portion), then come up with a more deliberate slice rather than a random one.

Still, my personal feeling is that we should target the overall effectiveness of the Commons search system for users, rather than focusing on possible recall changes due to the activation of a feature.

Still, my personal feeling is that we should target the overall effectiveness of the Commons search system for users, rather than focusing on possible recall changes due to the activation of a feature.

We already have a dataset for that though, with a queue of images to be classified and an existing classification interface - obvs more data would be good, but this ticket is really more about the synonyms stuff