
Evaluate adding an image quality score to media search result ranking
Closed, ResolvedPublic

Description

@Miriam in research built a demo classifying commons images by their inclusion in featured categories on commons, essentially generating a quality score. It would be interesting to evaluate this score in the context of boosting image search results. It probably shouldn't have a huge weight, but can nudge images up/down based on the quality score.

Rough outline of evaluation:

  • Collect a sample of a hundred or so media searches on commons. Hand filter to remove things that are hard to evaluate, not encyclopedic, etc. In the past this has been 10-20%.
    • Picked ~100 queries from logs, data available in hdfs:///user/dcausse/image_qual/commons_queries_handpicked.lst
  • Collect top n (1k? 8k?) results for each query into an index on relforge and run Miriam's model on the results. (Fetched 200K images using the search API.)
    • Data available in stat1005:~dcausse/commons_img_quality/preds_filtered.csv
  • Import results to relforge
  • Try something with the scores and the scoring calculation :) Score is in [0, 1] so could try something like base * (1 + 0.25 * (score - 0.5)) which gives +- 12.5% to the score?
    • used a simple weighted sum for now
  • Evaluation at this stage will mostly be human based. Use relforge software to look at how much the scores change ranking, evaluate some of the result sets it reports. Bonus points to somehow display the images in the relforge report, but could link somehow to the wmflabs instance and compare image lists there.
    • I was not able to use exactly the same profile as production on a subset of the data because the term stats are too different. I could use a simple profile with the all field, which gives similar results on relforge and production. I'm currently importing all commons files to relforge so that we can actually compare against the production profiles.
  • Super bonus points: Some simple html page with a dropdown for all the queries that hits the api and displays back an image grid for each ranker.
    • A small frontend app might be required: jsondiff.py is not well suited for this, and the default size of the thumbnail images is also too small.
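A minimal sketch of the two score combinations mentioned in the outline, assuming a relevance score `base` and a quality score in [0, 1]. The function names and the default weight of the sum are illustrative assumptions, not code from CirrusSearch or relforge.

```python
def multiplicative_boost(base: float, quality: float, strength: float = 0.25) -> float:
    """Scale the relevance score by 1 + strength * (quality - 0.5).

    With strength=0.25 and quality in [0, 1] the factor stays within
    [0.875, 1.125], i.e. at most +/- 12.5% of the base score.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality is expected to be in [0, 1]")
    return base * (1.0 + strength * (quality - 0.5))


def weighted_sum(base: float, quality: float, weight: float = 0.6) -> float:
    """Add a weighted quality score to the relevance score.

    The default weight is only an example value (hypothetical default).
    """
    return base + weight * quality


if __name__ == "__main__":
    for q in (0.0, 0.5, 1.0):
        print(q, multiplicative_boost(10.0, q), weighted_sum(10.0, q))
```

The multiplicative form keeps the change proportional to the relevance score, while the weighted sum adds the same absolute amount regardless of how high or low the relevance score is, so its impact depends heavily on the scale of `base`.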

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 20 2018, 10:37 PM

For running the inference i found it easiest to use a Makefile and split the work up into various small files. The Makefile i used to start testing is below. This essentially looks for files in commons_titles/*.txt and runs them through spark one at a time.

# Every title list in commons_titles/ gets a .completed marker once processed.
SRC = $(wildcard commons_titles/*.txt)
TARGET = $(SRC:.txt=.txt.completed)

.PHONY: all
all: $(TARGET)

# Run one title file through Spark. The marker is only touched when
# spark-submit succeeds, so failed files are retried on the next `make`.
commons_titles/%.txt.completed : commons_titles/%.txt
	PYSPARK_DRIVER_PYTHON=venv/bin/python \
		PYSPARK_PYTHON=venv.zip/bin/python \
		/usr/lib/spark2/bin/spark-submit \
		--master yarn \
		--archives venv.zip \
		--conf spark.executor.memoryOverhead=4096 \
		--conf spark.dynamicAllocation.maxExecutors=50 \
		classify_image_quality_spark.py \
		--image_titles $< \
		--outfile $<-results
	touch $@

You can grab venv.zip from the same directory as the model. It can also be recreated by making a python3 virtualenv, installing tensorflow and requests, and then zipping that up.

@EBernhardson Thanks for putting this together, if you need help, please let me know! I would also be happy to know more about the evaluation environment / instructions given to human evaluators.
CC @DarTar for visibility

dcausse updated the task description. Aug 27 2018, 7:32 PM
dcausse updated the task description. Aug 27 2018, 7:36 PM

@Miriam: evaluation is done by us for now, it's just the very first step to adjust the knobs to something reasonable. If we manage to find a good balance between the quality score and the relevance score we should discuss the next steps.
Thanks!

dcausse updated the task description. Aug 30 2018, 2:05 PM
dcausse updated the task description. Sep 3 2018, 12:45 PM

Stats comparing all field * templates_boost vs all field + 0.6*quality score (on a subset of commons)

Metrics:

Query Count: 90
Zero Results Rate: 0.0%
Poorly Performing Percentage: 0.0%
Top 1 Unsorted Results Differ: 76.7%
Top 3 Sorted Results Differ: 97.8%
Top 3 Unsorted Results Differ: 96.7%
Top 5 Sorted Results Differ: 97.8%
Top 5 Unsorted Results Differ: 95.6%
Top 20 Sorted Results Differ: 100.0%
Top 20 Unsorted Results Differ: 93.3%

Production templates are:

  • Template:Assessments/commons/featured (weight 2.5, 11K images)
  • Template:Picture of the day (weight 1.5, 5K images)
  • Template:Valued image (weight 1.75, 27K images)
  • Template:Assessments (weight 1.5, 23K images)
  • Template:Quality image (weight 1.75, 187K images)

The impact is huge.

I'll have to re-adjust the weights/formula once the full import is done. In the meantime I'll work on a better UI to display the changes.

Using the formula suggested by Erik (a boosting factor 1 + 0.25 * (score-0.5)) I get fewer differences:

Metrics:
   Query Count: 90
   Zero Results Rate: 0.0%
   Poorly Performing Percentage: 0.0%
   Top 1 Unsorted Results Differ: 42.2%
   Top 3 Sorted Results Differ: 87.8%
   Top 3 Unsorted Results Differ: 73.3%
   Top 5 Sorted Results Differ: 93.3%
   Top 5 Unsorted Results Differ: 77.8%
   Top 20 Sorted Results Differ: 98.9%
   Top 20 Unsorted Results Differ: 87.8%
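
For reference, a minimal sketch of how "Top N Sorted/Unsorted Results Differ" style numbers can be computed. It assumes that "sorted" compares the ordered top-N lists and "unsorted" compares them as sets; this is my reading of the metric names, not the actual relforge code, and `differ_rate` is a hypothetical helper.

```python
from typing import Dict, List


def differ_rate(
    baseline: Dict[str, List[str]],
    candidate: Dict[str, List[str]],
    n: int,
    ordered: bool,
) -> float:
    """Percentage of queries whose top-n results differ between two rankers.

    ordered=True  -> the top-n lists must match position by position.
    ordered=False -> only the set of top-n results has to match.
    """
    queries = baseline.keys() & candidate.keys()
    if not queries:
        return 0.0
    changed = 0
    for query in queries:
        a, b = baseline[query][:n], candidate[query][:n]
        if ordered:
            changed += a != b
        else:
            changed += set(a) != set(b)
    return 100.0 * changed / len(queries)


if __name__ == "__main__":
    old = {"q1": ["a", "b", "c"], "q2": ["a", "b", "c"]}
    new = {"q1": ["b", "a", "c"], "q2": ["a", "b", "d"]}
    print(differ_rate(old, new, 3, ordered=True))   # 100.0
    print(differ_rate(old, new, 3, ordered=False))  # 50.0
```

Under this reading, a high "sorted" rate with a lower "unsorted" rate means the quality signal mostly reorders results within the top N rather than pulling in new ones.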

UI to visualize: https://commons-defaults-relforge.wmflabs.org/myw/comp.html (may not work well with safari)

Remarks:

  • there seems to be a bias towards wide aspect ratios, which may rank images such as website banners too high (http://commons-defaults-relforge.wmflabs.org/myw/comp.html#ted%20kennedy)
  • knowing the type of media the user wants is not trivial and is probably something we should try to detect before using aesthetic quality of images
    • for a query like paris map, users may prefer maps over high quality images.

I made a minor change to the comparison to put more pictures on screen at once (added .grid to filename as well): http://commons-defaults-relforge.wmflabs.org/myw/comp.grid.html#dangerous%20animal

Will have to find some time to look over the changes some more. Indeed it looks like a mixed bag at a quick glance.

Miriam added a comment.EditedSep 4 2018, 6:48 PM

@dcausse @EBernhardson thanks for this!
Q: Which metrics should I have in mind when I look at results? E.g. when should I consider a result bad, vs very good?
Not sure if that helps, but for the Wikidata image rankings I got the best results when I was re-ranking by quality score the top-X (20 or any small X) relevant images.
Thanks!

EBernhardson added a comment.EditedSep 4 2018, 6:54 PM

I'm not really sure how to evaluate these with any rigor. This is our first time around doing anything with images, and actually the main thrust of this task is to start building our familiarity and processes around improving image search. Which is to say we need to come up with a way, try it out, and iterate on it. I don't have any suggestions yet, but will think on it.

@EBernhardson from my previous work on image search, there are two ways to evaluate search results:

  • Quantitatively: based on historical data, you see how well your ranking reflects the clickthrough rate on images returned for the same query. You can also use historical data to tune your parameters, e.g. the weight given to the quality score.
  • Qualitatively: this involves manually looking at the results.
    • First, you define evaluation metrics (m1, m2), say, overall relevance and quality of the top-10. You give a weight of importance to each metric: say, relevance is twice as important as quality (w_m1=2, w_m2=1).
    • You then manually look at individual query results from each method, and give scores for each metric, say, for query X and method A, quality is 8 and relevance is 5 (s_m1(X,A)=5; s_m2(X,A)=8).
    • You compute the goodness of method A for a query X as the sum of all metric scores multiplied by their weights: goodness(X,A) = s_m1(X,A)*w_m1 + s_m2(X,A)*w_m2.
    • You then average this 'goodness' score over all queries, and you have a single score to define the accuracy of a method.
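
A minimal sketch of the qualitative scoring described above, assuming two metrics (relevance weighted 2, quality weighted 1) and hand-assigned scores per query. The names `goodness` and `method_score` and the example scores for the second query are illustrative assumptions.

```python
from statistics import mean
from typing import Mapping

# Metric weights: relevance counts twice as much as quality.
WEIGHTS = {"relevance": 2.0, "quality": 1.0}


def goodness(scores: Mapping[str, float],
             weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted sum of the metric scores given to one method for one query."""
    return sum(weights[metric] * scores[metric] for metric in weights)


def method_score(per_query: Mapping[str, Mapping[str, float]],
                 weights: Mapping[str, float] = WEIGHTS) -> float:
    """Average goodness over all graded queries for one method."""
    return mean(goodness(scores, weights) for scores in per_query.values())


if __name__ == "__main__":
    # Hand-assigned grades for one method; the second query is made up.
    method_a = {
        "query X": {"relevance": 5, "quality": 8},
        "query Y": {"relevance": 6, "quality": 4},
    }
    print(goodness(method_a["query X"]))  # 2*5 + 1*8 = 18.0
    print(method_score(method_a))         # mean(18.0, 16.0) = 17.0
```

Comparing `method_score` across rankers then gives a single number per method, at the cost of hiding per-query disagreements, so it is worth keeping the per-query values around as well.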

Hope this helps!

Change 458720 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Add relforge settings for commons image quality testing

https://gerrit.wikimedia.org/r/458720

@Miriam thanks for your suggestions!

  • Analyzing the performance vs our click data will require running your model against many more images, and I wonder if it would be simpler to just run the model against all images on commons.
  • Qualitatively: thanks, we have been looking at rescoring the top-8000 by mixing the relevance and the quality score, we can try to rescore only the top-10. I'll look into setting up such a profile.

Change 458731 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikimedia/discovery/relevanceForge@master] Add small UI tool to compare image results using relforge.

https://gerrit.wikimedia.org/r/458731

Change 458720 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add relforge settings for commons image quality testing

https://gerrit.wikimedia.org/r/458720

Change 458731 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Add small UI tool to compare image results using relforge.

https://gerrit.wikimedia.org/r/458731

Analyzing a few queries from commons and manually tuning relevance signals using relforge tools for the first time was very insightful. The goal was to try to evaluate the benefit of adding a query-independent signal based on aesthetic quality computed by a TensorFlow model provided by Miriam Redi.

Data set
The queries were hand-picked from a random sample of CirrusSearchRequestSet queries, restricted to IPs that performed fewer than 30 queries per day.
I filtered out:

  • PII
  • Non-English queries
  • Queries unrelated to images
  • NSFW

The final set had 90 queries.

Image set
The set of images was fetched using the search API: 194,213 images were properly analyzed out of a total of 208,263 requested.
Errors were mostly due to GIFs that are not supported by the current model.

Test index
In order to have equivalent ranking between production and the relforge index I had to fully import the commons file index. The quality score of the 194K images was injected at import time using relforge tools.

Tuning
Finding the best way to combine scores in IR is always a bit complex and I decided for this first approach to use a simple boosting factor:
rel_score * (img_qual_score+1)^weight
With a default weight at 1.
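
A minimal sketch of this boosting factor, assuming the quality score is in [0, 1]. The function name is illustrative; in the actual setup the formula lives in the search profile, not in Python.

```python
def boosted_score(rel_score: float, img_qual_score: float, weight: float = 1.0) -> float:
    """rel_score * (img_qual_score + 1) ** weight.

    With weight=1 the factor ranges from 1 (quality 0) to 2 (quality 1),
    so the quality score can at most double the relevance score.
    With weight=0 the quality score is ignored.
    """
    return rel_score * (img_qual_score + 1.0) ** weight


if __name__ == "__main__":
    for quality in (0.0, 0.5, 1.0):
        print(quality, boosted_score(4.0, quality))
```

Because the factor is never below 1, this form only ever boosts: a low quality score leaves the relevance score untouched instead of demoting the image.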

UI Setup

  • A very small UI demo has been set up at:

https://commons-defaults-relforge.wmflabs.org/myw/comp.html

Very little effort has been put into the UI.

Observations
Relevance is problematic on commons and evaluating a new ranking signal in these conditions is nearly impossible. This new signal is not supposed to bring more relevant images but only images of better quality but it was found in some examples that results are actually more relevant (e.g. https://commons-defaults-relforge.wmflabs.org/myw/comp.grid.html#Wildlife%20of%20India). This means that the relevancy is so bad originally that it’s beaten by a signal that has no relationship with the user query.

Looking at results it was not really clear how we could write guidelines for users to assess the results and allow us to decide whether or not such combination of scores may be beneficial for commons: if we ask graders to only focus on quality they may forget relevancy. Asking them to select the result they would have chosen may overweight relevancy and would result in evaluating the relevance score instead of the quality score.
To address this, Miriam suggested limiting the scope of the evaluation by rescoring only the top-10/top-5: by varying the weight, we could then have asked graders to choose between two result sets. The advantage of this technique is that we assess only the ordering of 5/10 elements, which are common to both result sets. We did not pursue this technique due to technical limitations of the rescoring API provided by elasticsearch (it’s not currently possible to rescore the final result set, only the shard results).
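
A minimal sketch of what rescoring only the top-N of the final result set would look like if done client-side, outside elasticsearch. It assumes each hit carries a relevance score and a quality score in [0, 1]; the `Hit` tuple layout and `rescore_top_n` are hypothetical names, and the boost reuses the `rel_score * (quality + 1) ^ weight` form.

```python
from typing import List, Tuple

# (title, relevance score, quality score in [0, 1]); layout is illustrative.
Hit = Tuple[str, float, float]


def rescore_top_n(hits: List[Hit], n: int = 10, weight: float = 1.0) -> List[Hit]:
    """Reorder only the first n hits by rel_score * (quality + 1) ** weight.

    Hits beyond position n keep their original order, so both rankings
    contain exactly the same n items at the top, only ordered differently.
    """
    head = sorted(
        hits[:n],
        key=lambda hit: hit[1] * (hit[2] + 1.0) ** weight,
        reverse=True,
    )
    return head + hits[n:]


if __name__ == "__main__":
    results = [("a", 10.0, 0.0), ("b", 9.0, 0.9), ("c", 8.0, 0.1), ("d", 1.0, 1.0)]
    print([title for title, _, _ in rescore_top_n(results, n=2)])
    # ['b', 'a', 'c', 'd']
```

Graders comparing `rescore_top_n(results, n, w1)` against `rescore_top_n(results, n, w2)` would then judge pure ordering preferences, since the set of displayed images is identical.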

Another problem identified is that, even when selecting only queries related to an image search, it was not always evident that quality was a valid signal for such queries, e.g. searching for flags, symbols, maps is quite common (e.g. https://commons-defaults-relforge.wmflabs.org/myw/comp.grid.html#map%20united%20states).

Relatedly, how would we address this similar problem when keeping non-image queries (e.g. queries looking for PDFs, original book scripts, videos, audio)? Unlike many search engines we do not require the user to select the media type before searching.

Conclusions
Despite the problems that prevented us from running a full and objective evaluation, I feel confident that image quality is a valid ranking signal for image search. This idea is supported by the fact that image quality has been used to re-rank search on commons since the early days of CirrusSearch (c.f. https://commons.wikimedia.org/w/index.php?title=MediaWiki:Cirrussearch-boost-templates&oldid=114897827). This image quality score could also improve the coverage of the current technique that uses templates:

  • Template:Assessments/commons/featured (weight 2.5, 11K images)
  • Template:Picture of the day (weight 1.5, 5K images)
  • Template:Valued image (weight 1.75, 27K images)
  • Template:Assessments (weight 1.5, 23K images)
  • Template:Quality image (weight 1.75, 187K images)

Assuming no overlap, this is a theoretical max of 253K tagged images, and given that commons currently hosts more than 49 million images, this is only about 0.5% of the images. It seems wise to have an automatic method to increase the coverage of the current re-ranking technique.
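
The coverage arithmetic, spelled out as a small sketch. The counts are the rounded figures from the template list above, and 49 million is used as a lower bound for the number of images on commons.

```python
# Rounded image counts per boosted template, taken from the list above.
TEMPLATE_COUNTS = {
    "Template:Assessments/commons/featured": 11_000,
    "Template:Picture of the day": 5_000,
    "Template:Valued image": 27_000,
    "Template:Assessments": 23_000,
    "Template:Quality image": 187_000,
}
TOTAL_IMAGES = 49_000_000  # lower bound: "more than 49 million"


def coverage_percent(counts: dict, total: int) -> float:
    """Upper bound on coverage, assuming no image carries two templates."""
    return 100.0 * sum(counts.values()) / total


if __name__ == "__main__":
    print(sum(TEMPLATE_COUNTS.values()))  # 253000
    print(round(coverage_percent(TEMPLATE_COUNTS, TOTAL_IMAGES), 2))  # 0.52
```

Since templates do overlap in practice and the total is a lower bound, the real coverage is lower than this figure.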

The UI issue (the user is not required to select the expected media type) might be solved by upcoming evolutions of the Search UI carried out by the SDoC project. This may greatly simplify the integration of such scores, as we won’t have mixed content types.
We may also want to evaluate whether there are simple NLP techniques that could help identify simple use cases where image quality is not relevant (maps, schemas, diagrams, symbol searches ...).

Minor caveats

debt closed this task as Resolved.Oct 5 2018, 4:01 PM