@Miriam in Research built a demo that classifies Commons images by whether they appear in featured categories on Commons, essentially generating a quality score. It would be interesting to evaluate this score as a boosting signal for image search results. It probably shouldn't carry much weight, but it could nudge images up or down based on the quality score.
Rough outline of evaluation:
[ ] Collect a sample of a hundred or so media searches on Commons. Hand-filter to remove queries that are hard to evaluate, not encyclopedic, etc. In the past this has removed 10-20% of the sample.
[ ] Collect the top n (1k? 8k?) results for each query into an index on relforge. This can probably be done by issuing a reindex request to relforge for each query, injecting our search query into the reindex source.
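A minimal sketch of building that reindex request. The index names and the query shape are assumptions, not the production configuration; also note that a plain `_reindex` copies matching documents without guaranteeing relevance order, so this may need a rescore-aware variant.

```python
import json

# Hypothetical index names; the real relforge index layout will differ.
SOURCE_INDEX = "commonswiki_file"
DEST_INDEX = "quality_eval"

def build_reindex_body(query, top_n=1000):
    """Build an Elasticsearch _reindex request body that copies documents
    matching `query` from the source index into an evaluation index.
    "size" caps how many docs are copied (newer Elasticsearch versions
    call this "max_docs")."""
    return {
        "size": top_n,
        "source": {
            "index": SOURCE_INDEX,
            "query": {"query_string": {"query": query}},
        },
        "dest": {"index": DEST_INDEX},
    }

print(json.dumps(build_reindex_body("sunset lighthouse", top_n=8000), indent=2))
```

The body would then be POSTed to the relforge cluster's `_reindex` endpoint, once per query.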
[ ] Collect the titles of all of those docs into some number of files, one title per line, and ship them over to the analytics cluster.
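The title export could be sketched like this. The file naming and chunk size are arbitrary assumptions, chosen only so the files stay a manageable size for the scoring job:

```python
def write_title_chunks(titles, prefix="titles", chunk_size=50000):
    """Write one title per line, split across numbered files
    (titles_000.txt, titles_001.txt, ...). Returns the paths written."""
    paths = []
    for i in range(0, len(titles), chunk_size):
        path = f"{prefix}_{i // chunk_size:03d}.txt"
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(titles[i:i + chunk_size]) + "\n")
        paths.append(path)
    return paths
```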
[ ] Run the model on Spark against all the titles. Script: P7468; model: stat1005.eqiad.wmnet:~ebernhardson/miriam_quality_model/output_graph_new.pb
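The per-partition work of that Spark job looks roughly like the sketch below. Here `model_fn` and `page_id_for` are hypothetical stand-ins for the frozen TensorFlow graph (output_graph_new.pb) and a title-to-page_id lookup; see P7468 for the actual script. The rows match the CSV schema the model emits.

```python
def score_titles(titles, model_fn, page_id_for):
    """Score each title, emitting (page_id, title, score, error_message)
    rows. On failure the score column is NaN and error_message is set."""
    rows = []
    for title in titles:
        try:
            rows.append((page_id_for(title), title, model_fn(title), ""))
        except Exception as exc:
            rows.append((page_id_for(title), title, float("nan"), str(exc)))
    return rows
```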
[ ] Import the results to relforge. The model's output is a CSV with 4 columns: page_id, title, score, error_message. When error_message is set, score is NaN.
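One way to do the import is to turn the CSV into Elasticsearch bulk-update lines. The destination index and the `quality_score` field name are assumptions for illustration; rows with an error_message (NaN score) are simply skipped.

```python
import csv
import io
import json

def csv_to_bulk_updates(csv_text, index="quality_eval"):
    """Convert the model's CSV output (page_id, title, score, error_message,
    no header assumed) into newline-delimited Elasticsearch bulk-update
    actions that set a quality_score field on each doc."""
    lines = []
    for page_id, title, score, error in csv.reader(io.StringIO(csv_text)):
        if error:  # score is NaN when error_message is set
            continue
        lines.append(json.dumps({"update": {"_index": index, "_id": page_id}}))
        lines.append(json.dumps({"doc": {"quality_score": float(score)}}))
    return "\n".join(lines) + "\n"
```

The resulting string can be POSTed to the cluster's `_bulk` endpoint.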
[ ] Try something with the scores and the scoring calculation :) The score is in [0, 1], so we could try something like `base * (1 + 0.25 * (score - 0.5))`, which adjusts the base score by up to ±12.5%.
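That formula works out as follows: a quality score of 0.5 is neutral, 1.0 gives the maximum boost, and 0.0 the maximum demotion. The 0.25 multiplier is just the starting guess from above and would be tuned during evaluation.

```python
def rescore(base, quality):
    """Nudge a base relevance score by the [0, 1] quality score:
    0.5 is neutral, 1.0 boosts by 12.5%, 0.0 demotes by 12.5%."""
    return base * (1 + 0.25 * (quality - 0.5))

print(rescore(10.0, 0.5))  # 10.0  (neutral: score unchanged)
print(rescore(10.0, 1.0))  # 11.25 (+12.5%)
print(rescore(10.0, 0.0))  # 8.75  (-12.5%)
```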
[ ] Evaluation at this stage will mostly be human-based. Use the relforge software to look at how much the quality score changes rankings, and evaluate some of the result sets it reports. Bonus points for somehow displaying the images in the relforge report, but we could also link to the wmflabs instance and compare image lists there.
[ ] Super bonus points: a simple HTML page with a dropdown of all the queries that hits the API and displays an image grid for each ranker.