At the moment, when we implement what we hope is an improvement to MediaSearch, we run the analysis in https://github.com/cormacparle/media-search-signal-test, check whether the numbers have improved, and if they have we dither about whether to also run an A/B test before making the new implementation the default
There are a few issues with this, the main ones being:
1. we're not sure how good/representative/sensitive our labeled data is
* https://phabricator.wikimedia.org/T280368
2. we don't have a great understanding of how to measure the quality of a set of search results ... at the moment we only really use precision, but we should probably also be considering metrics like [[ https://en.wikipedia.org/wiki/Discounted_cumulative_gain | DCG ]]
3. we're not sure how well the search-satisfaction score the A/B tests use reflects user satisfaction with search
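For issue 2, a minimal sketch of what DCG and its normalised form look like in practice (nothing here is from our codebase; it assumes we have graded relevance labels per result, with higher meaning more relevant):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance, log2 position discount,
    so relevant results count for more the higher up they appear."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalised by the ideal (descending-relevance) ordering,
    giving a score in [0, 1] that is comparable across queries."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Two result sets containing the same results (so identical precision)
# get different nDCG, because nDCG rewards putting good results first.
good_ordering = ndcg([3, 2, 0, 1])
bad_ordering = ndcg([0, 1, 2, 3])
```

This is the sort of thing precision misses: precision treats a result set as a bag, while DCG is sensitive to ordering.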
Here are some ideas for how to address the issues:
Labeled data
---
We have ~10k labeled image/search-term pairs, drawn from a corpus of ~70M images. All of that data has been used to train our search algorithms.
We don't know how much labeled data we need to reasonably represent the total corpus of images. Statistical techniques to determine this might exist; we don't know them, but we could investigate.
We don't have a dedicated set of independent test data to see how well searches perform independently of the training data. We ought to gather one.
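One standard starting point (an assumption on our part, not something we've validated for this corpus): treat a metric like precision as a proportion and use the normal-approximation confidence interval to ask how many labeled pairs we'd need for a given margin of error. A sketch:

```python
import math

def required_sample_size(margin_of_error, confidence_z=1.96, p=0.5):
    """Labeled pairs needed to estimate a proportion (e.g. precision)
    to within +/- margin_of_error at the given confidence level,
    using the normal approximation. p=0.5 is the worst case
    (largest variance), so this is a conservative estimate."""
    return math.ceil((confidence_z ** 2) * p * (1 - p) / margin_of_error ** 2)

# e.g. estimating an overall precision figure to within +/-5% at 95% confidence
n = required_sample_size(0.05)
```

Under these assumptions our ~10k pairs are plenty for a single overall precision estimate, but this says nothing about how well the pairs cover the ~70M-image corpus or the space of real search terms; representativeness is a separate question that the proposed investigation would need to answer.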
====Proposal====
1. investigate previous research on selecting representative data
2. gather the data we need (training data and test data). Probably as part of this we should create a new repo, seeded with the test data we already have from https://github.com/cormacparle/media-search-signal-test, that's designed specifically to gather, classify and test image/search-term data on a long-term basis
* T280245
Measuring the quality of search result sets
---
Our current analysis tools are fairly rough-and-ready.
There are tweaks in our code for specific edge cases, like search terms that match many Wikidata items, or multi-word search strings. We have manually verified that these work for a small number of test cases, but we have no general way of running experiments on small changes like these to check that they haven't made things worse elsewhere, or of making sure that they don't get undone by subsequent changes.
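A sketch of the kind of general regression check we're missing. It assumes the labeled data can be loaded as a mapping from search term to the set of relevant image ids, and that we can run two versions of the search side by side; all names here are hypothetical:

```python
def precision_at_k(relevant_ids, result_ids, k=10):
    """Fraction of the top-k returned results that are labeled relevant."""
    return sum(1 for r in result_ids[:k] if r in relevant_ids) / k

def compare_rankers(labeled_queries, old_search, new_search, k=10):
    """Run both search implementations over every labeled query and
    report the mean precision@k delta plus the queries that regressed.

    labeled_queries: {search_term: set of relevant image ids}
    old_search / new_search: callables returning ranked image ids
    """
    regressions, deltas = [], []
    for query, relevant_ids in labeled_queries.items():
        old_p = precision_at_k(relevant_ids, old_search(query), k)
        new_p = precision_at_k(relevant_ids, new_search(query), k)
        deltas.append(new_p - old_p)
        if new_p < old_p:
            regressions.append((query, old_p, new_p))
    return sum(deltas) / len(deltas), regressions
```

Run routinely (e.g. in CI), something like this would let an edge-case tweak ship only if the aggregate numbers hold up, and would flag it if a later change silently undoes it.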
====Proposal====
1. Work with the Search team on adapting Relevance Forge to work with our labeled data and MediaSearch (T268653)
2. Integrate with the new labeled data repo mentioned previously
3. Design an initial experiment to test an improvement for a subset of search terms based on relforge/labeled data
Search satisfaction score
---
The existing search satisfaction score is designed for text documents, so it may not reflect how users actually interact with image results.
====Proposal====
1. Have detailed discussions with Search/Morten to see what we can work out from instrumentation of MediaSearch results
2. Run a user-testing round, where we ask users to rate their search experience and compare that rating to the data gathered from their search session
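For step 2, the core of the comparison could be as simple as correlating each participant's explicit rating with the satisfaction score computed from their session instrumentation. A sketch with made-up numbers and field names (in practice we'd want a rank correlation and a proper sample size, but the shape is the same):

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One entry per user-testing session:
# satisfaction score from instrumentation vs. the user's explicit 1-5 rating
scores = [0.9, 0.4, 0.7, 0.2]
ratings = [5, 2, 4, 1]
r = pearson(scores, ratings)
```

A high correlation would give us some confidence that the instrumented score is a usable proxy for satisfaction on MediaSearch; a low one would tell us the score needs rethinking for image results.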