
[EPIC] Put decisions about mediasearch improvements on a sounder experimental footing
Open · Needs Triage · Public

Description

At the moment, when we implement what we hope is an improvement to how MediaSearch works, we run the analysis in https://github.com/cormacparle/media-search-signal-test, check whether the numbers have improved, and if they have, we dither about whether to also run an A/B test before making the new implementation the default.

There are a few issues with this, the main ones being:

  1. we're not sure how good/representative/sensitive our labeled data is
  2. we don't have a great understanding of how to measure the quality of a set of search results; at the moment we only really use precision, and we should probably also consider metrics like DCG (discounted cumulative gain) - see the sketch after this list
  3. we're not sure how well the search-satisfaction score used by the A/B tests reflects user satisfaction with search
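
To make the distinction in (2) concrete, here is a minimal sketch of precision@k versus DCG@k for a single query, assuming graded relevance labels (0 = irrelevant, 1 = partially relevant, 2 = relevant); the function names and example labels are illustrative only:

```lang=python
import math

def precision_at_k(relevance, k):
    """Fraction of the top-k results judged at all relevant (label > 0)."""
    top = relevance[:k]
    return sum(1 for rel in top if rel > 0) / k

def dcg_at_k(relevance, k):
    """Discounted cumulative gain: graded relevance, discounted by rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))

# Two result sets containing the same images: precision@5 is identical,
# but DCG@5 rewards the ranking that puts the most relevant images first.
good_ranking = [2, 2, 1, 0, 0]
bad_ranking = [0, 0, 1, 2, 2]
print(precision_at_k(good_ranking, 5), precision_at_k(bad_ranking, 5))  # 0.6 0.6
print(dcg_at_k(good_ranking, 5), dcg_at_k(bad_ranking, 5))              # ~3.76 vs ~2.13
```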

Here are some ideas for how to address the issues:

Labeled data

We have ~10k labeled image/search-term pairs out of a corpus of ~70M images. All of that data has been used to train our search algorithms.

We don't know how much labeled data we need to reasonably represent the total corpus of images. Statistical techniques may exist to determine this; we don't know them yet, but we could investigate.

We don't have a dedicated held-out test set for measuring how well searches perform independently of the training data. We ought to gather one.

Proposal
  1. use learning curves to investigate how much labeled data we have relative to how much we need (see the sketch after this list)
  2. gather the data we need (both training and test data); as part of this we should probably create a new repository, seeded with the test data we already have in https://github.com/cormacparle/media-search-signal-test, designed specifically to gather, classify and test image/search-term data on a long-term basis
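
For (1), a minimal sketch of a learning-curve check using scikit-learn, under the assumption that the labeled image/search-term pairs can be loaded as a feature matrix X with relevance labels y (load_labeled_pairs is a hypothetical helper, and the classifier is just a placeholder for whatever model we actually train):

```lang=python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Hypothetical loader for the ~10k labeled image/search-term pairs:
# X = one feature row per pair, y = relevance label per pair.
X, y = load_labeled_pairs()

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="f1_macro",
)

# If the cross-validated score has plateaued well before the largest
# training size, more labels of the same kind are unlikely to help;
# if it is still rising at ~10k examples, we probably need more data.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:6d} training examples -> mean validation F1 = {score:.3f}")
```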

Measuring the quality of search result sets

Our current analysis tools are fairly rough-and-ready.

Our code contains tweaks for specific edge cases, such as search terms that match many Wikidata items, or multi-word search strings. We have manually verified that these work for a small number of test cases, but we have no general way of running experiments on these kinds of small changes to check that they haven't made things worse elsewhere, or of ensuring that they aren't undone by subsequent changes.

Proposal
  1. Once T280245 is done, create a process or scripts to replace our existing analysis tools with relforge or RRE (whichever the Search team prefers)
  2. Design an initial experiment to test an improvement for a subset of search terms based on relforge/labeled data (a sketch of such a comparison follows this list)
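
Whichever tool we choose, an experiment like (2) reduces to scoring the same labeled query set under two configurations and checking whether the aggregate metric moved. A rough sketch, where run_query, the configs, and the labeled inputs are hypothetical stand-ins for whatever backend and data we end up using (dcg_at_k is as in the earlier sketch):

```lang=python
def mean_metric(queries, labels, search_config, metric, k=10):
    """Average a per-query relevance metric over a labeled query set.

    run_query(term, config) is a hypothetical function returning ranked
    result IDs; labels[term] maps result IDs to graded relevance labels.
    """
    scores = []
    for term in queries:
        results = run_query(term, search_config)
        relevance = [labels[term].get(result_id, 0) for result_id in results[:k]]
        scores.append(metric(relevance, k))
    return sum(scores) / len(scores)

# Compare a baseline configuration against one with an edge-case tweak:
# baseline = mean_metric(queries, labels, baseline_config, dcg_at_k)
# candidate = mean_metric(queries, labels, tweaked_config, dcg_at_k)
# A tweak is only kept if it does not regress the overall metric; a paired
# significance test over the per-query scores tells us whether a difference
# is real rather than noise.
```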

A/B testing

Since we switched to MediaSearch, the metrics we used for A/B testing are no longer being gathered. We need to make A/B testing possible again as soon as we can.

Proposal
  1. Have detailed discussions with Search/Morten to see what we can work out from instrumentation of MediaSearch results
  2. Run a user-testing round in which we get users to rate their search experience, and compare those ratings to the data gathered from their search sessions
