
[XL] Get a sense of how much labeled data we need by plotting learning curves
Open, Needs Triage, Public

Description

We have ~10k labeled image/search-term pairs out of a set of ~70M images. All of that data has been used to train our search algorithms.

We don't know how much labeled data we need to reasonably represent the total corpus of images.

To try and get an idea of this, let's plot some learning curves and see what they look like.

The basic idea is to train your model starting with a small subset of your data, and measure the accuracy of the model. You gradually increase the size of the training dataset, retrain, and plot accuracy against training dataset size. The assumption is that at some stage the data will be representative enough that adding more data won't affect the accuracy much, and that the shape of the plots will give us some idea of whether we've reached that stage.
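
For illustration, here's a minimal sketch of that idea using scikit-learn's learning_curve helper. The feature matrix, labels and scoring metric below are placeholder assumptions for the sketch, not our actual data or setup.

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder data: one row of search-signal scores per labeled
# image/search-term pair, y = 1 for 'good' ratings, 0 for 'bad'
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.05, 1.0, 10),  # 5% .. 100% of the training data
    cv=5,
    scoring="average_precision",
)

# Plot the mean cross-validated score against training set size:
# the curve should flatten once extra labels stop helping.
plt.plot(train_sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("number of training samples")
plt.ylabel("average precision (CV mean)")
plt.show()
```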

Here's the model training data

The basic steps are as follows (a rough code sketch of the loop follows the list):

  1. select a random subset of size N of the training data, and use logreg.py to find search signal scores based on that
  2. set the search signal scores, and use our analysis tool to measure how good the search is (both tools output various information retrieval metrics, so you'll have to use your judgement to decide which are the best ones)
  3. repeat 1-2 for other random subsets of size N, and average the resulting "accuracy" measures
  4. repeat 1-3 for increasing N all the way up to N = (all the labeled data we have)
  5. plot N against accuracy
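
A rough sketch of that loop is below. run_logreg and measure_search_quality are hypothetical stand-ins for logreg.py and the analysis tool (their real interfaces differ); the dummy bodies are only there so the sketch runs.

```
import random
import statistics

# Hypothetical stand-ins for the real tools, for illustration only:
# run_logreg would call logreg.py to fit search-signal scores, and
# measure_search_quality would call the analysis tool and return one retrieval metric.
def run_logreg(pairs):
    return {"descriptions.en": 1.0, "category": 0.5}  # dummy weights

def measure_search_quality(signal_scores):
    return random.random()  # dummy metric

def learning_curve_points(labeled_pairs, sizes, repeats=5):
    """Average the metric over several random subsets of each size N."""
    points = []
    for n in sizes:
        metrics = []
        for _ in range(repeats):
            subset = random.sample(labeled_pairs, n)               # step 1
            signal_scores = run_logreg(subset)                     # step 1
            metrics.append(measure_search_quality(signal_scores))  # step 2
        points.append((n, statistics.mean(metrics)))               # steps 3-4
    return points  # step 5: plot N against the averaged metric
```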

Some references

Event Timeline

CBogen renamed this task from "Get a sense of how much labeled data we need by plotting learning curves" to "[XL] Get a sense of how much labeled data we need by plotting learning curves". Apr 21 2021, 4:59 PM

tl;dr: gathering more labeled data does not look like it will measurably improve the precision of our results, so there's no point in making a big effort to do it.


Here's the longer version, with the approach I finally settled on after trying lots of different things (a code sketch of the core train/test loop follows the list):

  1. used an elasticsearch featureset to get elasticsearch scores for the different search signals for each search term/labeled data pair we have
    • so for example I ran a search for Big Ben, and stored the scores from elasticsearch for descriptions.en, category, text, etc for each of the results that we have ratings for (e.g. the image Big Ben (8921228937).jpg rated as 'good', the image Paeonia 'Big Ben' (1992-1380*B) Buds.jpg rated as bad, etc)
  2. stored all the ratings and scores in a file
  3. shuffled the file randomly and split it into a training dataset and a test dataset
  4. trained a logistic regression model on the first N rows of the training dataset, for various values of N
  5. tested the model against the test dataset, and calculated average precision and precision@25 for the model
  6. plotted the number of rows in the training dataset against average precision and precision@25, to see whether more rows give us better accuracy
  7. iterated from step 3 many times to smooth out effects of noise in the data
  8. calculated average precision and precision@25 for various N using the model via the search api, to make sure the effect of increasing N is comparable when using the real search engine
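
For reference, here's a minimal sketch of steps 3-7. The file name, column layout, test-set size and N values are assumptions for illustration; the real code is in logreg.py (linked below).

```
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

# Assumed layout: one row per (search term, image) pair, with a binary
# 'rating' column (1 = good, 0 = bad) plus one column per elasticsearch signal score.
df = pd.read_csv("ratings_and_scores.csv")
feature_cols = [c for c in df.columns if c != "rating"]

def precision_at_k(y_true, y_score, k=25):
    top = np.argsort(y_score)[::-1][:k]   # indices of the k highest-scoring rows
    return y_true[top].mean()

results = []
for shuffle in range(10):                                        # step 7: repeat to smooth out noise
    shuffled = df.sample(frac=1, random_state=shuffle)           # step 3: shuffle ...
    train, test = shuffled.iloc[:-2000], shuffled.iloc[-2000:]   # ... and split (test size assumed)
    for n in (100, 250, 500, 1000, 2500, 5000, len(train)):      # step 4: train on first N rows
        model = LogisticRegression(max_iter=1000).fit(
            train[feature_cols].head(n), train["rating"].head(n))
        scores = model.predict_proba(test[feature_cols])[:, 1]   # step 5: score the test set
        results.append({
            "shuffle": shuffle,
            "n": n,
            "average_precision": average_precision_score(test["rating"], scores),
            "precision_at_25": precision_at_k(test["rating"].to_numpy(), scores),
        })
# step 6: plot n against the two metrics, one line per shuffle
```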

What I found was:

  • once you get past a fairly low number of samples (approx 1000) in the training data, adding more samples makes very little difference to the precision scores
  • this effect is the same no matter how you shuffle the data, and is apparent whether testing against the test dataset directly or testing the whole dataset via the search api

Example plot of number of samples versus average precision (the different colours are different shuffles of the data, each with a different training/testing split):

chart.png (269×434 px, 9 KB)

Google sheet with more data/workings out

FWIW here's the code used to train/test the logistic regression model:
https://github.com/cormacparle/media-search-signal-test/blob/master/logreg.py

And here's the ranklib file used as a source of training data.