We have ~10k labeled image/search-term pairs out of a set of ~70M images. All of the labeled data has been used to train our search algorithms.
We don't know how much labeled data we need to reasonably represent the total corpus of images.
To get an idea of this, let's plot some learning curves and see what they look like.
The basic idea is to train the model on a small subset of the data and measure its accuracy. We then gradually increase the size of the training set, retrain, and plot accuracy against training set size. The assumption is that at some point the data becomes representative enough that adding more barely changes the accuracy, and that the shape of the curve will give us some idea of whether we've reached that point.
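One rough way to turn "adding more data won't really affect the accuracy much" into a check is to look at the accuracy gained per extra labeled example between the last two points on the curve. The sketch below is only an illustration: the function names, the tolerance, and the numbers in the example are all made up, not measurements from our data.

```python
from typing import List, Sequence, Tuple


def marginal_gains(curve: Sequence[Tuple[int, float]]) -> List[float]:
    """Accuracy gained per additional training example between consecutive points.

    `curve` is a list of (training_set_size, accuracy) pairs sorted by size.
    """
    gains = []
    for (n0, a0), (n1, a1) in zip(curve, curve[1:]):
        gains.append((a1 - a0) / (n1 - n0))
    return gains


def has_plateaued(curve: Sequence[Tuple[int, float]], tol: float = 1e-6) -> bool:
    """True if the last step of the curve is flatter than `tol` per example.

    `tol` is an arbitrary illustrative threshold; pick one that matches the
    metric you settle on and the noise you see between repeated runs.
    """
    gains = marginal_gains(curve)
    return bool(gains) and abs(gains[-1]) < tol


# Made-up numbers, purely to show the shape of the input.
toy_curve = [(1000, 0.60), (2000, 0.70), (4000, 0.75), (8000, 0.751)]
print(has_plateaued(toy_curve))      # last step is nearly flat
print(has_plateaued(toy_curve[:3]))  # still climbing at 4000 examples
```

In practice the eye test on the plot is probably enough; a check like this mostly helps when comparing several metrics at once.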
Here's the model training data
The basic steps are:
1. Select a random subset of size N of the training data, and use logreg.py to find search signal scores based on it.
2. Set the search signal scores, and use our analysis tool to measure how good the search is. (Both tools output various information retrieval metrics, so you'll have to use your judgement to decide which are the best ones.)
3. Repeat steps 1-2 for other random subsets of size N, and average the results to get an "accuracy" measure for that N.
4. Repeat steps 1-3 for increasing N, all the way up to N = (all the labeled data we have).
5. Plot N against accuracy.
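
The steps above can be sketched as a small driver loop. This is a minimal sketch, not a description of how logreg.py or the analysis tool are actually invoked: `fit_scores` and `evaluate` are hypothetical wrappers you would write around those two tools, and `repeats`, `seed`, and the plotting helper are assumptions too. It also assumes the analysis tool is run against the same evaluation queries every time, so that numbers for different N are comparable.

```python
import random
import statistics
from typing import Any, Callable, List, Sequence, Tuple


def learning_curve(
    labeled: Sequence[Any],
    sizes: Sequence[int],
    fit_scores: Callable[[Sequence[Any]], Any],  # hypothetical wrapper around logreg.py
    evaluate: Callable[[Any], float],            # hypothetical wrapper around the analysis tool
    repeats: int = 5,
    seed: int = 0,
) -> List[Tuple[int, float, float]]:
    """Return (N, mean metric, standard deviation) for each training size N."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    pool = list(labeled)
    curve = []
    for n in sizes:
        n = min(n, len(pool))  # the largest N is all the labeled data we have
        metrics = []
        for _ in range(repeats):
            subset = rng.sample(pool, n)      # random subset of size N, without replacement
            scores = fit_scores(subset)       # find search signal scores from the subset
            metrics.append(evaluate(scores))  # measure search quality with those scores
        mean = statistics.fmean(metrics)
        spread = statistics.stdev(metrics) if len(metrics) > 1 else 0.0
        curve.append((n, mean, spread))
    return curve


def plot_curve(curve: Sequence[Tuple[int, float, float]], path: str) -> None:
    """Plot N against the averaged metric, with error bars showing the spread."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    sizes = [n for n, _, _ in curve]
    means = [m for _, m, _ in curve]
    spreads = [s for _, _, s in curve]
    plt.errorbar(sizes, means, yerr=spreads, marker="o")
    plt.xlabel("training set size N")
    plt.ylabel("accuracy (chosen IR metric)")
    plt.savefig(path)
```

Keeping the spread alongside the mean is a deliberate choice: if the error bars at the largest N overlap with those at smaller N, the curve is flat to within noise, which is the thing we are trying to find out. Note that when N equals the full labeled set, every "random subset" is the same data, so the spread there only reflects any randomness inside the tools themselves.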