Measuring what users want, and whether they are satisfied with the results they get, is very hard, of course, but we can still measure the magnitude of a change's effect before we ever deploy it.
For example, we could take a random sub-sample of a day's query traffic (1K, 10K, 100K, or 1M queries, depending on the test), submit those queries to two (or a hundred!) variants of the relevant indexes, and then measure the change in several metrics:
- # queries with zero results
- # queries with changes in order in the top-N (5?, 10?, 20?) results
- # queries with new results in the top-N results
- # queries with changes in total results (very pretty 2-D graphs await!)
- etc.
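A minimal sketch of computing the per-query metrics above, assuming we have the ranked result lists (query → list of page IDs) from a baseline run and a variant run; the data shapes and function names here are illustrative, not an existing tool:

```python
from typing import Dict, List

def compare_runs(baseline: Dict[str, List[str]],
                 variant: Dict[str, List[str]],
                 top_n: int = 20) -> Dict[str, int]:
    """Count per-query differences between two index variants."""
    metrics = {
        "zero_results": 0,   # queries with zero results in the variant
        "order_changed": 0,  # same top-N set, different order
        "new_in_top_n": 0,   # variant top-N contains results the baseline's lacks
        "total_changed": 0,  # different total number of results
    }
    for query, base_results in baseline.items():
        var_results = variant.get(query, [])
        if not var_results:
            metrics["zero_results"] += 1
        if len(base_results) != len(var_results):
            metrics["total_changed"] += 1
        base_top, var_top = base_results[:top_n], var_results[:top_n]
        if set(var_top) - set(base_top):
            metrics["new_in_top_n"] += 1
        if set(var_top) == set(base_top) and var_top != base_top:
            metrics["order_changed"] += 1
    return metrics
```

Histograms of these counts across runs would feed directly into the 2-D graphs mentioned above.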
This will let us very quickly test whether a change even does anything: a change that has no effect on the top 20 results for any of 100K queries obviously isn't going to be a game changer.
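The sampling step itself is simple; a sketch, assuming a query log with one query per line (the log format and fixed seed are assumptions for illustration):

```python
import random

def sample_queries(log_lines, k=1000, seed=0):
    """Draw a fixed-size random sample from a day's query log."""
    queries = [line.strip() for line in log_lines if line.strip()]
    rng = random.Random(seed)  # fixed seed so a run is reproducible
    return rng.sample(queries, min(k, len(queries)))
```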
To do this, we need:
- a cluster in labs to send the data to and do the analysis on.
- automation to clear out old test indices and/or bring in new indices from prod
- sets of queries to test against a few different wikis
- machinery to export config from prod and import it while running search tests.
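For the cleanup automation, one possible approach is to age out test indices by a date suffix in their names; this is only a sketch, and the naming scheme (e.g. `testwiki_content_20240101`) is an assumption:

```python
from datetime import datetime, timedelta

def stale_indices(index_names, now, max_age_days=7):
    """Return test index names whose trailing YYYYMMDD suffix is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for name in index_names:
        try:
            created = datetime.strptime(name.rsplit("_", 1)[-1], "%Y%m%d")
        except ValueError:
            continue  # skip names without a parseable date suffix
        if created < cutoff:
            stale.append(name)
    return stale
```

The returned names could then be handed to whatever deletion mechanism the test cluster uses.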