The primary metric optimized at the end of the training process is NDCG@k. It would be quite useful to take the original input data, after the relevance labels have been calculated, and compute the NDCG there as well. This would let us compare the trained models against the historical rankings we have been serving to users. I'm not sure how important this will be in the long term, but in the short term it should give us a reasonable indication of whether training can produce better results than what CirrusSearch already provides.
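For reference, a minimal sketch of computing NDCG@k from a list of relevance labels, using the linear-gain variant (some formulations use 2^rel - 1 as the gain instead; the labels and k value here are made up for illustration):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg

# Hypothetical relevance labels, in the order the results were returned
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```

A perfectly ordered result list scores 1.0, so NDCG on the historical rankings and on the model's rankings can be compared directly.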
Note that this comparison will still be overly optimistic. We only use queries with >= 10 repeats in the training process, which covers about 40% of all user sessions, so we can only calculate NDCG@k for those as well. It is possible, perhaps even likely, that performance on queries with fewer repeats will be worse than on the queries we use for training.