We need a way to run batches of queries offline and evaluate how a change to the engine affected our results. We have some basics built up here in relevance lab, but we don't calculate any top-level numbers that give a feel for whether the results are really better, or just different.
One possibility is to use nDCG, but this requires a relevance grade from 0..3 for each result. We could build out something in relevance lab that lets team members (people with access to PII queries) assign these 0..3 grades to results. It would take some time, but we would be able to build up a corpus of query -> result gradings for use in generating metrics.
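For reference, nDCG over those 0..3 grades is only a few lines of Python. This is a minimal sketch, not an existing relevance-lab API; the function names and the cutoff k are illustrative:

```python
import math

def dcg(grades):
    """Discounted cumulative gain: each grade is discounted by log2 of its rank."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades, k=10):
    """Normalized DCG at k: actual DCG divided by the DCG of the ideal ordering.

    `grades` is the list of 0..3 relevance grades in the order the engine
    returned the results. Returns a value in [0, 1]; 1.0 means the engine
    ranked the graded results perfectly.
    """
    actual = dcg(grades[:k])
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Averaging `ndcg` across the query corpus would give the kind of single top-level number we are missing today: run the batch before and after an engine change and compare the two averages.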
We will also have to put some thought into where to store this data so that we can use it when doing evaluations. Perhaps we can work something out with legal where queries are 'vetted' and available for public disclosure after they have gone through the process. The process would likely need an additional step where a query can be flagged as PII and removed from the query set; not sure yet.
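As a sketch of what the stored judgments and the vetting step could look like — all field names and status values here are hypothetical, since the actual process would depend on what we work out with legal:

```python
# Hypothetical flat record per (query, result) judgment. A query only enters
# the public evaluation set once its status is "vetted"; anything flagged
# "pii" is dropped entirely.
judgments = [
    {"query": "example search",  "doc_id": "D1", "grade": 3, "status": "vetted"},
    {"query": "jane doe email",  "doc_id": "D2", "grade": 2, "status": "pii"},
    {"query": "pending review",  "doc_id": "D3", "grade": 1, "status": "unreviewed"},
]

def evaluation_set(records):
    """Keep only judgments whose query has passed vetting."""
    return [r for r in records if r["status"] == "vetted"]
```

With something like this, the offline batch runner would only ever read the output of `evaluation_set`, so PII queries never leave the review tool.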