Erik has started gathering data to determine which languages we want to test search relevance on next. This may be constrained by how many people we have access to who can help translate the questions, but we might already have enough translations in translatewiki to accommodate this next set of tests.
Actually, the difficulty here isn't necessarily translating the questions; translatewiki will help out there. The real problems will be:
- A pair of people with signed NDAs needs to review a sample of queries to remove any PII.
- Do we need to run the first round of testing with multiple questions, like we did on enwiki? If so, how will we evaluate the results without a ground-truth answer set from Discernatron telling us which question produced the best answers?
- To choose the wikis to run the survey on, we need to decide on a cutoff for how many query/page pairs we need information about before the data is useful. This may not be a big concern, though, since the size of the training data it will be combined with is also a function of site popularity.
- We need to choose sampling rates that collect enough data but don't significantly bother people. IIRC the enwiki sampling rates worked out to displaying the survey on around 0.1% of page views.
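For picking per-wiki sampling rates, the arithmetic is simple enough to sanity-check up front. A minimal sketch, where the target impression count, page-view volume, and run length are all hypothetical placeholders rather than measured values:

```python
# Back-of-the-envelope check of a survey sampling rate.
# All numbers below are hypothetical placeholders, not measured values.

def sampling_rate(target_impressions: int, total_page_views: int) -> float:
    """Fraction of page views that must show the survey to hit the target."""
    return target_impressions / total_page_views

# Suppose a smaller wiki gets ~5M article page views per day and we want
# ~50k survey impressions over a two-week run:
rate = sampling_rate(50_000, 5_000_000 * 14)
print(f"{rate:.4%}")  # well under the ~0.1% used on enwiki
```

The useful direction is usually the inverse: fix the rate at something unobtrusive and check whether the resulting impression count clears whatever query/page-pair cutoff we settle on.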
After running the test:
- We need some way to evaluate the results. We can apply the model learned from enwiki to translate the survey responses into relevance scores, but then we need to sort results by those scores and have people who know the language evaluate whether the resulting orderings are actually sane.
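The shape of that evaluation step can be sketched roughly as follows. This is not the actual enwiki model; a naive "share of positive responses" stands in for it, and the data structures are hypothetical, but the sort-then-eyeball workflow is the same:

```python
# Hypothetical sketch: aggregate per-(query, page) survey responses into a
# relevance score, then order each query's pages best-first for human review.
# A simple positive-response ratio stands in for the learned enwiki model.

from collections import defaultdict

def score(responses: list[bool]) -> float:
    """Placeholder scorer: fraction of positive survey responses."""
    return sum(responses) / len(responses) if responses else 0.0

def rank_pages(survey_data: dict[tuple[str, str], list[bool]]) -> dict[str, list[str]]:
    """Map each query to its pages sorted by descending relevance score."""
    by_query: dict[str, list[tuple[float, str]]] = defaultdict(list)
    for (query, page), responses in survey_data.items():
        by_query[query].append((score(responses), page))
    return {q: [page for _, page in sorted(pairs, reverse=True)]
            for q, pairs in by_query.items()}

# Toy data: two candidate pages for one query.
data = {
    ("apple", "Apple_Inc."): [True, True, False],
    ("apple", "Apple"): [True, True, True],
}
print(rank_pages(data))  # {'apple': ['Apple', 'Apple_Inc.']}
```

The output of something like `rank_pages` is what native speakers would then review per language to judge whether the orderings look sane.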
Can you take a read through @EBernhardson's comments here (T175049#3689055) and let us know if you have suggestions on how we can test our search relevance theories using languages other than English?