This should be a quick process (< 1/2 day total) to optimize the language choices for Italian, German, Spanish, and French for recall (F2) instead of precision (F0.5). There's some debate over which is better, so it would be good to know (a) if it makes a difference, and (b) what languages to use in each case.
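For reference, F0.5 and F2 are both instances of the standard F-beta score; a quick sketch shows how the beta parameter shifts the weight between precision and recall:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta < 1 weights precision more heavily (e.g. F0.5),
    beta > 1 weights recall more heavily (e.g. F2)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example (made-up numbers): precision 0.9, recall 0.6
print(round(f_beta(0.9, 0.6, 0.5), 3))  # 0.818 — dominated by the high precision
print(round(f_beta(0.9, 0.6, 2.0), 3))  # 0.643 — dragged down by the lower recall
```

So tuning for F2 instead of F0.5 should, in principle, favor configurations that catch more queries at the cost of more wrong guesses.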
The most obvious version of this, maximizing recall on the same data, doesn't do anything. Overall recall and precision are tightly coupled when only one language is allowed per query. This is because any false positive for one language is a false negative for another language, except in the rare cases where no answer can be given.
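The coupling is easy to see in a toy example (made-up labels, not the annotated data): with exactly one predicted language per query, every misclassification is counted once as a false positive (for the predicted language) and once as a false negative (for the true language), so the micro-averaged totals match and overall precision equals overall recall.

```python
from collections import Counter

# Toy data: gold language vs. single predicted language per query.
gold = ["it", "de", "es", "fr", "it", "de"]
pred = ["it", "es", "es", "fr", "de", "de"]  # two mistakes

tp = sum(g == p for g, p in zip(gold, pred))
fp, fn = Counter(), Counter()
for g, p in zip(gold, pred):
    if g != p:
        fp[p] += 1  # a false positive for the predicted language...
        fn[g] += 1  # ...is simultaneously a false negative for the true one

precision = tp / (tp + sum(fp.values()))
recall = tp / (tp + sum(fn.values()))
assert precision == recall  # tightly coupled, as described above
```

Only queries where no language is returned at all break this symmetry, which is why tuning for recall alone moves nothing.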
So, I dug into other ways of increasing recall, including allowing multiple languages to be given as results for each query, and completely ignoring the language of the wiki in question in order to maximize potential off-wiki results (i.e., "coverage"). I looked at 7 permutations for each of the 4 languages I'd recently annotated data for.
TL;DR: I prefer per-wiki tuning, but it seems like a reasonable generic recommendation for improving recall and/or coverage would be to allow a second language result from TextCat, and if you prefer coverage over accuracy, ignore the home language of the wiki.
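The two knobs in that recommendation can be sketched as follows. This is a hypothetical selection function over detector scores, not the actual TextCat API; the scores dict and language codes are made up for illustration (TextCat-style distance scores, lower is better):

```python
def pick_languages(scores, home_lang=None, max_results=2, ignore_home=False):
    """scores: mapping of language code -> distance-style score (lower is
    better). Returns up to max_results languages, optionally excluding the
    home language of the wiki to maximize off-wiki coverage."""
    ranked = sorted(scores, key=scores.get)  # best (lowest) score first
    if ignore_home and home_lang is not None:
        ranked = [lang for lang in ranked if lang != home_lang]
    return ranked[:max_results]

# Made-up scores for a single query:
scores = {"it": 120, "es": 135, "fr": 250, "de": 400}
print(pick_languages(scores))                                     # ['it', 'es']
print(pick_languages(scores, home_lang="it", ignore_home=True))   # ['es', 'fr']
```

The first call corresponds to "allow a second language result"; the second to "ignore the home language of the wiki" when coverage matters more than accuracy.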
Detailed results are here:
I think I have to agree with your preference for per-wiki tuning. Tuning for recall itself didn't do anything. Offering a second language seems to raise recall by such a small amount that it might not be much of a win for our users. The analysis looks great and gives the decisions you've made solid grounding.
I guess the important thing to note is that we can expect to get similar or at least decent accuracy on languages other than the language of the wiki we're on if we don't include it (though some of these samples are too small to be definitive). The big difference is that we could make a lot more cross-wiki queries if we let everything try to be in some language other than the language of the wiki we're on—these are the potential "silly" queries.
Two questions remain: would they help much? (T136034), and do we mind our results potentially looking silly? (I'm talking to Design Research about that.)
Comments from @TJones, so that we can wrap up and close out this task:
I think [we can close it]. @EBernhardson, @dcausse and I talked about it more and I think everyone is on board with the current plan. Tuning for recall didn't change much, and if we really wanted to push recall (over precision) we could, say, try to get results in more than one language. But the cost (double the search time) outweighs the benefit (very low additional recall) right now. We seem to be trending in the other direction at the moment, with the TextCat confidence idea, which would dial back recall for better precision.
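One way the confidence idea could work (a hedged sketch with made-up numbers and a hypothetical margin parameter, not the actual TextCat implementation): only return a detected language when the best candidate clearly beats the runner-up, and return nothing when the call is too close, trading recall for precision.

```python
def confident_language(scores, margin=0.10):
    """scores: language code -> distance-style score (lower is better).
    Return the best language only if the runner-up's score is at least
    `margin` (relative) worse; otherwise return None (no result)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    best, second = ranked[0], ranked[1]
    if second[1] >= best[1] * (1 + margin):
        return best[0]
    return None  # too close to call: skip rather than risk a wrong guess

print(confident_language({"it": 100, "es": 140}))  # 'it' (clear margin)
print(confident_language({"it": 100, "es": 105}))  # None (ambiguous)
```

Every query dropped this way is a hit to recall and a (likely) win for precision, which is exactly the F0.5-over-F2 direction the thread describes.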