Run tests to measure the expected zero results change of running zero result enwiki queries against other languages (ru, jp, etc?)
Closed, Resolved, Public


I don't think we can do a full end-to-end test: getting the enwiki index along with the other language indexes into our hypothesis-testing cluster is probably too much to ask of it. We can probably, though, detect the language of zero-result queries from enwiki and import the 2-4 most relevant indexes.

So basically:

  • Extract some number of zero-result queries from the enwiki request logs (ideally a sample large enough to include a useful number of foreign-language queries)
  • Run all those queries against the language detection plugin and come up with a list of the most relevant indexes to import
  • Import the relevant indexes into the hypothesis-testing cluster
  • Run the queries through the language detector again, this time running them against the suggested indexes, and report the results.

Or something like that, adjust as needed.
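
The second step above (picking which indexes to import) can be sketched roughly as follows. This is only an illustration: `toy_detect` is a crude script-based placeholder standing in for the language detection plugin, the `<lang>wiki` naming is an assumption, and the sample queries are made up.

```python
import unicodedata
from collections import Counter


def toy_detect(query):
    """Hypothetical stand-in for the real language detector.

    Guesses a language from the Unicode script of the first matching
    letter. A script is not a language, so this is only a placeholder.
    """
    for ch in query:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            return "ru"
        if name.startswith("ARABIC"):
            return "ar"
        if name.startswith(("HIRAGANA", "KATAKANA")):
            return "ja"
    return "en"


def top_wikis(queries, detect, n=4, skip=("en",)):
    """Return the n most frequently detected non-skipped wikis."""
    counts = Counter(detect(q) for q in queries)
    for lang in skip:
        counts.pop(lang, None)
    # Assumed naming convention: language code + "wiki".
    return [f"{lang}wiki" for lang, _ in counts.most_common(n)]


# Made-up sample of zero-result queries.
sample = ["москва", "привет", "こんにちは", "hello"]
print(top_wikis(sample, toy_detect, n=2))  # ['ruwiki', 'jawiki']
```

Swapping `toy_detect` for a call to the actual detector would leave the counting logic unchanged.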

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a project: CirrusSearch.
EBernhardson added a subscriber: EBernhardson.
Restricted Application added a project: Discovery. Aug 20 2015, 4:44 PM
Restricted Application added a subscriber: Aklapper.
EBernhardson renamed this task from Run tests to measure the expected zero results change of changing suggestions algorithm to Run tests to measure the expected zero results change of running zero result enwiki queries against other languages (ru, jp, etc?). Aug 20 2015, 4:46 PM
EBernhardson set Security to None.
EBernhardson updated the task description. Aug 24 2015, 8:35 PM
EBernhardson updated the task description.
Deskana triaged this task as Normal priority. Aug 27 2015, 4:54 PM
Deskana added a subscriber: Deskana.
TJones added a subscriber: TJones. Edited Aug 27 2015, 7:56 PM

I manually reviewed 1047 zero-results queries from enwiki (full text). Results are here:

The section on foreign languages has a breakdown by language, though some music and other "content" searches were also in other languages (esp. Spanish and Portuguese).

On stat1002 there are two files:
/home/tjones/T109731.full — this one has all 1047 queries, with my category and api/web info
/home/tjones/T109731.trim — this one is trimmed down to the 173 queries that are not in English, including the ones categorized as foreign languages, and also items in other categories that are not in English. A few names may have slipped in too.

Maybe those are useful for gauging type and scope of foreign language content. Some categories need review (Cyrillic and Arabic are based on script, not language, for example.)

TJones added a comment. Sep 4 2015, 9:09 PM

My analysis of the performance of the ElasticSearch language identification plugin is here:

We can use the same format and metrics to measure and compare other language detection methods.

TJones added a subscriber: dcausse. Sep 8 2015, 2:42 PM

I've added a couple more sections to the Language Detection Evaluation. @dcausse discovered that the ElasticSearch plugin works better with spaces added around the query, so I tested that. I also evaluated the current method of "assume everything is English" to compare recall and precision numbers; the comparison shows that recall and precision are not all that matter!
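
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of per-language precision and recall, applied to an "assume everything is English" baseline. The labels below are made up for illustration and are not from the evaluation; `pad_query` is a hypothetical helper showing only the space-padding trick, not how the query is passed to the plugin.

```python
def precision_recall(gold, predicted, lang):
    """Precision and recall for one language label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == lang and p == lang)
    fp = sum(1 for g, p in pairs if g != lang and p == lang)
    fn = sum(1 for g, p in pairs if g == lang and p != lang)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pad_query(query):
    """Hypothetical helper: surround the query with spaces before detection."""
    return f" {query} "


# Made-up gold labels: three English queries, one Russian, one Spanish.
gold = ["en", "en", "en", "ru", "es"]
# Baseline: label every query as English.
baseline = ["en"] * len(gold)

print(precision_recall(gold, baseline, "en"))  # (0.6, 1.0)
print(precision_recall(gold, baseline, "ru"))  # (0.0, 0.0)
```

On this toy data the baseline has perfect recall for English while never identifying a single non-English query, which is one way the headline numbers can look fine without the method being useful.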

I've done further analysis on the ~1400 zero-results non-DOI query corpus, looking at the effects of perfect (or at least human-level) language detection, and the effects of running all queries against many wikis.

In summary:

  • More than 85% of failed queries to enwiki are in English, or are not in a particular language. Only about 35% of non-English queries in some language (<4.5% of zero-results queries), if funneled to the right language wiki, get any results.
  • The types of queries most likely to get results from the non-enwikis are names and queries in English. There are lots of English words in non-English wikis (enough that they can do decent spelling correction!), and the idiosyncrasies of language processing on other wikis allow certain classes of typos in names and English words to match, or the typos happen to exist uncorrected in the non-enwiki.
  • Perhaps a better approach to handling non-English queries is user-specified alternate languages.

More details:

Deskana closed this task as Resolved. Sep 14 2015, 8:22 PM

Based on the above discussion, I believe this task is resolved, and I'm marking it as such.