
Run a test in relevance forge to estimate effects of rewriting misspelled queries
Closed, Declined · Public

Description

We could extract a dictionary, likely from Wiktionary, and use it to spell-correct queries. For example facebok could be transformed into (facebok OR facebook). This is a common zero-result-rate (ZRR) optimization applied by major search engines like Bing and Google.
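
As a rough illustration of the proposed rewrite, here is a minimal sketch assuming a plain word-list dictionary. The eventual implementation would use a real spell checker (pspell/aspell, see below); Python's stdlib difflib stands in here, and the toy dictionary and cutoff are illustrative only.

import difflib

def rewrite_query(query, dictionary):
    """Rewrite each out-of-dictionary term as (term OR suggestion)."""
    out = []
    for term in query.lower().split():
        if term in dictionary:
            out.append(term)
            continue
        # closest dictionary word, if any is similar enough
        suggestions = difflib.get_close_matches(term, dictionary, n=1, cutoff=0.8)
        if suggestions:
            out.append("(%s OR %s)" % (term, suggestions[0]))
        else:
            out.append(term)
    return " ".join(out)

dictionary = ["facebook", "google", "search"]  # toy dictionary
print(rewrite_query("facebok search", dictionary))
# prints: (facebok OR facebook) search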

If the results in relevance lab look promising we could port the PHP spell checking extension (pspell: http://php.net/manual/en/ref.pspell.php) to HHVM and run an A/B test in production.

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.

My first naive attempt at this is horrible; the worst offences are proper names. A better dictionary might help, but we might also need some sort of proper noun recognition ... not sure. Will try a few things. I think the first step will be to build a dictionary from non-redirect titles in enwiki and enwiktionary, both separately and together.

Example rewrites:

El Chapo Guzman -> (El OR Eli) (Chapo OR cheapo) Guzman
Farah Karimi (actress) -> (Farah OR "fa rah") (Karimi OR karin) (actress)
GeForce 1000 series -> (GeForce OR force) 1000 series
LeBron James -> (LeBron OR liberian) James

There are some that aren't so bad though and might be worthwhile:

assasination classroom -> (assasination OR "assassination") classroom
brahimini kite -> (brahmini OR brahmani) kite
burnie sanders -> (burnie OR bernie) sanders
chainsmokers -> (chainsmokers OR "chain smokers")

I've now built up a dictionary by using titles from enwiktionary that have a category starting with English. I've added to that the first 500k names from an 'instance of human' query from WDQS. I've also added the top 90% of names (those reported) from the 1990 US census. I've then tokenized all this into words using the Lucene tokenizer (via clj-tokenizer).
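
Roughly, the dictionary build looks like the sketch below. The actual run tokenized with the Lucene standard tokenizer via clj-tokenizer; a simple regex stands in here, and the input file names are placeholders for the three sources described above.

import re

def tokenize(text):
    # stand-in for the Lucene standard tokenizer
    return re.findall(r"\w+", text.lower())

words = set()
for path in ("enwiktionary_english_titles.txt",   # titles in an English* category
             "wdqs_instance_of_human_500k.txt",   # first 500k 'instance of human' labels
             "census_1990_top90pct_names.txt"):   # top 90% of reported 1990 US census names
    with open(path) as f:
        for line in f:
            words.update(tokenize(line))

with open("dictionary_raw.txt", "w") as f:
    f.write("\n".join(sorted(words)))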

I've then filtered this list back down by extracting term frequencies from the enwiki_content index. I have built multiple dictionaries containing terms that appear at least 1, 10, 50, 100, 200, 500, and 1000 times (a rough sketch of this filtering step follows the table). This gives the following dictionary sizes:

min doc freq | words
           1 | 1204112
          10 |  669442
          50 |  278297
         100 |  179533
         200 |  114928
         500 |   63873
        1000 |   41405
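
The filtering step is essentially the following. It assumes term/document-frequency pairs have already been dumped from the enwiki content index into a tab-separated file, and reads the raw word list from the previous sketch; the file names and format are assumptions, not the actual pipeline.

thresholds = [1, 10, 50, 100, 200, 500, 1000]

# term -> document frequency, dumped from the content index
doc_freq = {}
with open("term_freqs.tsv") as f:
    for line in f:
        term, freq = line.rstrip("\n").split("\t")
        doc_freq[term] = int(freq)

with open("dictionary_raw.txt") as f:
    raw_words = [w.strip() for w in f if w.strip()]

for threshold in thresholds:
    kept = [w for w in raw_words if doc_freq.get(w, 0) >= threshold]
    with open("dictionary_min_df_%d.txt" % threshold, "w") as out:
        out.write("\n".join(kept))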

I then needed a query set to test with. I extracted 10k queries from our TestSearchSatisfaction2 schema. Each query is the first query of a search session, so this is before the user has had a chance to see their error / lack of relevant results and correct their query. The query used was:

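-- First searchResultPage event of each search session on enwiki since 2016-02-21,
-- i.e. the session's initial query, sampled at random.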
select event_query
  from log.TestSearchSatisfaction2_14098806 tss
  join (select event_searchSessionId, min(timestamp) as timestamp
          from log.TestSearchSatisfaction2_14098806
         where timestamp > 20160221000000
           and wiki='enwiki'
           and event_action='searchResultPage'
         group by event_searchSessionId
       ) x 
    on tss.event_searchSessionId = x.event_searchSessionId 
   and tss.timestamp = x.timestamp 
 order by rand() desc 
 limit 10000;

I've now started running these through relevance lab. The first result set has completed, with a min doc freq of 1. This would be our largest possible increase in recall. It reduced the zero result rate of my 10k query sample from 21.5% to 10.3%, a drop of 11.2 percentage points (roughly half). This is a pretty huge reduction, but we don't yet have great ways to measure whether the increased recall has had a negative effect on precision. Will run the other sets and post up a table of how the zero result rate changes based on "better", or at least more discriminating, dictionaries.

I have more complete numbers from last night's run in relevance lab, but sadly it turns out the queries I rewrote are being adjusted by CirrusSearch into not quite what I intended. foo (bar OR baz) becomes foo \(bar OR baz\). This means the OR statement is being applied at a higher level than just (misspelled OR corrected). I'm going to hack something together to allow testing this as envisioned, so elasticsearch can apply the search as intended.
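
For reference, a minimal sketch of the query structure intended here, assuming a query_string query sent directly to elasticsearch rather than going through CirrusSearch's escaping; the example terms come from the list above and the field defaults are left to elasticsearch.

# Intended behaviour: AND across terms, OR only inside the rewritten group.
intended = {
    "query": {
        "query_string": {
            "query": "(burnie OR bernie) sanders",
            "default_operator": "AND",
        }
    }
}
# With default_operator AND, "(burnie OR bernie) sanders" requires sanders plus
# either spelling, which is the (misspelled OR corrected) behaviour the rewrite wants.
# Escaping the parens instead turns them into literal tokens and lifts the OR to the
# top level of the query.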

It does show that we can get some dramatic reductions in ZRR by relaxing the AND requirement, although that's not entirely unexpected. The best set I ran yesterday had a ZRR of 7.2%.

EBernhardson renamed this task from "Run a test in relevancy lab to estimate effects of rewriting misspelled queries" to "Run a test in relevance forge to estimate effects of rewriting misspelled queries". · Mar 9 2016, 10:36 PM
Deskana subscribed.

@EBernhardson started work on this, but had to put it back into the backlog due to other work being prioritised, so I'm unassigning him. It might be worth pinging him if someone picks this up.

Dropping this one from the current sprint. It's very interesting but turning into a time-consuming thing. Once we are closer to having the Q4 goals finished we will bring this back.