
Run a test in relevance forge to estimate effects of rewriting misspelled queries
Closed, Declined · Public

Description

We could extract a dictionary, likely from Wiktionary, and use it to spell-correct queries. For example facebok could be transformed into (facebok OR facebook). This is a common zero-result-rate (ZRR) optimization applied by major search engines like Bing and Google.
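
As a rough illustration of the proposed rewrite, here is a minimal sketch assuming a plain word-list dictionary. The eventual implementation would use a real spell checker (pspell/aspell, see below); Python's stdlib difflib stands in here, and the toy dictionary and cutoff are illustrative only.

import difflib

def rewrite_query(query, dictionary):
    """Rewrite each out-of-dictionary term as (term OR suggestion)."""
    out = []
    for term in query.lower().split():
        if term in dictionary:
            out.append(term)
            continue
        # closest dictionary word, if any is similar enough
        suggestions = difflib.get_close_matches(term, dictionary, n=1, cutoff=0.8)
        if suggestions:
            out.append("(%s OR %s)" % (term, suggestions[0]))
        else:
            out.append(term)
    return " ".join(out)

dictionary = ["facebook", "google", "search"]  # toy dictionary
print(rewrite_query("facebok search", dictionary))
# prints: (facebok OR facebook) search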

If the results in relevance lab look promising we could port the PHP spell checking extension (pspell: http://php.net/manual/en/ref.pspell.php) to HHVM and run an A/B test in production.

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.

My first naive attempt at this is horrible; the worst offences are proper names. A better dictionary might help, but we might also need some sort of proper noun recognition ... not sure. Will try a few things. I think the first step will be to build a dictionary from non-redirect titles in enwiki and enwiktionary, both separately and together.

Example rewrites:

El Chapo Guzman -> (El OR Eli) (Chapo OR cheapo) Guzman
Farah Karimi (actress) -> (Farah OR "fa rah") (Karimi OR karin) (actress)
GeForce 1000 series -> (GeForce OR force) 1000 series
LeBron James -> (LeBron OR liberian) James

There are some that aren't so bad though and might be worthwhile:

assasination classroom -> (assasination OR "assassination") classroom
brahimini kite -> (brahmini OR brahmani) kite
burnie sanders -> (burnie OR bernie) sanders
chainsmokers -> (chainsmokers OR "chain smokers")

I've now built up a dictionary by using titles from enwiktionary that have a category starting with English. I've added to that the first 500k names from an 'instance of human' query from WDQS. I've also added the top 90% of names (those reported) from the 1990 US census. I've then tokenized all this into words using the Lucene tokenizer (via clj-tokenizer).
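
Roughly, the dictionary build looks like the sketch below. The actual run tokenized with the Lucene standard tokenizer via clj-tokenizer; a simple regex stands in here, and the input file names are placeholders for the three sources described above.

import re

def tokenize(text):
    # stand-in for the Lucene standard tokenizer
    return re.findall(r"\w+", text.lower())

words = set()
for path in ("enwiktionary_english_titles.txt",   # titles in an English* category
             "wdqs_instance_of_human_500k.txt",   # first 500k 'instance of human' labels
             "census_1990_top90pct_names.txt"):   # top 90% of reported 1990 US census names
    with open(path) as f:
        for line in f:
            words.update(tokenize(line))

with open("dictionary_raw.txt", "w") as f:
    f.write("\n".join(sorted(words)))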

I've then filtered this list back down by extracting term frequencies from the enwiki_content index. I have built multiple dictionaries containing terms that appear at least 1, 10, 50, 100, 200, 500, and 1000 times (a rough sketch of this filtering step follows the table). This gives the following dictionary sizes:

min doc freq | words
           1 | 1204112
          10 |  669442
          50 |  278297
         100 |  179533
         200 |  114928
         500 |   63873
        1000 |   41405
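
The filtering step is essentially the following. It assumes term/document-frequency pairs have already been dumped from the enwiki content index into a tab-separated file, and reads the raw word list from the previous sketch; the file names and format are assumptions, not the actual pipeline.

thresholds = [1, 10, 50, 100, 200, 500, 1000]

# term -> document frequency, dumped from the content index
doc_freq = {}
with open("term_freqs.tsv") as f:
    for line in f:
        term, freq = line.rstrip("\n").split("\t")
        doc_freq[term] = int(freq)

with open("dictionary_raw.txt") as f:
    raw_words = [w.strip() for w in f if w.strip()]

for threshold in thresholds:
    kept = [w for w in raw_words if doc_freq.get(w, 0) >= threshold]
    with open("dictionary_min_df_%d.txt" % threshold, "w") as out:
        out.write("\n".join(kept))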

I then needed a query set to test with. I extracted 10k queries from our TestSearchSatisfaction2 schema. Each query is the first query of a search session, so this is before the user has had a chance to see their error / lack of relevant results and correct their query. The query used was:

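-- First searchResultPage event of each search session on enwiki since 2016-02-21,
-- i.e. the session's initial query, sampled at random.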
select event_query
  from log.TestSearchSatisfaction2_14098806 tss
  join (select event_searchSessionId, min(timestamp) as timestamp
          from log.TestSearchSatisfaction2_14098806
         where timestamp > 20160221000000
           and wiki='enwiki'
           and event_action='searchResultPage'
         group by event_searchSessionId
       ) x 
    on tss.event_searchSessionId = x.event_searchSessionId 
   and tss.timestamp = x.timestamp 
 order by rand() desc 
 limit 10000;

I've now started running these through relevance lab. The first result set has completed, with a min doc freq of 1. This would be our largest possible increase in recall. It reduced the zero result rate of my 10k query sample from 21.5% to 10.3%, a drop of 11.2 percentage points (roughly half). This is a pretty huge reduction, but we don't yet have great ways to measure whether the increased recall has had a negative effect on precision. Will run the other sets and post up a table of how the zero result rate changes based on "better", or at least more discriminating, dictionaries.

I have more complete numbers from last night's run in relevance lab, but sadly it turns out the queries I rewrote are being adjusted by CirrusSearch into not quite what I intended. foo (bar OR baz) becomes foo \(bar OR baz\). This means the OR statement is being applied at a higher level than just (misspelled OR corrected). I'm going to hack something together to allow testing this as envisioned, so elasticsearch can apply the search as intended.
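
For reference, a minimal sketch of the query structure intended here, assuming a query_string query sent directly to elasticsearch rather than going through CirrusSearch's escaping; the example terms come from the list above and the field defaults are left to elasticsearch.

# Intended behaviour: AND across terms, OR only inside the rewritten group.
intended = {
    "query": {
        "query_string": {
            "query": "(burnie OR bernie) sanders",
            "default_operator": "AND",
        }
    }
}
# With default_operator AND, "(burnie OR bernie) sanders" requires sanders plus
# either spelling, which is the (misspelled OR corrected) behaviour the rewrite wants.
# Escaping the parens instead turns them into literal tokens and lifts the OR to the
# top level of the query.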

It does show that we can get some dramatic reductions in ZRR by relaxing the AND requirement, although that's not entirely unexpected. The best set I ran yesterday had a ZRR of 7.2%.

EBernhardson renamed this task from "Run a test in relevancy lab to estimate effects of rewriting misspelled queries" to "Run a test in relevance forge to estimate effects of rewriting misspelled queries". · Mar 9 2016, 10:36 PM
Deskana subscribed.

@EBernhardson started work on this, but had to put it back into the backlog due to other work being prioritised, so I'm unassigning him. It might be worth pinging him if someone picks this up.

Dropping this one from the current sprint. It's very interesting but turning into a time-consuming thing. Once we are closer to having the Q4 goals finished we will bring this back.