
Analysis of Method 1 Suggestion results
Closed, Resolved (Public)

Description

Gather suggestion output from Elastic-based suggestions and Method 1 suggestions for a collection of data, and analyze the results.

When we did this for M0, we used 2 months of enwiki data to build the model and evaluated the results on 1 month of enwiki data. Something similar would be fine this time, too.

Analysis will include counting how often Elastic-based suggestions are made, how often Method 1 suggestions are made, and how often both are made, plus a manual review of a sample of the cases where both are made to see which does better. This is the same as what we did for M0.
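
The counting step could be sketched roughly as below. This is only an illustration: the row layout (query, Elastic suggestion, Method 1 suggestion), the function name `tally_suggestions`, and the sample rows are all made up here, and the real sample file may be organized differently.

```python
from collections import Counter

def tally_suggestions(rows):
    """Count queries by which suggester(s) offered a suggestion.

    Each row is a (query, elastic_suggestion, method1_suggestion) tuple
    (a hypothetical layout); None or "" means no suggestion was made.
    """
    counts = Counter()
    for _query, elastic, method1 in rows:
        if elastic and method1:
            counts["both"] += 1          # candidates for manual review
        elif elastic:
            counts["elastic_only"] += 1
        elif method1:
            counts["method1_only"] += 1
        else:
            counts["neither"] += 1
    return counts

# Made-up rows for illustration only; not taken from the real sample.
rows = [
    ("query a", "suggestion a", None),
    ("query b", "suggestion b1", "suggestion b2"),
    ("query c", None, "suggestion c"),
    ("query d", None, None),
]
counts = tally_suggestions(rows)
```

The "both" bucket is the one that would feed the manual review.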

There is some concern that Method 1 results may be lower quality for shorter strings. If that looks to be a problem, because of a high volume of Method 1 suggestions for shorter queries, their lower quality, or both, we may look into shorter queries more carefully.

Event Timeline

TJones renamed this task from Analysis of M1 results to Analysis of M1 Suggestion results. Sep 12 2019, 4:50 PM
TJones renamed this task from Analysis of M1 Suggestion results to Analysis of Method 1 Suggestion results.
TJones created this task.
TJones moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
TJones updated the task description.

Samples are available here: notebook1004:/home/dcausse/phrase_suggester_vs_glent_m1.csv

I completed my analysis of Method 1, and it performs significantly worse than the current production DYM. I think we should improve Method 1 before considering an A/B test. Full details on MediaWiki.

Summary:

Method 1 Anti-Patterns:

  • over-emphasis of result counts:
    • creating negated queries, like fogus to -ous, which gets 5.9M results.
    • changing letters or adding spaces to create a very common word (cf gene to a gene) or a duplicated word (rattle battle to battle battle).
  • overly drastic changes:
    • edit distance limits should be per-token, not per-string (cf gene to a gene again)
    • changing a letter to a space should have a higher cost (abbys to a b s)
    • changing the first letter of a word/token should have a higher cost (cia assassinations to mi6 assassinations)
  • using weird stemming edge cases to increase result counts:
    • e.g., godness stems to god so it beats goddess; hering stems to here so it replaced herring in red herring
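
To make the per-token vs. per-string point concrete, here is a minimal sketch using plain Levenshtein distance. It reads cf gene as the original query, and the limit of one edit per token, the function names, and the rule that a changed token count is rejected are all assumptions for illustration, not a description of how Method 1 works.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def within_per_token_limit(query, suggestion, max_edits_per_token=1):
    """Accept a suggestion only if every token is within the edit limit.

    Assumption: suggestions that change the number of tokens are rejected
    outright, which is a simplification (it also blocks legitimate
    space insertions or removals).
    """
    q_tokens = query.split()
    s_tokens = suggestion.split()
    if len(q_tokens) != len(s_tokens):
        return False
    return all(levenshtein(q, s) <= max_edits_per_token
               for q, s in zip(q_tokens, s_tokens))
```

With a per-string limit of 2, cf gene to a gene passes (the whole-string distance is 2), but both edits land on the two-letter token cf, so the per-token check rejects it. It also rejects abbys to a b s because the token count changes. It does not catch rattle battle to battle battle, which is only one edit in one token, so that case would need a different guard.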

Reinforcing Positive Method 1 Patterns:

  • Edit distance cost should be decreased for double-letter to single-letter changes (or vice versa)
  • Edit distance cost should be decreased for swapped letters, possibly including letters swapped across one intervening letter (levasimole vs levamisole)
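
The cost adjustments suggested in the two lists above could be combined in a weighted edit distance along these lines. This is a sketch under stated assumptions: every weight is an arbitrary illustrative value, the function name is made up, and the doubled-letter check is a simple heuristic. It is a variant of the optimal-string-alignment distance, not whatever Method 1 currently uses.

```python
def weighted_distance(src, dst, base=1.0, double_letter=0.5, swap=0.5,
                      gapped_swap=0.75, space_penalty=1.0,
                      first_letter_penalty=1.0):
    """Weighted edit distance from src to dst (all weights are illustrative)."""
    n, m = len(src), len(dst)

    def sub_cost(i, j):
        a, b = src[i - 1], dst[j - 1]
        if a == b:
            return 0.0
        cost = base
        if a != " " and b == " ":
            cost += space_penalty          # letter changed to a space
        if a != " " and (i == 1 or src[i - 2] == " "):
            cost += first_letter_penalty   # first letter of a token changed
        return cost

    def del_cost(i, j):
        # Cheaper when dropping one letter of a doubled letter (dd -> d).
        doubled = i >= 2 and j >= 1 and src[i - 1] == src[i - 2] == dst[j - 1]
        return double_letter if doubled else base

    def ins_cost(i, j):
        # Cheaper when doubling a letter that is already there (d -> dd).
        doubled = j >= 2 and i >= 1 and dst[j - 1] == dst[j - 2] == src[i - 1]
        return double_letter if doubled else base

    # d[i][j] is the cheapest cost found for turning src[:i] into dst[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost(i, 0)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost(0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(d[i - 1][j] + del_cost(i, j),
                       d[i][j - 1] + ins_cost(i, j),
                       d[i - 1][j - 1] + sub_cost(i, j))
            # Adjacent swap: "ab" -> "ba".
            if (i >= 2 and j >= 2 and src[i - 1] == dst[j - 2]
                    and src[i - 2] == dst[j - 1]):
                best = min(best, d[i - 2][j - 2] + swap)
            # Swap across one unchanged letter: "sim" -> "mis".
            if (i >= 3 and j >= 3 and src[i - 1] == dst[j - 3]
                    and src[i - 3] == dst[j - 1] and src[i - 2] == dst[j - 2]):
                best = min(best, d[i - 3][j - 3] + gapped_swap)
            d[i][j] = best
    return d[n][m]
```

With these weights, godess to goddess costs 0.5 and levasimole to levamisole costs 0.75, while cia to mi6 costs 3.0 instead of the plain distance of 2. One caveat: because a delete plus an insert costs 2 × `base`, raising a substitution penalty above `base` has no further effect unless insert and delete costs are raised as well.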

I realize that I've assumed that edit distance plays a role in the weighting of suggestions, but I'm not sure that's the case. If not, it probably should be, rather than letting result count reign supreme.
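
As a rough illustration of what letting edit distance temper result counts could look like, here is a hypothetical scoring sketch. The formula (log-scaled result count minus a per-edit penalty), the weight, and the candidate data are all assumptions made up for this example; I don't know how suggestions are actually weighted.

```python
import math

def score(result_count, edit_distance, distance_weight=3.0):
    """Hypothetical ranking score: log-scaled hits minus a distance penalty."""
    return math.log10(result_count + 1) - distance_weight * edit_distance

def best_suggestion(candidates, distance_weight=3.0):
    """Pick the top candidate from (suggestion, result_count, edit_distance)."""
    return max(candidates,
               key=lambda c: score(c[1], c[2], distance_weight))[0]

# Made-up candidates: a distant suggestion with a huge result count and a
# close suggestion with a modest one.
candidates = [
    ("distant suggestion", 5_900_000, 2),
    ("close suggestion", 10_000, 1),
]
```

With `distance_weight=0.0` the score reduces to result count alone and the distant suggestion wins; with the (arbitrary) default weight the closer suggestion wins instead.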