
Tune Glent Method 1 algorithm
Closed, Resolved · Public

Description

As discovered in T232760, Method 1 (M‍1) suggestions can and should be improved before deployment or A/B testing. Tasks include:

  • Improving the edit distance computation:
    • special case costs
      • discount cost for swaps (ab vs ba)
      • discount cost for duplicate letters (ab vs abb)
      • add increased cost for editing digits
    • take tokens into account
      • add per token edit limits
      • add a penalty for differing numbers of tokens
      • decreased token delta penalty if strings differ only by spaces
      • add a penalty for changing the first letter of a token
        • except for letter/space swaps (ab cdef vs abc def)
    • Optimizations
      • early termination when over the edit distance limit
      • early termination when over the token delta limit
      • Investigate using Jaccard similarity on characters in the strings to terminate early (can't use just string length because of duplicate letter discount); see the sketch after this task list
    • ensure support for 32-bit characters. (Ugh.)
  • Add docs to Glent repo
  • optimize penalties and discounts for the various cases above (based on M‍1 training data)
  • Incorporate edit distance into M‍1 suggestion selection
    • Investigate weighting of edit distance vs number of results
    • Review SuggestionAggregator.java / score() for best way to combine M0, M‍1, M‍2
  • Improve suggestion intake
    • Filter M‍1 queries with search syntax in them (esp. negated queries)—T247469
    • Set up minimal language analysis (standard tokenizer + ICU norm / lowercase) to queryNorm
      • Review what ICU normalization covers and doesn't cover
    • Deduplicate characters in queries (since duplicate letters are discounted)—dependent on T247898
  • Add filter at output (suggestion aggregator) to normalize chosen query (punct, whitespace, lowercase); re-use minimal language analysis from queryNorm [Dropping this, as there are some small potential pitfalls, and choosing suggestions partly based on frequency solves most of the problem.]
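For the Jaccard item above, a minimal sketch of the idea (illustration only; the class name and the 0.5 cutoff are placeholders, not anything in Glent): compute the Jaccard similarity of the sets of characters in the two strings and skip the full edit distance computation when the overlap is too small.

import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

final class JaccardPrefilterSketch {
    // Jaccard similarity of the sets of code points in the two strings.
    static double charJaccard(String a, String b) {
        Set<Integer> sa = a.codePoints().boxed().collect(Collectors.toSet());
        Set<Integer> sb = b.codePoints().boxed().collect(Collectors.toSet());
        Set<Integer> union = new HashSet<>(sa);
        union.addAll(sb);
        Set<Integer> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Skip the expensive edit distance computation when the character sets
    // barely overlap; the 0.5 cutoff is a placeholder to be tuned.
    static boolean worthComputing(String a, String b) {
        return charJaccard(a, b) >= 0.5;
    }
}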

Event Timeline

I should have created this ticket a while back. I've been working on this for a while!

TJones updated the task description.

I've implemented most of the desired features (per-token limits are the big one left), and I'm seeing better-looking results so far.

I've hacked together a poor version of tokenization that splits on whitespace and punctuation (we should use the Elastic language analyzer for tokenization, but this works for my tests).

I'm seeing improvements already, so I'll highlight some of them:

  • Able to detect things that aren't actually different:
    • cgas/sting vs cgas sting
    • coburg oregon vs coburg, oregon
    • queries with a different number of spaces (e.g., two spaces between words instead of one)
  • Duplicate characters are cheaper
    • agripinna vs agrippina has a cost of 0.1 instead of 2.0
    • depp learnning vs deep learning has a cost of 0.15 instead of 3.0
  • Inserting or deleting only spaces/token boundaries is cheaper
    • mack th eknife vs mack the knife has a cost of 0.2 instead of 2.0
    • coolmathgames vs cool math games has a cost of 0.2 instead of 2.0
    • adh hormone vs a.d.h hormone has a cost of 0.2 instead of 2.0
  • Swapping a letter and space is more expensive
    • hassan vs has an has a cost of 1.5 instead of 1.0
  • Changing the first letter of a word is more expensive
    • barbie and ken vs barbie and ben has a cost of 1.25 instead of 1.0
  • Swaps are cheaper
    • 100 oldest pepole vs 100 oldest people has a cost of 1.25 instead of 2.0
  • Changing numbers is more expensive
    • 2018 figure skating vs 2010 figure skating has a cost of 1.33 instead of 1.0
    • 6th gen ipad vs 4th gen ipad has a cost of 1.55 instead of 1.0
      • for a letter, the cost in the new system would be 1.25 (see barbie and ken above)
  • Number penalties and other discounts interact
    • georgia senate 2020 vs georgia senate 2002 has a cost of 1.58 instead of 2.0
      • for letters, the cost of the swap in the new system would be 1.25 (see hassan above)
    • big brother 2000 vs big brother 20 has a cost of 0.76 instead of 2.0
      • for letters, the cost of the duplicates would be 0.1 (see agripinna above)
    • The goal is to still allow these kinds of edits, but to prefer corrections to the words rather than the numbers.
  • Some undesirable matches also have higher costs
    • sports in 1800s vs sports in 1990s has a cost of 2.09 instead of 2.0
    • 11 attacks vs is attacks and 1960s riots vs 1960s hits and axial velocity vs radial velocity and billy mckenzie vs kelly mckenzie each have a cost of 2.25 instead of 2.0
    • 1500m world record vs 100 world record and 190 cm vs 10 com and 1949 uk vs 1999 us each have a cost of 2.33 instead of 2.0
    • aleination vs all nation and alex d vs a e d and maccarena vs man arena each have a cost of 2.5 instead of 2.0
    • h(z) vs (a) has a cost of 4.0 instead of 2.0
    • c3 c4 plants vs c3  plants (with two spaces) has a cost of 4.08 instead of 2.0
    • the 43 vs the ( has a cost of 4.41 instead of 2.0
    • g.e.o vs g.. has a cost of 5.5 instead of 2.0

All of these are using heuristic discounts and penalties that I made up. I will be optimizing the values later.
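For illustration, a stripped-down sketch of how such discounts plug into a weighted edit distance (this is not the Glent class and omits all the token-level bookkeeping; the costs are placeholders in the same spirit as the heuristic values above):

final class WeightedEditDistanceSketch {
    // Placeholder costs, not the tuned Glent values.
    static final float INS_DEL = 1.0f, SUBST = 1.0f, SWAP = 1.25f, DUPLICATE = 0.1f;

    static float distance(String a, String b) {
        int n = a.length(), m = b.length();
        float[][] d = new float[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + INS_DEL;
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + INS_DEL;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                float subst = d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0f : SUBST);
                // Inserting or deleting a copy of the preceding letter is cheap
                // (the duplicate-letter discount: ab vs abb).
                float del = d[i - 1][j]
                        + (i > 1 && a.charAt(i - 1) == a.charAt(i - 2) ? DUPLICATE : INS_DEL);
                float ins = d[i][j - 1]
                        + (j > 1 && b.charAt(j - 1) == b.charAt(j - 2) ? DUPLICATE : INS_DEL);
                float best = Math.min(subst, Math.min(del, ins));
                // Adjacent transposition (ab vs ba) at a discount relative to
                // two substitutions.
                if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + SWAP);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        // Duplicate-letter and swap edits come out cheaper than plain Levenshtein:
        System.out.println(distance("agripinna", "agrippina"));                 // 0.2 vs 2.0
        System.out.println(distance("100 oldest pepole", "100 oldest people")); // 1.25 vs 2.0
    }
}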

Change 559560 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] [WIP] Add Token-Aware Edit Distance Class

https://gerrit.wikimedia.org/r/559560

I've got the first fully-functional version—the per-token bookkeeping was a pain!—up on Gerrit for review. I'll be adding tests and docs, and working on choosing the best parameters next.

Change 559560 merged by jenkins-bot:
[search/glent@master] Add Token-Aware Edit Distance Class

https://gerrit.wikimedia.org/r/559560

TJones updated the task description.

I've finished optimizing the query parameters; some small tweaking probably remains. For comparison, the original M1 query examples had 28% precision for "non-negative" ("meh" and better) results, and the production DYM suggester had 54% precision.

The results for the optimized M1 with the new edit distance (on 125 queries with ~6K suggestions total) depend on the F-score weighting used for tuning (a quick F_β reference follows the list):

  • F_1.00: 79% precision, 71% recall—recall and precision weighted equally
  • F_0.50: 81% precision, 65% recall—precision 2x as important
  • F_0.33: 90% precision, 50% recall—precision 3x as important
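For reference, the F_x.xx values above are presumably the standard F_β measure, where β < 1 weights precision more heavily (β = 0.5 makes precision count roughly twice as much as recall, β = 0.33 roughly three times as much). A one-liner, in case it's useful:

// Standard F_beta: beta = 1 is the balanced F1; beta < 1 favors precision.
static double fBeta(double precision, double recall, double beta) {
    double b2 = beta * beta;
    return (1 + b2) * precision * recall / (b2 * precision + recall);
}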

Whatever settings we choose, it'll obviously be giving better suggestions on enwiki!

Write-up is forthcoming.

Change 578661 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] Add Token-Aware Edit Distance Docs

https://gerrit.wikimedia.org/r/578661

Change 578661 merged by jenkins-bot:
[search/glent@master] Add Token-Aware Edit Distance Docs

https://gerrit.wikimedia.org/r/578661

TJones triaged this task as High priority. (Mar 11 2020, 8:33 PM)
TJones updated the task description.
TJones removed a project: Patch-For-Review.

That last patch merged the docs for the token-aware edit distance class.

Next step is to open more tickets for some of the sub-tasks on this ticket.

We haven't yet figured out all the integration details for the tokenization, but here is the configuration for the edit distance builder with the relevant params for everything but the tokenization.

TokenAwareEditDistance M1EditDist = new TokenAwareEditDistance(
    EDConfig.Builder.newInstance()
        .setDefaultLimit(1.76f)          // overall edit distance limit
        .setDefaultNormLimit(0.3f)       // limit on the normalized edit distance
        .setPerTokenLimit(true)          // apply edit limits per token
        .setNormType(NormType.MIN)
        .setInsDelCost(0.84f)
        .setSubstCost(0.92f)
        .setSwapCost(0.82f)              // discounted swaps (ab vs ba)
        .setDuplicateCost(0.60f)         // discounted duplicate letters (ab vs abb)
        .setSpaceOnlyCost(0.68f)         // cheaper space-only insertions/deletions
        .setDigitChangePenalty(0.26f)    // extra cost for editing digits
        .setTokenDeltaPenalty(0f)        // penalty for differing token counts (0 = disabled here)
        .setTokenInitialPenalty(2.00f)   // penalty for changing the first letter of a token
        .setTokenSepSubstPenalty(0.54f)  // penalty when a substitution involves the token separator
        .setTokenSep(' ')
        // defaults below
        //.setLocale(Locale.ENGLISH)
        //.setTokenSplit("[\\p{Z}\\p{P}\\p{S}]+")
        //.setTokenizer(null)
        .build());
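A hypothetical usage line, just to show where the built instance fits in (the actual method name and signature on TokenAwareEditDistance may differ; check the class in the Glent repo):

// Hypothetical call; the real method name/signature may differ.
float cost = M1EditDist.distance("depp learnning", "deep learning");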

TJones updated the task description.

At @dcausse's suggestion, I reviewed the changes made by ICU normalization to make sure that it is appropriate for use in Glent.

I reviewed the BMP, SMP, and SIP (U+0000–U+2FFFF)—193,019 characters in all.

Most importantly for Glent:

  • Java encoding of high/low surrogate pairs, like 𤋮 (U+FA6C) → "\uD850\uDEEE", needs to be addressed. Converting one character to the 12 characters of its escaped form is more than a little crazy, and it would also lead to all sorts of false similarities, as we saw in T168427.
  • I don't love the conversion of the stand-alone diacritical characters to space+combining form, but it's probably a low-frequency problem.

Overall, if we can handle the Java encoding, we'll be fine. I think I prefer using the character filter over the token filter to get a more consistent treatment of similar characters by the tokenizer.
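To make the surrogate-pair point concrete, a small stand-alone illustration (plain Java, not Glent code) of why char-based processing miscounts supplementary characters and why iterating by code point keeps the pairs intact:

public final class SurrogateSketch {
    public static void main(String[] args) {
        // U+242EE, the normalized form of the compatibility ideograph U+FA6C
        String s = new String(Character.toChars(0x242EE));
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (actual characters)
        // Iterating by code point avoids splitting the surrogate pair:
        s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}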

Full report: What the Heck Does ICU Normalization Do, Anyway?

I've added a sub-page with the full list of characters mapped by ICU normalization in Elasticsearch, for those who are curious, or who want to test their font rendering abilities.

Interestingly, MediaWiki does some of the same conversion automatically, so many characters got converted just by pasting them into the wiki. However, I marked those, and every character has a U+xxxx formatted codepoint with a link to FileFormat.Info, which gives all the gory details for each character.

As part of my 10% time, I pulled the token-aware edit distance algorithm out of Glent, wrote a simple command-line driver for it, and created a repo on GitHub.

Change 593866 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] Add ICU Tokenizer and Retokenizer Class to Glent for M1

https://gerrit.wikimedia.org/r/593866

After looking at M‍1 examples in English, French, and Korean, I've come up with a better scoring function for choosing the suggestion: suggestion_score = log(f*f*h′)/3 – ed

The parts of the formula, using the example query motorgead, are below:

  • f = frequency of suggestion (motorhead: 721; motored: 161; motorhead\: 9)
  • h = # hits for suggestion (motorhead: 1985; motored: 115,834; motorhead\: 1982)
    • note that diffs between motorhead and motorhead\ are probably chronological
  • h′ = min(10_000, h) + h/1000 (motorhead: 1985; motored: 10,115.8; motorhead\: 1982)
    • this maxes out the effects of # hits at ~10K; the h/1000 term keeps values sorted by size if everything else is actually equal.
  • ed = edit distance from query (motorhead: 0.92; motored: 1.68; motorhead\: 0.92)
  • suggestion_score = log(f*f*h′)/3 – ed (motorhead: 2.085; motored: 1.126; motorhead\: 0.815)
    • the final score emphasizes the frequency of the suggestion, downplays really large numbers of results, and penalizes large edit distance (and smooths over small differences in the weights used in the edit distance scoring)

Note: scale-based increases in frequency (say, doubling of traffic or doubling of the length of time we keep data) have no effect on suggestion ranking. Doubling all frequencies increases all scores by ~0.2, for example.

We may talk about changing the normalization of h′ for smaller wikis in the future. For hit counts below 10K, scale-based changes don't change ranking, but for hit counts that cross that threshold in either direction, they can.
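For concreteness, the scoring function as a small sketch (the method name is mine; the log is base 10, which reproduces the worked values above, e.g. motored: log10(161^2 * 10,115.8)/3 - 1.68 ≈ 1.126):

// f = suggestion frequency, h = suggestion hit count, ed = edit distance from the query.
static double suggestionScore(long f, long h, double ed) {
    double hPrime = Math.min(10_000L, h) + h / 1000.0;      // cap the effect of hit count at ~10K
    return Math.log10((double) f * f * hPrime) / 3.0 - ed;  // log base 10
}

// suggestionScore(721, 1985, 0.92)    ≈ 2.08  (motorhead)
// suggestionScore(161, 115_834, 1.68) ≈ 1.13  (motored)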

Change 593866 merged by jenkins-bot:
[search/glent@master] Add ICU Tokenizer and Retokenizer Class to Glent for M1

https://gerrit.wikimedia.org/r/593866

While trying to implement the suggestion_score above (which I've renamed frhedScore, for suggestion frequency, suggestion hit count, and edit distance), I ran into the problem that it isn't very compatible with the formula for combining M0, M1, and M2 scores (in part because it can be negative, which interacts poorly with a big multiplicative scaling factor). The current combo score is:

Screen Shot 2020-05-20 at 5.02.53 PM.png (35 KB)

I propose using frhedScore for M0, faking a frhedScore for M2 (which has no frequency or hit count information), and using an additive scheme for merging them that allows a very high score from a less-preferred method to beat a very low score from a more-preferred method:

Screen Shot 2020-05-20 at 5.03.19 PM.png (77 KB)

More details and explanation on MediaWiki.

If this sounds like a good idea, I'll look a little harder at M0 and M‍2 suggestions and make sure this looks good for them.

After some discussion with @dcausse and @EBernhardson we remembered and/or decided that M‍2 (for the CJK languages) should not interact with M0 and M‍1—so we'll only be using (M0 and M‍1) OR M‍2—but not all three.

There have been some distractions and other priorities since then, but the modified proposal is below:

methodScore.png (43 KB)

The -5 adjustment would not apply to M‍2 since it should always be in the 8–10 range.

I suggest changing M2 to 10 – ed rather than (10 – ed) * 10^5, both to save a multiplication and so that all methodScores have roughly the same order of magnitude, within [-10, 10].

Updated details and explanation on MediaWiki.

I'll look a little harder at M0 suggestions and make sure this looks good for them.

Update: M0 and M1 won't work the same way because of the different frequency counts for M0. Also, the edit distance configuration for M1 is more restrictive than the default, and M0 works better with the less restrictive config. Rather than come up with a custom edit distance config and a new frhedScore for M0, I'm going to use the current frhedScore for M0 (which will favor more frequent corrections when they exist) and let M0 always beat M1. More details on MediaWiki.

scorefunc.png (302×898 px, 56 KB)
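For the record, a hedged sketch of the selection behavior described in the comments above; the actual methodScore formulas are in the screenshots and the MediaWiki write-up, so this only captures the stated rules (CJK wikis use M2 alone, other wikis use M0 and M1, and M0 always beats M1):

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class MethodSelectionSketch {
    record Scored(String suggestion, double score) {}

    static Optional<Scored> pick(List<Scored> m0, List<Scored> m1,
                                 List<Scored> m2, boolean cjkWiki) {
        if (cjkWiki) {
            return best(m2);                 // M2 only; it doesn't interact with M0/M1
        }
        Optional<Scored> fromM0 = best(m0);  // M0 always beats M1 when it has a candidate
        return fromM0.isPresent() ? fromM0 : best(m1);
    }

    static Optional<Scored> best(List<Scored> candidates) {
        return candidates.stream().max(Comparator.comparingDouble(Scored::score));
    }
}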

Change 626225 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] Modify M0 Scoring and Score Combination

https://gerrit.wikimedia.org/r/626225

Change 626494 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] Add option to use ICU tokenization for query normalization

https://gerrit.wikimedia.org/r/626494

TJones updated the task description.
TJones moved this task from Waiting to Needs review on the Discovery-Search (Current work) board.

Change 626494 merged by jenkins-bot:
[search/glent@master] Add option to use ICU tokenization for query normalization

https://gerrit.wikimedia.org/r/626494

Change 626225 merged by jenkins-bot:
[search/glent@master] Modify M0 Scoring and Score Combination

https://gerrit.wikimedia.org/r/626225

It looks like we are currently running glent 0.2.3 which includes the patches referenced above. Checked the attached patches and it looks like everything is shipped. Should we close this and move on to figuring out how we want to put it in front of users?

Yep! Thanks, Erik!