As discovered in T232760, Method 1 (M1) suggestions can and should be improved before deployment or A/B testing. Tasks include:
- Improving the edit distance computation (a sketch follows this sub-list):
  - special-case costs
    - discount the cost for swaps (ab vs ba)
    - discount the cost for duplicate letters (ab vs abb)
    - add an increased cost for editing digits
  - take tokens into account
    - add per-token edit limits
    - add a penalty for differing numbers of tokens
      - decrease the token delta penalty if the strings differ only by spaces
    - add a penalty for changing the first letter of a token
      - except for letter/space swaps (ab cdef vs abc def)
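A minimal sketch of the character-level costs above, as a standard dynamic-programming edit distance with the swap discount, duplicate-letter discount, and digit penalty bolted on. The class name and all weights are hypothetical placeholders (real values would come out of the training-data tuning below), and the token-level penalties would be layered on top after tokenizing:

```java
final class WeightedEditDistance {
    // Hypothetical weights, to be tuned on M1 training data.
    private static final double SWAP_COST = 0.75;  // discount: "ab" vs "ba"
    private static final double DUP_COST = 0.25;   // discount: "ab" vs "abb"
    private static final double DIGIT_MULT = 2.0;  // increased cost for editing digits

    static double distance(String a, String b) {
        // Iterate by code point, not char, so characters outside the BMP
        // (the "32-bit characters" below) are treated as single units.
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        double[][] d = new double[s.length + 1][t.length + 1];
        for (int i = 1; i <= s.length; i++) d[i][0] = d[i - 1][0] + insDelCost(s, i);
        for (int j = 1; j <= t.length; j++) d[0][j] = d[0][j - 1] + insDelCost(t, j);
        for (int i = 1; i <= s.length; i++) {
            for (int j = 1; j <= t.length; j++) {
                double sub = (s[i - 1] == t[j - 1]) ? 0.0
                        : ((Character.isDigit(s[i - 1]) || Character.isDigit(t[j - 1]))
                           ? DIGIT_MULT : 1.0);
                double best = Math.min(d[i - 1][j - 1] + sub,
                        Math.min(d[i - 1][j] + insDelCost(s, i),    // delete s[i-1]
                                 d[i][j - 1] + insDelCost(t, j)));  // insert t[j-1]
                // Adjacent transposition ("ab" vs "ba") at a discount.
                if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]) {
                    best = Math.min(best, d[i - 2][j - 2] + SWAP_COST);
                }
                d[i][j] = best;
            }
        }
        return d[s.length][t.length];
    }

    // Cost of inserting/deleting s[i-1]: cheap if it just repeats the
    // previous letter ("abb" vs "ab"), expensive if it is a digit.
    private static double insDelCost(int[] s, int i) {
        if (i > 1 && s[i - 1] == s[i - 2]) return DUP_COST;
        return Character.isDigit(s[i - 1]) ? DIGIT_MULT : 1.0;
    }
}
```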
- Optimizations (a sketch follows this sub-list)
  - early termination when over the edit distance limit
  - early termination when over the token delta limit
  - investigate using Jaccard similarity on the characters in the strings to terminate early (can't use string length alone, because of the duplicate-letter discount)
  - ensure support for 32-bit characters. (Ugh.)
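The two termination checks might look roughly like this (both thresholds are invented for the sketch):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

final class EarlyTermination {
    static final double EDIT_LIMIT = 3.0;   // hypothetical edit distance limit
    static final double MIN_JACCARD = 0.4;  // hypothetical pre-check threshold

    // Jaccard similarity over the sets of code points in each string. A plain
    // length comparison isn't safe here, because the duplicate-letter discount
    // lets strings of quite different lengths stay within the edit limit.
    static boolean worthComparing(String a, String b) {
        Set<Integer> sa = a.codePoints().boxed().collect(Collectors.toSet());
        Set<Integer> sb = b.codePoints().boxed().collect(Collectors.toSet());
        Set<Integer> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<Integer> union = new HashSet<>(sa);
        union.addAll(sb);
        if (union.isEmpty()) return true;  // both strings empty
        return (double) inter.size() / union.size() >= MIN_JACCARD;
    }

    // Called after filling each row of the DP table: all costs are
    // non-negative, so once every cell in a row exceeds the limit (check two
    // consecutive rows if the swap discount can reach back a row), the final
    // distance must exceed it too, and the computation can be abandoned.
    static boolean rowExceedsLimit(double[] row, double limit) {
        double min = Double.POSITIVE_INFINITY;
        for (double cell : row) min = Math.min(min, cell);
        return min > limit;
    }
}
```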
- Add docs to Glent repo
- Optimize penalties and discounts for the various cases above (based on M1 training data)
- Incorporate edit distance into M1 suggestion selection (a scoring sketch follows this sub-list)
  - Investigate the weighting of edit distance vs. the number of results
  - Review SuggestionAggregator.java / score() for the best way to combine M0, M1, and M2
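Purely as an illustration of the weighting question, one plausible shape for an M1 score that trades edit distance off against result count; the formula and weights are invented here and are not what SuggestionAggregator.score() currently does:

```java
// Hypothetical scoring sketch: closer strings and more results both help,
// with diminishing returns on result count. Weights would be tuned on
// M1 training data.
static double m1Score(double editDistance, long resultCount) {
    double closeness = 1.0 / (1.0 + editDistance);  // 1.0 for identical strings
    double popularity = Math.log1p(resultCount);    // diminishing returns on hits
    return 0.7 * closeness + 0.3 * popularity;      // placeholder weights
}
```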
- Improve suggestion intake (a sketch follows this sub-list)
  - Filter out M1 queries with search syntax in them (esp. negated queries); see T247469
  - Add minimal language analysis (standard tokenizer + ICU normalization / lowercasing) to queryNorm
    - Review what ICU normalization covers and doesn't cover
  - Deduplicate characters in queries (since duplicate letters are discounted); dependent on T247898
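A sketch of the intake pieces, assuming Glent can use Lucene's ICU analysis module (StandardTokenizer and ICUNormalizer2Filter are real Lucene classes; the syntax pattern and the class name here are hypothetical):

```java
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

final class IntakeNormalization {
    // Crude, hypothetical pattern for search syntax (quotes, wildcards,
    // leading minus, keyword:value); a real filter for T247469 would need
    // to match the actual query syntax.
    private static final Pattern SEARCH_SYNTAX =
            Pattern.compile("[\"~*?]|(^|\\s)-\\S|\\w+:\\S");

    static boolean hasSearchSyntax(String query) {
        return SEARCH_SYNTAX.matcher(query).find();
    }

    // Minimal analysis chain: standard tokenizer + ICU normalization. The
    // default ICUNormalizer2Filter applies NFKC_CaseFold, which also folds
    // case, covering the "lowercase" step.
    static final Analyzer MINIMAL_ANALYZER = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream sink = new ICUNormalizer2Filter(source);
            return new TokenStreamComponents(source, sink);
        }
    };

    // Collapse runs of repeated code points ("pizzza" -> "piza"), matching
    // the duplicate-letter discount; gated on T247898.
    static String dedupe(String s) {
        StringBuilder sb = new StringBuilder();
        int prev = -1;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (cp != prev) sb.appendCodePoint(cp);
            prev = cp;
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```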
- Add a filter at output (in the suggestion aggregator) to normalize the chosen query (punctuation, whitespace, lowercasing); re-use the minimal language analysis from queryNorm. [Dropping this, as there are some small potential pitfalls, and choosing suggestions partly based on frequency solves most of the problem.]