
Eliminate old M2 suggestions with improper tokenization
Open · Needs Triage · Public · 8 Estimated Story Points

Description

User story: As a user of a CJK Wikipedia, I want only suggestions with proper token breaks so the suggestions get the right results.

T265081 fixes the M2 tokenization problems, but old suggestions with poor tokenization are still in the database tables. It's not entirely clear what the best way to fix them is. Some options:

  • Find some way to remove the old suggestions and repopulate with freshly generated suggestions that use the new code; may require database surgery or other excess cleverness
  • Double check that we have 90 days of data for each of the CJK languages and just delete all existing suggestions and start over collecting them, using the last 90 days' worth of data.
    • Do something extra clever and only delete queries and suggestions with spaces, or with spaces between non-CJK characters.

Acceptance Criteria:

  • M2 suggestion table no longer contains suggestions with spaces removed from between non-CJK tokens (Latin, Cyrillic, etc.)

Event Timeline

TJones added a subscriber: EBernhardson.

@EBernhardson, please take a look at the description and add/correct anything that needs it. Thanks!

All seems reasonable to me.

I suspect it's difficult, if even possible, but if we (ok, mostly you) can come up with a regex that identifies the wrong queries, that would be easy to apply as a one-time filter. Since m2 operates on queries in isolation (there is no cross-query interaction), we can probably put something together that re-runs the current version of the algorithm over everything in the rolling history.

Alternatively, now that I have a better understanding of how m2 works, it doesn't really even have to be an offline algorithm; I suspect you could implement a suggestion candidate generator in Elasticsearch that applies this algorithm to source queries in real time. Certainly not something for right now, though :)

> I suspect it's difficult, if even possible, but if we (ok, mostly you) can come up with a regex that identifies the wrong queries that would be easy to apply as a one time filter.

  1. The simplest regex would be /\s/, which would catch anything with a space in it. Based on the samples I pulled, I'd estimate ~10% of Chinese, ~20% of Japanese, and ~50% of Korean queries in the M2 table have spaces.
  2. A more restrictive regex would catch anything with a non-CJK character next to a space on either side: [^\p{Han}\p{Hangul}\p{Hiragana}\p{Katakana}\s]\s|\s[^\p{Han}\p{Hangul}\p{Hiragana}\p{Katakana}\s]. That matches 5% of Chinese, 11% of Japanese, and 10% of Korean queries.
  3. Even more restrictive would require a non-CJK character on both sides of one or more spaces: [^\p{Han}\p{Hangul}\p{Hiragana}\p{Katakana}]\s+[^\p{Han}\p{Hangul}\p{Hiragana}\p{Katakana}]. 1.5% Chinese, 6% Japanese, 3% Korean.
  4. The most restrictive would just look for ASCII letters and digits on each side of one or more spaces: [A-Za-z0-9]\s+[A-Za-z0-9]. That matches 1% of Chinese, 2.5% of Japanese, and 1% of Korean queries.

#1 would catch everything, and it's pretty easy as regexes go! Spaces seem a lot more common in Korean, so I'd prefer this one. #4 is very restrictive, but it would catch the most egregious examples, and with an eventual time-out it'd be okay. #2, #3, and #4 aren't aware of punctuation, so something..like..this..维基百科 could still get a suggestion like somethinglikethis维基百科, but such cases are much rarer.
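To make the comparison concrete, here's a quick sketch of options #1 and #4 as Java patterns, checked against the examples discussed above (class and method names are hypothetical):

```java
import java.util.regex.Pattern;

public class RegexOptions {
    // Option #1: anything containing whitespace at all
    static final Pattern OPT1 = Pattern.compile("\\s");

    // Option #4: ASCII letters/digits on both sides of one or more spaces
    static final Pattern OPT4 = Pattern.compile("[A-Za-z0-9]\\s+[A-Za-z0-9]");

    static boolean matches(Pattern p, String query) {
        return p.matcher(query).find();
    }
}
```

#1 flags both "b e d" and "19世紀 殖民越南", while #4 flags only "b e d"; neither flags the punctuation-separated something..like..this..维基百科, since there's no whitespace in it.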

BTW, if you use #2 or #3, check that your language of choice supports \p{Han}, etc. Java uses \p{IsHan}, for example. I can work out the ranges for them if that's better.
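For example, a minimal sketch of option #2 using Java's Is-prefixed script names (the class and method names here are hypothetical):

```java
import java.util.regex.Pattern;

public class NonCjkSpaceFilter {
    // Option #2 rewritten with Java's \p{IsXxx} script syntax:
    // a non-CJK, non-space character adjacent to whitespace on either side
    static final Pattern OPT2 = Pattern.compile(
        "[^\\p{IsHan}\\p{IsHangul}\\p{IsHiragana}\\p{IsKatakana}\\s]\\s"
        + "|\\s[^\\p{IsHan}\\p{IsHangul}\\p{IsHiragana}\\p{IsKatakana}\\s]");

    static boolean hasNonCjkTokenBreak(String query) {
        return OPT2.matcher(query).find();
    }
}
```

Java rejects the bare \p{Han} spelling with a PatternSyntaxException; \p{IsHan} and \p{script=Han} are the accepted forms.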

I'm not entirely sure if it's correct, but in theory ebernhardson.glent_suggestions/algo=m2run/date=20210313 should contain all the queries in the regular m2run history, but re-run with the updated algorithm. This is mostly a naive re-shaping of the historical data to look like it's a log and processing it that way.

The re-run only gives 188,623 suggestions, where the source had 270k suggestions. But the source only had 193k distinct source queries, while the re-run has 185k. We don't typically bother to collapse a historical suggestion with a new suggestion for the same thing, instead allowing the SuggestionAggregator that operates over all suggester implementations to do that part. Perhaps this reduction mostly comes from removing duplicates, plus dropping a few queries that now return themselves as the best variant.

I tried reviewing some of the changes, particularly the 7,870 queries that used to have suggestions but no longer do, and the cause isn't clear to me. For example, 俸納 used to suggest 奉納 but no longer does. This doesn't seem to match the patterns we are dealing with here; perhaps it was a previous bug that has since been fixed, and we are only now doing a rerun of historical m2?

I've placed all of the query,old_dym pairs that didn't generate new suggestions in stat1007.eqiad.wmnet:/home/ebernhardson/glent_m2_rerun_filtered.csv

Change 673155 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[search/glent@master] Run dictionary suggester against historical queries

https://gerrit.wikimedia.org/r/673155

> I tried reviewing some of the changes, particularly the 7,870 queries that used to have suggestions but no longer do, and it's not clear to me. For example 俸納 used to suggest 奉納 but doesn't any more. This doesn't seem to match the patterns we are dealing with here, but perhaps that was a previous bug that was fixed and we are only now doing a rerun of historical m2?

> I've placed all of the query,old_dym pairs that didn't generate new suggestions in stat1007.eqiad.wmnet:/home/ebernhardson/glent_m2_rerun_filtered.csv

I started by looking at these, since there's obviously a chance something fishy is going on. I took some (non-random) examples from your CSV file and ran them as unit tests to see how they behaved. I had to guess the language for each of them, and that led to some interesting results, too.

I have a few answers, a few theories, and a few questions....

First off, 俸納 doesn't get any suggestions in Chinese, but it does get 奉納 as a suggestion in Japanese. They use the same confusion tables, but different frequency tables. Which brings up a few thoughts:

Could you re-generate glent_m2_rerun_filtered.csv with the language used?

Is there any chance you ran everything under one language? (If you ran everything as Chinese, it might explain some of the filtering. I doubt it, but it's worth asking, just in case.)

Do you have any recollection of early data having been run with a mixed frequency table?

There were a few examples of two-character CJK queries that got suggestions in one language, but not others.

Some examples genuinely should not have gotten suggestions. The query "b e d" previously got the suggestion "bad", but now, with the spaces preserved, it can't get suggestions. There were only a few of these.

Lots of others got suggestions that matched their old suggestions, just with spaces added. As an example, "19世紀 殖民越南" used to get "19世纪殖民越南" (in Chinese) and now it gets "19世纪 殖民越南" (with the space in the expected place).

So, there does seem to be something hinky going on.

Are there any later steps that might filter things? Both 19世纪殖民越南 and 19世纪 殖民越南 get the same results (664) on Chinese Wikipedia, so that shouldn't be it. Hmm. 19世紀 殖民越南 gets 650 results, so maybe the numbers changed and the suggestion isn't getting enough results to be considered better?

I gotta run for today, but I thought I'd dump what I found so far here in case you have more time to look.

> First off, 俸納 doesn't get any suggestions in Chinese, but it does get 奉納 as a suggestion in Japanese. They use the same confusion tables, but different frequency tables. Which brings up a few thoughts:

> Could you re-generate glent_m2_rerun_filtered.csv with the language used?

I've additionally re-run glent using the code I shipped to Gerrit; the previous run was from WIP code, so slightly different. This new version gives suggestions for 182,549 queries, vs. the 188,623 the first rerun gave, so it's going the wrong direction. The count of queries with no suggestions went from 7,870 to 17,077, with no clear reason why. Almost certainly, though, my refactor changed something. I can only commit the test, not the data, due to PII, but I'll set up a test that runs a sample of queries through the recent glent commits, see where exactly things are changing, and verify that all those changes are expected. I can probably back out portions of the various refactors in pieces and figure out where it goes wrong.

first rerun: ebernhardson.glent_suggestions/algo=m2run/date=20210313
second rerun: ebernhardson.glent_suggestions/algo=m2run/date=20210317

And here is a CSV. While looking over the code I wrote for the last one, I found it has a variety of problems beyond not accounting for the wiki; it was also mis-reporting a ton of extra rows that weren't in my initial count. Here is a rerun of yesterday's, plus the new one for the second rerun attempt, if you want to poke at them.

first attempt: stat1007.eqiad.wmnet:/home/ebernhardson/glent_m2_first_rerun_missing.csv
second attempt: stat1007.eqiad.wmnet:/home/ebernhardson/glent_m2_second_rerun_missing.csv

> Is there any chance you ran everything under one language? (If you ran everything as Chinese, it might explain some of the filtering. I doubt it, but it's worth asking, just in case.)

I don't think so; everything is isolated per-row, and the rows provide the language info.

> Do you have any recollection of early data having been run with a mixed frequency table?

I don't remember ever fixing anything there. It could have happened if incorrect language data was ever loaded into the canonical_data.wikis table; I'm not aware of that happening, but it's not impossible.

> Lots of others got suggestions that matched their old suggestions, just with spaces added. As an example, "19世紀 殖民越南" used to get "19世纪殖民越南" (in Chinese) and now it gets "19世纪 殖民越南" (with the space in the expected place).

> So, there does seem to be something hinky going on..

> Are there any later steps that might filter things? Both 19世纪殖民越南 and 19世纪 殖民越南 get the same results (664) on Chinese Wikipedia, so that shouldn't be it. Hmm. 19世紀 殖民越南 gets 650 results, so maybe the numbers changed and the suggestion isn't getting enough results to be considered better?

This algo doesn't ever take the hits into account; it remembers the number of hits the source query had but never uses it. There is additional filtering happening on the dym, but it shouldn't affect this (hopefully?):

  • The dym is discarded if dym.equals(String.join("", tokens)). This should only happen when the dym doesn't end up suggesting anything
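As a sketch of that discard rule (the class and method names here are hypothetical; only the dym.equals(String.join("", tokens)) check is from the actual code):

```java
import java.util.List;

public class DymFilterSketch {
    // A dym identical to the input tokens joined without separators
    // (i.e. the original query minus its spaces) suggests nothing new,
    // so it is discarded.
    static boolean keep(String dym, List<String> tokens) {
        return !dym.equals(String.join("", tokens));
    }
}
```

So "19世纪 殖民越南" survives as a suggestion for tokens ["19世紀", "殖民越南"], while a dym that is just the tokens concatenated would be dropped.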

Change 673607 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] Update Glent CJK_CHAR_PAT

https://gerrit.wikimedia.org/r/673607

> first attempt: stat1007.eqiad.wmnet:/home/ebernhardson/glent_m2_first_rerun_missing.csv
> second attempt: stat1007.eqiad.wmnet:/home/ebernhardson/glent_m2_second_rerun_missing.csv

Thanks, these were very helpful!

Obviously I didn't check them all, but all of the entries in glent_m2_first_rerun_missing.csv that I did check contained a space, punctuation, symbols, or other characters (including: 〈/@“”‘’、·(),。.()+•?―】[]【+‧&.\=:;□!*:;=々・~!-◯●"・'^_−↑」゜「〔〕》~ㆍ☆|?*㎜][×/&Ⓦ¥♡○�—〜-) that would split tokens and thus plausibly explain why they didn't get suggestions.

There were also a bunch of Japanese queries with the KATAKANA-HIRAGANA PROLONGED SOUND MARK (ー) that weren't getting suggestions but should have been. I reviewed the IsHiragana and IsKatakana regex classes (and learned that they are different from InHiragana and InKatakana; what the heck?) and added a few additional characters to CJK_CHAR_PAT (patch above).
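For the curious, the Is vs. In distinction is script vs. block: U+30FC has Script=Common (it's shared between hiragana and katakana), but it sits inside the Katakana block, which is why the script-based classes miss it. A quick demonstration:

```java
public class ProlongedSoundMark {
    public static void main(String[] args) {
        String mark = "\u30FC"; // KATAKANA-HIRAGANA PROLONGED SOUND MARK (ー)

        // \p{IsKatakana} tests the Script property; U+30FC is Script=Common,
        // so the script class does not match it
        System.out.println(mark.matches("\\p{IsKatakana}"));

        // \p{InKatakana} tests the Unicode block (U+30A0–U+30FF),
        // which does contain U+30FC
        System.out.println(mark.matches("\\p{InKatakana}"));
    }
}
```

This prints false, then true, which matches the behavior described above: queries containing ー fell through the script-based CJK_CHAR_PAT.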

> glent_m2_second_rerun_missing.csv

I didn't see an obvious pattern among the extra queries here, so I think there's something not quite right in that group.

I've spent some time Friday and again today looking at the queries found in the second CSV but not the first. Everything I've looked at (only a few dozen) seems reasonable on closer inspection. The only particularly suspicious thing is a class of queries that don't have a dym in the rerun but produce the expected suggestion when I run them through the test suite. Since the suggestion algo itself still seems correct, I put together a test case that runs the whole suggester, simulating the input dataframes, and that still looks reasonable too. I'm not really finding an answer; I'm tempted to call it "good enough".

Change 673607 merged by jenkins-bot:
[search/glent@master] Update Glent CJK_CHAR_PAT

https://gerrit.wikimedia.org/r/673607

Change 673155 merged by jenkins-bot:

[search/glent@master] Run dictionary suggester against historical queries

https://gerrit.wikimedia.org/r/673155

Merged, released as glent-0.2.4. This release contains all the updates we've made to m2 in the last few months, as that work seems to be wrapping up.

Change 685039 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Bump glent to 0.2.4

https://gerrit.wikimedia.org/r/685039

Change 685039 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Bump glent to 0.2.4

https://gerrit.wikimedia.org/r/685039