Fix Glent M2 CJK suggestion tokenization
Closed, Resolved | Public | 3 Estimated Story Points

Description

User story: As a user of a CJK Wikipedia, I want suggestions with proper token breaks so the suggestions get the right results.

Notes: While reviewing Chinese data for T244800, I noticed that tokens in suggestions were being run together. This is sub-optimal for Chinese queries because users intentionally break up words to prevent incorrect tokenization (a holdover from the bigram days). As an example, a query like AB CD cannot be incorrectly tokenized as A BC D, but ABCD could. If the user searches for AB Cd, then AB CD (with a space) is a better suggestion than ABCD (without a space).

This is particularly terrible for Latin and other non-CJK tokens. Instead of a suggestion like john smith 探险家, we are generating johnsmith探险家, the Latin part of which will definitely not be tokenized correctly.

Update: After not finding the problem in the expected place (the Chinese analysis/tokenizer) I discovered that it's a problem for Japanese and Korean, too—I just didn't find it in my sample:

  • TOFU BEST ~ウチらのトーフビーツ~ → tofubestウチちのトーフビーツ
  • 이아코바 이탈렐리(Iakoba Taeia Italeli) → 이아코바이탈랠리iakobataeiaitaleli

It's also concatenating tokens across punctuation and not just spaces.
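To make the failure mode concrete, here is a minimal sketch in Python (not the Glent code, which is a Java project). `toy_tokenize`, `join_run_together`, and `join_preserving_breaks` are hypothetical names, and the regex tokenizer is only a stand-in for the real CJK analysis chain, assumed to split on whitespace and punctuation and to lowercase.

```python
import re


def toy_tokenize(text):
    """Hypothetical stand-in for an analyzer: split on anything that is not
    a word character (spaces, punctuation) and lowercase each token."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def join_run_together(tokens):
    """The behavior described in this task: token breaks are lost."""
    return "".join(tokens)


def join_preserving_breaks(tokens):
    """The desired behavior: keep a space where the tokenizer found a break."""
    return " ".join(tokens)


tokens = toy_tokenize("John Smith 探险家")
print(join_run_together(tokens))       # johnsmith探险家
print(join_preserving_breaks(tokens))  # john smith 探险家
```

The same applies to breaks at punctuation: with this toy tokenizer, the parentheses in the Korean example produce token boundaries that the run-together join then discards.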

Acceptance criteria: CJK M2 analysis chain passes unit tests with Latin and/or non-CJK tokens that are not run together.

NB: This should probably be completed and deployed before any A/B test is run for M2, though the problem is not super common.

Event Timeline

TJones renamed this task from Review Chinese Analysis Chain for Glent M2 to Fix Chinese Analysis Chain for Glent M2. Oct 19 2020, 5:17 PM
Gehel triaged this task as High priority. Oct 28 2020, 1:28 PM
TJones renamed this task from Fix Chinese Analysis Chain for Glent M2 to Fix Glent M2 CJK suggestion tokenization. Mar 3 2021, 11:40 PM
TJones updated the task description.

Change 670257 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] Fix Glent M2 CJK suggestion tokenization

https://gerrit.wikimedia.org/r/670257

I added a param to the tokenizer to preserve token separation when creating M2 suggestions. That broke suggestion creation because all single-character tokens are considered. I limited that to CJK characters, which also prevents trying to use Latin and other non-CJK characters in suggestions. (Oddly, the Chinese analyzer splits Greek and Cyrillic words into letters, so this also prevents trying to get suggestions out of в, и, к, и, п, е, д, и, ю, for example.)
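A rough Python sketch of the single-character restriction, for illustration only. It is not the patch itself: `CJK_RANGES`, `is_cjk_char`, and `single_char_candidates` are hypothetical names, and the list of Unicode blocks is a simplified assumption about what counts as CJK here.

```python
# Simplified, assumed set of Unicode blocks treated as "CJK" in this sketch.
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk_char(ch):
    """Return True if the single character falls in one of the blocks above."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


def single_char_candidates(tokens):
    """Keep only one-character tokens that are CJK; one-letter Latin, Greek,
    or Cyrillic tokens (such as letters split out of a word) are skipped."""
    return [t for t in tokens if len(t) == 1 and is_cjk_char(t)]


print(single_char_candidates(["в", "и", "к", "探", "a", "トーフ", "이"]))
# ['探', '이']
```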

Added a bunch of tests to GlentUtilsTest to cover the relevant cases and refactored it a bit. Added some new tests elsewhere because I thought they might be the source of the problem, even though they weren't. More tests is better tests, eh?

Also added some logic to prevent duplicate suggestion token candidates being created. It ain't much but it's honest work.
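As a sketch of the idea only (the helper name `dedupe_candidates` is hypothetical, and the real logic in the patch may differ), duplicates can be dropped while keeping first-seen order:

```python
def dedupe_candidates(candidates):
    """Drop repeated candidates, keeping the first occurrence of each.
    dict preserves insertion order in Python 3.7+."""
    return list(dict.fromkeys(candidates))


print(dedupe_candidates(["探", "险", "探", "家", "险"]))  # ['探', '险', '家']
```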

Change 670257 merged by jenkins-bot:
[search/glent@master] Fix Glent M2 CJK suggestion tokenization

https://gerrit.wikimedia.org/r/670257

Seems we are about ready. Should I run a release on glent and update airflow with the new jar?

Yeah, I think so!