
[EPIC-ish][Milestone 2] Implement NLP Search Suggestion Method 2 for CJK languages
Open, Medium, Public

Description

This ticket is to track work by @Julia.glen on implementing NLP Search Suggestion “Method 2” for Chinese, Japanese, and Korean.

From the parent task:

  • Method 2: Use external resources other than search logs (e.g., dictionaries with word frequencies) as a source for spelling corrections, using existing open-source proximity/spell-checking algorithms. Only applicable to languages with relevant linguistic resources.

Please create sub-tasks or add details here as necessary.

Event Timeline

TJones triaged this task as Medium priority. Jan 3 2019, 8:06 PM
TJones created this task.
Restricted Application added a subscriber: revi. Jan 3 2019, 8:06 PM
TJones renamed this task from [EPIC-ish][Milestone 3] Implement NLP Search Suggestion Method 2 for CJK languages to [EPIC-ish][Milestone 2] Implement NLP Search Suggestion Method 2 for CJK languages. Mar 20 2019, 3:45 PM

Korean Tokenization

@Julia.glen & @dcausse: Here are my suggested changes from the current production text analyzer for Korean (see https://ko.wikipedia.org/w/api.php?action=cirrus-settings-dump)

  • Change nori_tokenizer's decompound_mode from mixed to discard
    • discard will break a compound into pieces (e.g., 가곡역 => 가곡, 역); mixed keeps the original, too (e.g., 가곡역 => 가곡역, 가곡, 역). For search that's good, but for suggestions we definitely don't want the duplication!
    • It's possible that keeping only the original compound (decompound_mode == none) may give better suggestions. It's a question of recall (discard) vs precision (none); see the _analyze sketch after this list for a quick side-by-side comparison.
  • Char filter nori_charfilter should be kept, with one change:
    • But drop the mapping "\\u0130=>I"; it's only necessary if we keep icu_normalizer instead of lowercase.
    • The two characters that map to space (\u0020), \u00B7 and \u318D ( · and ㆍ ), are dot-like characters used like commas in lists. Without the mapping, strings using them may not be tokenized correctly (a bug fix is in the works upstream). I think it's okay if those get converted to spaces in suggestions.
    • We can keep the mappings from soft hyphen (\u00AD) and zero-width non-joiner (\u200C) to nothing; most people don't know they are using them (they often come along with cut-n-paste from another source), and they are not normally useful in Korean text.
  • Char filter nori_combo_filter should be kept if possible; it strips combining diacritical marks and is there for non-Korean text (esp. Cyrillic, but also others). Dropping it would not be terrible if we worry that the pattern_replace is slow.
  • I'd drop nori_posfilter for now, since it is essentially a stop word filter, though it works on parts of speech (POS) rather than specific words. Some parts of speech are not useful for search, but they shouldn't be dropped from suggestions, if only because dropping them could look weird.
    • In the future, it may make sense not to make suggestions on certain parts of speech, or possibly to combine certain parts of speech with the preceding token. It's kinda like tokenizing John's as John + 's; we probably don't really want to mess with the 's.
  • nori_readingform converts Hanja to Hangul, which we decided we don't want (to stay as close to the original text as possible), so drop it.
  • If we want to stay as close to the original text as possible, also drop icu_normalizer and put back lowercase.
  • The nori_length filter is there to remove any (very rare) zero-length tokens that get generated. It seems worth keeping.
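
To make the decompound_mode comparison above easy to eyeball, here's a quick illustrative sketch (not part of the chain itself) that runs the example word through an inline nori_tokenizer via the _analyze API. It assumes a local Elasticsearch instance with the analysis-nori plugin at http://localhost:9200; adjust as needed.

import requests

ES = "http://localhost:9200"  # assumption: local Elasticsearch with the analysis-nori plugin

def nori_tokens(text, mode):
    """Tokenize `text` with an inline nori_tokenizer using the given decompound_mode."""
    resp = requests.post(
        f"{ES}/_analyze",
        json={
            "tokenizer": {"type": "nori_tokenizer", "decompound_mode": mode},
            "text": text,
        },
    )
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

# Per the notes above, for 가곡역 we expect roughly:
#   none    -> ['가곡역']
#   discard -> ['가곡', '역']
#   mixed   -> ['가곡역', '가곡', '역']
for mode in ("none", "discard", "mixed"):
    print(mode, nori_tokens("가곡역", mode))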

The target analysis chain is below:

"tokenizer": {
	"nori_tok": {
		"type": "nori_tokenizer",
		"decompound_mode": "discard"
	}
}

"char_filter": {
	"nori_charfilter": {
		"type": "mapping",
		"mappings": [
			"\\u00B7=>\\u0020",
			"\\u318D=>\\u0020",
			"\\u00AD=>",
			"\\u200C=>"
		]
	},
	"nori_combo_filter": {
		"pattern": "[\\u0300-\\u0331]",
		"type": "pattern_replace",
		"replacement": ""
	}
}
"filter": {
	"nori_length": {
		"type": "length",
		"min": "1"
	}
}

"text": {
		"type": "custom",
		"tokenizer": "nori_tok"
		"char_filter": [
			"nori_charfilter",
			"nori_combo_filter"
		],
		"filter": [
			"lowercase",
			"nori_length"
		],
	}
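
For reference, here is one minimal way to wire the fragments above into index settings and try the analyzer end to end. This is just a sketch: it assumes a local Elasticsearch instance with the analysis-nori plugin at http://localhost:9200, and the index name suggest_ko_test is a made-up scratch name. Note that in full index settings the text analyzer block lives under analysis.analyzer, as shown here.

import requests

ES = "http://localhost:9200"   # assumption: local Elasticsearch with the analysis-nori plugin
INDEX = "suggest_ko_test"      # hypothetical scratch index, used only for testing the chain

settings = {
    "analysis": {
        "tokenizer": {
            "nori_tok": {"type": "nori_tokenizer", "decompound_mode": "discard"}
        },
        "char_filter": {
            "nori_charfilter": {
                "type": "mapping",
                "mappings": [
                    "\\u00B7=>\\u0020",
                    "\\u318D=>\\u0020",
                    "\\u00AD=>",
                    "\\u200C=>",
                ],
            },
            "nori_combo_filter": {
                "type": "pattern_replace",
                "pattern": "[\\u0300-\\u0331]",
                "replacement": "",
            },
        },
        "filter": {"nori_length": {"type": "length", "min": "1"}},
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "nori_tok",
                "char_filter": ["nori_charfilter", "nori_combo_filter"],
                "filter": ["lowercase", "nori_length"],
            }
        },
    }
}

requests.delete(f"{ES}/{INDEX}")  # ignore the 404 on the first run
requests.put(f"{ES}/{INDEX}", json={"settings": settings}).raise_for_status()

# 가곡역 should come back as 가곡 + 역 (decompound_mode: discard), and the middle
# dot in 서울·부산 should act as a separator thanks to nori_charfilter.
resp = requests.post(
    f"{ES}/{INDEX}/_analyze",
    json={"analyzer": "text", "text": "가곡역 서울·부산"},
)
resp.raise_for_status()
print([t["token"] for t in resp.json()["tokens"]])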


Japanese Tokenization

This is based on my analysis of Kuromoji from 2017 and a review of the current Kuromoji docs, with changes based on the needs of spelling suggestions.

The most important thing to note is how compounds are processed. With the default options, they cannot be broken up without keeping the original compound—unless we write a custom filter.

Notes:

  • I've changed the tokenizer mode from search (the default) to normal. I'm not happy about it, but search returns the compound and all its parts, which we don't want. extended mode doesn't include the full compound, but it emits unigrams for unknown words. So, normal seems to be the only option unless we want to do additional processing to filter the full compound.
    • There is an attribute on the token, positionLength, that is >1 for compounds and ==1 for everything else (non-compounds and compound sub-parts), so it is possible to write a custom filter to strip them. However, a very unscientific sample on text (not queries) indicates that compounds are actually moderately rare (<1%), so we can ignore the problem for now if we need to. A quick way to poke at this is sketched after this list.
  • Fullwidth numbers are still tokenized oddly, so the fullwidthnumfix char filter below maps them to ASCII digits before tokenization.
  • kuromoji_baseform is a lemmatizer, kuromoji_stemmer is a stemmer, and ja_stop removes stopwords, so we don't want those for suggestions.
  • cjk_width folds fullwidth ASCII variants into the equivalent basic Latin, and folds halfwidth Katakana variants into the equivalent Kana. This seems like a good normalization.
  • I've used lowercase instead of icu_normalizer for now, assuming we generally want to stay close to the original text.
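
To make the mode trade-off and the positionLength point concrete, here is a rough sketch that compares normal and search modes via _analyze and then flags the compound token in search mode by its positionLength attribute (as described in the bullet above). It assumes a local Elasticsearch instance with the analysis-kuromoji plugin at http://localhost:9200, and uses 関西国際空港 as a stand-in compound.

import requests

ES = "http://localhost:9200"  # assumption: local Elasticsearch with the analysis-kuromoji plugin
SAMPLE = "関西国際空港"  # stand-in example of a compound

def kuromoji_analyze(text, mode, explain=False):
    """Run _analyze with an inline kuromoji_tokenizer in the given mode."""
    body = {"tokenizer": {"type": "kuromoji_tokenizer", "mode": mode}, "text": text}
    if explain:
        # explain/attributes expose per-token attributes such as positionLength
        body["explain"] = True
        body["attributes"] = ["positionLength"]
    resp = requests.post(f"{ES}/_analyze", json=body)
    resp.raise_for_status()
    return resp.json()

# 1. search mode keeps the full compound plus its parts; normal keeps only the compound.
for mode in ("normal", "search"):
    tokens = [t["token"] for t in kuromoji_analyze(SAMPLE, mode)["tokens"]]
    print(mode, tokens)

# 2. In search mode the full compound is the token with positionLength > 1, so a
#    custom filter (or a client-side check like this one) could strip it.
detail = kuromoji_analyze(SAMPLE, "search", explain=True)["detail"]
for tok in detail["tokenizer"]["tokens"]:
    if tok.get("positionLength", 1) > 1:
        print("compound token:", tok["token"])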

The target analysis chain is below:

"tokenizer": {
	"kuromoji_tok": {
		"type": "kuromoji_tokenizer",
		"mode": "normal"
	}
}

"char_filter": {
	"fullwidthnumfix": {
		"type": "mapping",
		"mappings": [
			"\uff10=>0", "\uff11=>1", "\uff12=>2", "\uff13=>3",
			"\uff14=>4", "\uff15=>5", "\uff16=>6", "\uff17=>7",
			"\uff18=>8", "\uff19=>9",
		]
	},
}

"text": {
		"type": "custom",
		"tokenizer": "kuromoji_tok"
		"char_filter": [
			"fullwidthnumfix",
		],
		"filter": [
			"cjk_width",
			"lowercase",
		],
	}
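
As with Korean, here is a minimal end-to-end sketch for trying this chain. It assumes a local Elasticsearch instance with the analysis-kuromoji plugin at http://localhost:9200; the index name suggest_ja_test is a made-up scratch name, and the test string is only there to exercise fullwidthnumfix, cjk_width, and lowercase.

import requests

ES = "http://localhost:9200"   # assumption: local Elasticsearch with the analysis-kuromoji plugin
INDEX = "suggest_ja_test"      # hypothetical scratch index, used only for testing the chain

settings = {
    "analysis": {
        "tokenizer": {
            "kuromoji_tok": {"type": "kuromoji_tokenizer", "mode": "normal"}
        },
        "char_filter": {
            "fullwidthnumfix": {
                "type": "mapping",
                "mappings": [
                    "\uff10=>0", "\uff11=>1", "\uff12=>2", "\uff13=>3",
                    "\uff14=>4", "\uff15=>5", "\uff16=>6", "\uff17=>7",
                    "\uff18=>8", "\uff19=>9",
                ],
            }
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "kuromoji_tok",
                "char_filter": ["fullwidthnumfix"],
                "filter": ["cjk_width", "lowercase"],
            }
        },
    }
}

requests.delete(f"{ES}/{INDEX}")  # ignore the 404 on the first run
requests.put(f"{ES}/{INDEX}", json={"settings": settings}).raise_for_status()

# Fullwidth digits should be folded by fullwidthnumfix, fullwidth Latin and
# halfwidth Katakana by cjk_width, and uppercase Latin by lowercase.
resp = requests.post(
    f"{ES}/{INDEX}/_analyze",
    json={"analyzer": "text", "text": "２０１９年 ＡＢＣ ｶﾀｶﾅ"},
)
resp.raise_for_status()
print([t["token"] for t in resp.json()["tokens"]])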


Gehel raised the priority of this task from Medium to High. Sep 9 2020, 2:47 PM
MPhamWMF lowered the priority of this task from High to Medium. Mar 9 2022, 8:45 PM