
Filter and sort search results of Japanese kana search queries in accordance with how much of the query appears as a consecutive substring
Closed, Declined · Public · Feature

Description

Feature summary (what you would like to be able to do and where):

The search results for Japanese kana search queries should be filtered and sorted based on whether, or how much of, the query appears as a consecutive substring in the result. Ideally, one or two transpositions, substitutions, deletions or insertions should be forgiven when searching, though the resulting forms (the queries derived by performing these four transformations once or twice) should still be required to appear "mostly" consecutively within the pages or their titles. "Mostly", because sometimes an intermediate kanji should be forgiven: searching for "ひさし振り" should still yield 久し振り, 久しぶり or ひさしぶり even though the query is not a perfect substring of those titles. So when the query and a page share some consecutive runs, that should still count for something.
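To make the request concrete, here is a minimal sketch of the kind of scoring being asked for. It is purely illustrative: none of these functions exist in CirrusSearch, and the edit budget and weights are arbitrary. The idea is to forgive one or two edits (including adjacent transpositions) and otherwise give credit for long consecutive runs shared by the query and the title:

```python
def longest_common_run(query: str, title: str) -> int:
    """Length of the longest consecutive substring shared by both."""
    best = 0
    prev = [0] * (len(title) + 1)
    for qc in query:
        cur = [0] * (len(title) + 1)
        for j, tc in enumerate(title, start=1):
            if qc == tc:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def edit_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein (optimal string alignment variant): counts
    insertions, deletions, substitutions and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def score(query: str, title: str, max_edits: int = 2) -> float:
    """1.0 for near-matches within the edit budget; otherwise the share
    of the query covered by the longest consecutive common run."""
    if edit_distance(query, title) <= max_edits:
        return 1.0
    run = longest_common_run(query, title)
    return run / len(query) if run >= 2 else 0.0

print(score("ひさし振り", "久し振り"))  # 1.0: one deletion plus one substitution
print(score("ちいさ", "あさいち"))      # 0.0: only a permutation, no shared run
```

Under this toy scoring, 久し振り ranks at the top for the query ひさし振り (it is within two edits), while titles that merely contain the query's characters in some other order score zero.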

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

When searching for Japanese kana text, the advanced search returns many spurious results that merely contain the requested kana characters individually, somewhere in the article, with no consideration of whether they appear consecutively. To give some examples:

When searching for the string "ちいさ" (https://en.wiktionary.org/w/index.php?title=Special:Search&limit=500&offset=0&profile=default&search=%E3%81%A1%E3%81%84%E3%81%95&ns0=1), which is the stem of ちいさい / 小さい, a real word, we observe the following on the search results page:

  1. 小さい is the first result (good)
  2. The 3 entries whose titles contain "ちいさ" as a substring appear as the 2nd, ≈250th and ≈800th results respectively. They should all be close to the top.
  3. Already the 3rd (あさいち) and 4th (いさちる) results don't contain the string "ちいさ" at all, but merely a permutation of it.

When searching for the nonsensical string "るよふぃ" (https://en.wiktionary.org/w/index.php?limit=500&offset=0&profile=default&search=%E3%82%8B%E3%82%88%E3%81%B5%E3%81%83&title=Special:Search&ns0=1), we observe that, of the 131 search results, only 7 contain the mora "ふぃ" at all. Note that the small ぃ combines with the preceding character so that the two together represent a single mora. Digraphs of a normal character plus such a small character should be given extra weight in terms of consecutiveness.
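To illustrate that grouping (a hypothetical helper, not anything in the current search stack), small kana can be attached to the preceding character before consecutiveness is measured:

```python
# Hypothetical helper: treat "normal character + small kana" digraphs
# such as ふぃ as single units when measuring consecutiveness.
SMALL_KANA = set("ぁぃぅぇぉゃゅょゎァィゥェォャュョヮ")

def morae(text: str) -> list[str]:
    units: list[str] = []
    for ch in text:
        if units and ch in SMALL_KANA:
            units[-1] += ch  # fold the small kana into the previous unit
        else:
            units.append(ch)
    return units

print(morae("るよふぃ"))  # ['る', 'よ', 'ふぃ']
```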

When searching for a long, randomly generated string (https://en.wiktionary.org/w/index.php?search=%E3%81%A0%E3%82%89%E3%81%91%E3%81%AB%E3%81%A9%E3%81%A8%E3%82%80%E3%82%89%E3%81%82%E3%81%8A%E3%81%84%E3%81%BB%E3%81%97%E3%82%8A%E3%81%86%E3%81%BF%E3%81%8B%E3%81%9F+%E3%81%82%E3%82%8B%E3%81%8F%E3%81%82%E3%81%BE%E3%81%84%E3%81%AA%E3%81%A4&title=Special:Search&profile=advanced&fulltext=1&ns0=1), there are still hundreds of results.

Related discussion: https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2022/July#Search_result_ordering?

Benefits (why should this be implemented?):

Search results that are almost identical to the query are buried under hundreds or even thousands of spurious hits that merely contain a random permutation of the search query. Everybody who looks up Japanese words using kana would benefit from being shown the near hits first, because those are more likely to be what they were looking for.

Event Timeline

TL;DR: Sorry not to have better news. Processing of Japanese text on non-Japanese wikis is inconsistent and weird and complicated and hard to change or improve. Searching with the judicious use of double quotes and spaces may help improve accuracy of search results for Japanese kana queries on English-language wikis.


@Fytcha: Unfortunately, I don't think we're likely to see this level of complexity in processing Japanese scripts (or generally for CJK text) on English-language wikis.

I'll try to give some background on the sub-optimal way things currently work so maybe you can craft queries that are better at finding what you are looking for.

Tokenization is the process of breaking up text into tokens (which are often words, but not necessarily). English-language wikis use Lucene's "standard" tokenizer, which is not very smart about CJK characters. For some unknown reason, it treats hiragana and katakana differently, too. Hiragana is broken up into one-character pieces, so ひらがな is broken into four tokens: ひ, ら, が, な. Katakana is not broken up at all, so カタカナ stays together as one token. (I've opened an upstream bug on this, but it may actually be "correct" in that the Unicode Segmentation Algorithm seems to treat hiragana and katakana differently. I'm not very familiar with Japanese, but that seems weird to me.)
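A toy re-creation of that behavior (this is not Lucene's actual tokenizer, which implements the Unicode segmentation rules; it just mimics the outcome described above):

```python
import unicodedata

# Toy imitation of the described behavior: runs of katakana stay
# together as one token, hiragana is emitted one character at a time,
# and everything else is treated as a separator.
def toy_standard_tokenize(text: str) -> list[str]:
    tokens: list[str] = []
    katakana_run = ""
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("KATAKANA"):
            katakana_run += ch
            continue
        if katakana_run:
            tokens.append(katakana_run)
            katakana_run = ""
        if name.startswith("HIRAGANA"):
            tokens.append(ch)
    if katakana_run:
        tokens.append(katakana_run)
    return tokens

print(toy_standard_tokenize("ひらがな"))  # ['ひ', 'ら', 'が', 'な']
print(toy_standard_tokenize("カタカナ"))  # ['カタカナ']
```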

However, our "main" language processing pathway (the "text" field) for English maps hiragana to katakana (a requested feature, see T176197: Allow hiragana searches to find katakana results and vice versa), so ひらがな is converted to ヒラガナ before tokenization. The "text" processing does lots of other stuff, too, though much of it is English-specific and not relevant here.
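The fold itself is mechanical, because the hiragana and katakana Unicode blocks are laid out in parallel at a fixed offset of 0x60. A minimal Python equivalent (the production version is a character filter in the Elasticsearch/Lucene analysis chain, not this code):

```python
# Katakana letters sit exactly 0x60 codepoints above their hiragana
# counterparts (U+3041..U+3096 vs U+30A1..U+30F6).
HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}

def fold_kana(text: str) -> str:
    return text.translate(HIRA_TO_KATA)

print(fold_kana("ひらがな"))  # ヒラガナ
```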

However, again! We have a secondary language processing pathway (the "plain" field) that does very little extra processing besides lowercasing and, for English, removing most diacritics. Since it keeps ひらがな as hiragana, it is tokenized as individual characters.

The upshot of all this is that when you search for katakana on English-language wikis, an entire string of katakana (without any spaces or punctuation) will be treated as one token (essentially one word), and will only match if the whole string matches. If you search in hiragana on English-language wikis, each individual character will be treated as a separate token (essentially a separate word), and while there is some ranking preference for those characters being near each other, it is not required. Characters that are less common (on the English-language wiki, which may or may not be the same as being less common in Japanese) will rank higher, much like less common English words rank higher: "the" ranks very low, while "triskaidekaphobia" carries much more weight if you search for "the triskaidekaphobia".
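That "rarer is more important" behavior is the classic inverse-document-frequency idea (the actual ranking is Elasticsearch's default similarity, which builds on it); a toy calculation with made-up document counts:

```python
import math

# Toy inverse document frequency with made-up counts: a token found in
# nearly every document contributes almost nothing to the score, while
# a rare token dominates it.
def idf(docs_with_token: int, total_docs: int) -> float:
    return math.log(total_docs / (1 + docs_with_token))

print(idf(6_000_000, 6_500_000))  # ~0.08: a near-ubiquitous token
print(idf(12, 6_500_000))         # ~13.1: a rare token
```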

Japanese-language wikis currently use the Lucene "CJK" language analyzer for the "text" field, which breaks text into overlapping bigrams (two-character sequences). So ひらがな is tokenized as ひら, らが, and がな. This isn't great, but it is a compromise between one long token and every character being its own token. It does bad things across word boundaries, though, since the last character of one word and the first character of the next word also create a token; that's fine for finding an exact phrase, but not when the words of the search query may not appear together in the article text.
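A sketch of the bigram behavior (the real implementation is Lucene's CJK bigram filter; this toy just shows how the tokens, including the unwanted cross-boundary one, arise):

```python
# Toy version of overlapping CJK bigrams. As in the real filter,
# bigrams do not cross spaces or punctuation.
def cjk_bigrams(text: str) -> list[str]:
    tokens: list[str] = []
    for word in text.split():
        if len(word) == 1:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens

print(cjk_bigrams("ひらがな"))   # ['ひら', 'らが', 'がな']
print(cjk_bigrams("ひら がな"))  # ['ひら', 'がな']: a space suppresses らが
```

This is also why inserting spaces at word boundaries, as described next, avoids the less useful cross-boundary bigrams.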

I've seen evidence on Chinese-language wikis (which used to use the CJK language analyzer, too) that sophisticated searchers would break their query into words with spaces to prevent those less useful bigrams from being created (the overlapping bigrams don't cross spaces or punctuation). I wouldn't be surprised if Japanese searchers do something similar. Chinese searchers don't need to as much anymore, because we now have a Chinese-specific tokenizer with a dictionary that does a pretty good job of breaking up long strings of characters into words. I assessed something similar for Japanese back in the day (T166731), but it wasn't good enough. I'd like to look at it again (T178923#4743673), both to see whether it's gotten any better and to see whether I can configure it or patch it with filters to make it better (which worked for Korean; see T178925), but time is a finite resource.

My immediate advice for getting better results is to use quotes when searching for individual words or phrases: "ちいさ" (with quotes) rather than just ちいさ. If you have multiple words that don't necessarily need to be together in the document you are looking for, break them up and optionally add quotes around each, such as "あお" "あい", which finds あおはあいよりいでてあいよりあおし as the first result (a luckily fun example; it was the longest title on the first page of Category:Japanese hiragana).

As for this example:

ひさし振り should still yield 久し振り, 久しぶり or ひさしぶり

The only "optional" words/tokens we support are stop words, and we generally only support those for the language of the current wiki (and sometimes also a few of the most common stop words of English even if the language of the wiki isn't English). So, searching for the notable programming languages on English Wikipedia brings up the article "List of programming languages", even though the article doesn't contain the—not many articles contain no instances of the, but there are some!

So, I don't imagine we'll be supporting partial matches in "foreign" scripts any time soon.

On top of all that, Elasticsearch (the search engine that on-wiki search is built on) has changed its license, and we will eventually have to migrate away from it (see T272111: Elasticsearch, a CirrusSearch dependency, is switching to SSPL/Custom licence for more). This is also putting a damper on any heavy investment in custom development of language analysis tools until we know where that is going. I'm still working on smaller tasks and in-progress plans, but I can't see taking on a big effort for processing Japanese on English-language wikis in the near or medium term.

Sorry not to have better news.

Declining this ticket given TJones' response above. Please reopen if necessary.