Inappropriate/broken redirecting of Japanese in search
Closed, DeclinedPublic

Description

Undesirable behavior
  • ぺろぺろ [peropero] redirects to べろべろ [berobero] (ignoring the dakuten?)
  • させん [sasen] redirects to さーせん [saasen] (discarding the long vowel sign?)
Inconsistent behavior
  • まーす [maasu] does not redirect to ます [masu] (the long vowel sign isn't as optional as it appears to be?)
  • ブーン [buun] redirects to フン [fun] (or not??)
Bad behavior
  • デス [desu] does not redirect to です [desu] (katakana→hiragana)
  • ほいっぷ [hoippu] does not redirect to ホイップ [hoippu] (hiragana→katakana)

I don't remember when the "undesirable behavior" started but it has bugged me recently on en.wikt when searching for entries that may or may not exist.

As far as I can tell messing with Preferences > Search doesn't solve anything.

Restricted Application added a subscriber: Aklapper. Aug 20 2017, 6:14 AM
Restricted Application added projects: Discovery, Discovery-Search. Aug 20 2017, 9:25 AM
EBjune added a subscriber: TJones. Aug 21 2017, 4:56 PM

@Suzukaze-c:
Is this on your website? Which MediaWiki version? (= MediaWiki-Search used by default.)
Or is this on some Wikimedia site? (= CirrusSearch used)

For more information on how to include enough information, see https://mediawiki.org/wiki/How_to_report_a_bug - thanks! :)

@Aklapper: It's on the English Wiktionary.

Suzukaze-c updated the task description. Aug 23 2017, 8:34 AM
TJones added a comment (edited). Aug 24 2017, 3:07 PM

@Suzukaze-c, thanks for the report of undesirable/unexpected behavior. I’ll try to explain some of the unexpected behavior and explore some of the inconsistent behavior to figure out what’s going on and what we should do about it. I also have some suggestions for search patterns that may improve your results.

In general, we employ fairly aggressive character folding when indexing and searching. Character folding generally converts characters to "more basic" versions, and we use a plugin for folding that comes with Elasticsearch, the open source search engine underneath the CirrusSearch MediaWiki extension.

That Elasticsearch plugin is in turn based on a standard library built by the ICU Project ("International Components for Unicode"). It can be a little aggressive and it has some clear errors in it (as a linguist, I find some of the normalizations of phonetic symbols atrocious), but it's an enormous project that tries to cover a huge range of characters, and it does a pretty good job overall.

Some folding examples (see Figures 1 and 2 in the Unicode Normalization report for more):

  • ℍ → H
  • [NBSP] → [SPACE]
  • ① → 1
  • ｶ → カ
  • ︷ → {
  • i⁹ → i9
  • ㌀ → アパート
  • ¼ → 1/4
  • ǆ (one character) → dž (two characters)
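Many of the examples above are Unicode "compatibility" mappings, which Python's standard library can reproduce via NFKC normalization. This is only an approximation of the ICU folding used in search (ICU folding goes further, e.g., stripping diacritics), but it gives a feel for the kind of mapping involved:

```python
import unicodedata

# NFKC normalization applies Unicode compatibility mappings,
# which cover several of the folds listed above.
for src in ["ℍ", "①", "㌀", "i⁹"]:
    print(src, "→", unicodedata.normalize("NFKC", src))
# ℍ → H
# ① → 1
# ㌀ → アパート
# i⁹ → i9
```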

The general philosophy for folding in CirrusSearch is that for a project in a given language, we don’t fold characters that are distinct in that language, but we do fold everything else. So, in English we fold everything. In Swedish (T160562) we don’t fold å & ä to a, or ö to o, but fold everything else. In Russian, we don’t fold й to и (but do fold ё to е; T124592). We have a long-term plan to properly configure the folding exceptions for every language, but that’s not yet done, and is currently driven mostly by Phab tickets as they come up.

If you are using the search box in the upper right corner of the page (upper left for right-to-left languages), it is called the “Go Feature” and it uses a particular query analysis process. It does not break the query into separate words (“tokenization”) but it does do ICU character folding. If it finds only one folded title match it takes you directly to that article. If there are multiple folded matches, the rules are a little more complex, but I believe it will take an exact unfolded match over any other, and in the case of variants that only differ by case, it seems to prefer the one that’s closest to lowercase, but I’m not 100% sure.

As an extreme example of folding, searching for Ãłḇɛʀƭ Ɇịȵṧțǝȉɲ on English Wikipedia takes you right to Albert Einstein. More reasonable examples with one more or one less diacritic than expected work, too.

For inexperienced users and users working in languages or writing systems they aren’t familiar with, this is generally a good thing. Experienced users do complain about false positives when an accented form matches an unaccented or differently accented form. And of course there’s the problem of searching for something that doesn’t exist, only to be given a near match.

A slightly ambiguous example: both umlaut and Umlaut are folded to umlaut. However, for each there is an exact match available. Searching for ümlaut, however, doesn’t give an exact match and since there are multiple matches available (and none that differ by case in the right way), you get rolled over to the Special:Search results page.
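A rough sketch of this kind of accent folding, for combining diacritics at least, can be written with Python's standard library. This is not the ICU implementation — ICU also maps stroked and otherwise modified letters like ł and ɇ, which this sketch does not — but it shows the basic mechanism:

```python
import unicodedata

def fold_diacritics(s):
    # Decompose accented characters (é → e + combining acute accent),
    # then drop the combining marks, leaving the base letters.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_diacritics("résumé"))  # resume
print(fold_diacritics("ümlaut"))  # umlaut
```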

Once you get to Special:Search, we have other paths of query analysis available. The “text” field indexes and matches on text that’s been parsed by a language-specific module (for the few dozen languages for which they are available). It does things like dropping common stop words (e.g., “the”, “of”, “a”, “an”, etc.) and stemming (converting “hope”, “hopes”, “hoped”, and “hoping” to all be indexed as “hope” so they find each other), in addition to the ICU folding.

We also index the “plain” field, which indexes the text without stemming or stop words. It does apply ICU character folding, but also keeps a copy with the original characters. The “plain” analysis also uses the ICU tokenizer, which sometimes does something better (or at least different) when splitting “words” for languages that don’t use spaces, like Japanese, Chinese, Thai, and others.

The text field encourages inexact matches, and the plain field encourages exact matches, but also allows for inexact matches. Thus resume, resumé, and résumé all match each other, but exact matches from the plain field, especially in the title, usually rank a bit higher.
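As a toy illustration of the text-field analysis described above (stop-word removal plus folding; real stemming, e.g. Porter, is omitted for brevity, and the stop-word list here is a tiny made-up sample, not the actual one):

```python
import unicodedata

STOP_WORDS = {"the", "of", "a", "an"}  # tiny illustrative list, not the real one

def fold(s):
    # Approximate ICU accent folding: decompose, then drop combining marks.
    d = unicodedata.normalize("NFKD", s)
    return "".join(c for c in d if not unicodedata.combining(c))

def text_field_tokens(query):
    # Lowercase, split on spaces, drop stop words, fold accents.
    return [fold(w) for w in query.lower().split() if w not in STOP_WORDS]

print(text_field_tokens("the résumé of a student"))  # ['resume', 'student']

# resume, resumé, and résumé all produce the same folded token,
# which is why they match each other.
print({fold(v) for v in ["resume", "resumé", "résumé"]})  # {'resume'}
```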

With all that in mind, here's how these examples get analyzed and indexed. I've included quotes around the tokens to make them a bit clearer, and to show that some tokens are empty.

I believe the “Near Match ASCII Folding” is the most relevant, since it seems you are searching with the Go Feature.

Term | Near Match ASCII Folding | English Text Field Analysis | English Plain Field Analysis
ぺろぺろ | "へろへろ" | "へ" "ろ" "へ" "ろ" | "へ" "ぺ" "ろ" "へ" "ぺ" "ろ"
べろべろ | "へろへろ" | "へ" "ろ" "へ" "ろ" | "へ" "べ" "ろ" "へ" "べ" "ろ"
させん | "させん" | "さ" "せ" "ん" | "さ" "せ" "ん"
さーせん | "させん" | "さ" "" "せ" "ん" | "さ" "" "ー" "せ" "ん"
まーす | "ます" | "ま" "" "す" | "ま" "" "ー" "す"
ます | "ます" | "ま" "す" | "ま" "す"
ブーン | "フン" | "フン" | "フン" "ブーン"
フン | "フン" | "フン" | "フン"
デス | "テス" | "テス" | "テス" "デス"
です | "てす" | "て" "す" | "て" "で" "す"
ほいっぷ | "ほいっふ" | "ほ" "い" "っ" "ふ" | "ほ" "い" "っ" "ふ" "ぷ"
ホイップ | "ホイッフ" | "ホイッフ" | "ホイッフ" "ホイップ"

In the case of ぺろぺろ / べろべろ we can see that the folded versions are the same, so they match. This is generally a feature, not a bug. It usually allows inexact matches (that do not involve characters that are distinct in the “host” language of the wiki) to work. In this particular case, it may be that normalizing both ぺ and べ to へ is a case of the ICU folding library being too aggressive and thus a mistake.

ー is dropped by the Go Feature / Near Match ASCII Folding because it normalizes to either nothing or to an empty string. So, you can search for doーgcーaーtcーher and get rolled over to the entry for dogcatcher.

In the general case of ー, the ICU normalizer drops ー when it occurs adjacent to katakana, but indexes it as an empty token when it occurs between hiragana. For the Go Feature, these are effectively the same, because “nothing” and “an empty string” are the same when combined as part of a larger string. For the text field in particular, though, it’s odd because it does index the empty string as a token. (Other characters also get indexed that way, which is strange.) So, that’s weird.

Anyway, させん matches さーせん, and ブーン matches フン, because they normalize the same way under the Go Feature, which ignores ー.
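The two behaviors discussed above — stripping dakuten/handakuten and dropping ー — can be approximated with a short Python sketch. This is not the actual CirrusSearch/ICU code, just an illustration, but it happens to reproduce the "Near Match" column for the examples in the table:

```python
import unicodedata

def near_match_fold(s):
    # NFD decomposes voiced kana (ぺ → へ + combining handakuten), then we
    # drop the combining dakuten (U+3099) and handakuten (U+309A) marks,
    # plus the long vowel mark ー, as the Go Feature folding effectively does.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if c not in "\u3099\u309aー")

print(near_match_fold("ぺろぺろ"))  # へろへろ (same as for べろべろ)
print(near_match_fold("さーせん"))  # させん
print(near_match_fold("ブーン"))    # フン
```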

The treatment of まーす and ます seems inconsistent, but the difference is a bit subtle. It’s like the umlaut/Umlaut/ümlaut example above. まーす gets regularized to ます, but so does まず. Since there’s not just one result for the regularized form (ます) and the query (まーす) is not an exact match for either potential match (ます or まず), you get rolled over to Special:Search results.

As for normalizing across katakana and hiragana (デス / です and ほいっぷ / ホイップ), I think that’s a new feature request, and it doesn’t seem to be something that even Google Japan does. デス gets ~40M results, while です gets ~4.3B!
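For what it's worth, the mapping itself would be mechanically simple, since hiragana and katakana occupy parallel Unicode blocks offset by 0x60. A hypothetical sketch (not anything CirrusSearch actually does):

```python
def katakana_to_hiragana(s):
    # Katakana ァ (U+30A1) through ヶ (U+30F6) sit exactly 0x60 above
    # their hiragana counterparts. ー (U+30FC) has no hiragana
    # counterpart and falls outside the range, so it passes through.
    return "".join(
        chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c
        for c in s
    )

print(katakana_to_hiragana("デス"))      # です
print(katakana_to_hiragana("ホイップ"))  # ほいっぷ
```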

There are a couple of more complicated search patterns you can use that should help you find what you want. If you use Special:Search instead of the Go Feature, you always get a list of results and never get redirected to an entry.

  • You can go to the upper-right search box and just hit enter to be taken to the Special:Search page, with an empty search box and no results.
  • You can also put a tilde (~) in the upper-right search box before your query to block the Go Feature and force results on the Special:Search page, e.g., ~ぺろぺろ. However, for people who want to create an entry or article, this can be annoying because it will suggest that you can create the page "~ぺろぺろ", which is not what most people want, so the previous option is better.
  • Putting quotes around your query blocks the character folding on the Special:Search page (but not in the upper right/Go Feature search), so searching for "べろべろ" gets 3 results, while searching for "ぺろぺろ" gets 0. (You can also use the tilde in the upper right/Go Feature box with quotes, e.g., ~"べろべろ".) Using quotes does seem to block the article creation link, though.
  • If you are looking particularly for entries/articles only by title, you can combine quotes with the intitle: keyword. intitle:"べろべろ" gets exactly one result.

It’s always difficult and complicated to find an equilibrium between the needs of inexperienced users—who we often can’t ever really communicate with; some won’t even read help pages!—and the needs of sophisticated users and editors. The best balance seems to be to make the default search do something reasonably expansive for the naive users and provide tools that help sophisticated users do the things they want and need to do, albeit somewhat more verbosely and thus less efficiently at times. Hopefully some of the search syntax tricks above will help in your own work.

Please let me know if you think there’s anything else we should be doing, or if any of my explanations aren’t clear.

debt awarded a token.Aug 24 2017, 5:45 PM
debt added a subscriber: debt.
Suzukaze-c added a comment.EditedSep 3 2017, 2:05 AM

Thank you for your explanation of Elasticsearch's behavior, and tricks that can be used to get desired results.

My thoughts:

  • There are a multitude of terms that may be spelled in either hiragana or katakana; the names of animals are a ready example. Normal Japanese words may also be spelled in either hiragana or katakana for special effect. A comparison with uppercase/lowercase Latin is not perfect, but it is close enough.
  • I suppose stripping dakuten is as sensible as allowing users to get from nhiet do to nhiệt độ, as someone might easily confuse ば and ぱ. However, to someone fairly acquainted with Japanese, it does not seem as sensible.
  • On the other hand, removing ー still seems silly to me.
TJones added a comment.Sep 5 2017, 3:13 PM

The hiragana / katakana mapping seems like a straightforward operation, yet it is not happening in major search engines. I used オオカミ/おおかみ ("wolf") as an example and searched Google, Bing, Yahoo Japan, DuckDuckGo, and Goo. (Both Google and Google Japan give the same results.)

Query | Google | Bing | Yahoo Japan
おおかみ | 2,760,000 | 516,000,000 | 2,860,000
オオカミ | 19,600,000 | 17,900,000 | 19,800,000
オオカミ おおかみ | 33,800,000 | 83,900,000 | 31,600,000
オオカミ -おおかみ | 653,000 | 17,400,000 | 685,000
おおかみ -オオカミ | 450,000 | 6,910,000 | 430,000
オオカミ OR おおかみ | 19,700,000 | 2,320,000 | 19,800,000

DuckDuckGo and Goo don't show result counts, but both obviously give different results for おおかみ and オオカミ.

On the first page of Google hits, おおかみ appears only once on the results page for オオカミ, and オオカミ appears only a few times on the results page for おおかみ. (The Wikipedia article for オオカミ is always the top result; it includes both forms.)

Something weird is happening with the query オオカミ おおかみ, since all three engines give more results for it, even though the intersection should be smaller. I added オオカミ OR おおかみ and it gets fewer results, even though Google's advanced search page says that's really how you do "OR".

I also checked to see whether the search engines highlight one form when you search for the other. Results were mixed:

Google / Google.co.jp | yes
Bing | no
Yahoo Japan | yes
DuckDuckGo | no
Goo.ne.jp | yes

Anyway, it's clear that big, for-profit search engines treat hiragana and katakana differently, so there must be a reason for it, though I cannot see what it is.


As for stripping dakuten, note that it happens on English-language projects. On Japanese Wikipedia, the two forms get different results (ぺろぺろ 57, べろべろ 82).

I don't know what's up with stripping ー either, though a Japanese-specific language processor for Elasticsearch, Kuromoji, does it as well.


If I've explained all the apparent inconsistencies and shown that searching Japanese words on English-language projects is working as intended, can we close this ticket?

I suggest opening another ticket if you want to make a new feature request for the hiragana/katakana mapping functionality. I'd want to see some community discussion to see if we could uncover the reason why it doesn't happen elsewhere. Alternatively, it would make for an interesting advanced search function, though I'd have to dig into some technical details to determine whether it is feasible.

You know more than me; please deal with this ticket as you see fit. I shall open a new ticket for hiragana←→katakana.

debt closed this task as Declined.Sep 19 2017, 10:02 PM

We'll go ahead and close this ticket out, as we've asked for additional feedback here: T176197#3619678

TJones added a comment.Nov 6 2017, 4:52 PM

FYI: my recommendation over on T176197 is to enable hiragana-to-katakana mapping for English, but not for Japanese, because there it runs afoul of a couple of ugly tokenization bugs. We'll also look into whether the French- and Russian-language communities feel they might benefit from this; if so, we may enable it for those languages as well.