
Allow hiragana searches to find katakana results and vice versa
Closed, Resolved · Public

Description

Many terms may be spelled in either hiragana or katakana; animal names are a ready example (e.g., オオカミ/おおかみ "wolf"). Ordinary Japanese words may also be written in either script for special effect. The comparison with uppercase/lowercase Latin letters is not perfect, but it is sufficient.

The hiragana / katakana mapping seems like a straightforward operation, yet it is not happening in major search engines. (See T173650#3580309 for examples.)
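As a rough illustration of why the mapping looks straightforward: in Unicode, hiragana U+3041 to U+3096 and katakana U+30A1 to U+30F6 are laid out in parallel, 0x60 code points apart. The Python sketch below relies only on that fact; the function name `hiragana_to_katakana` is made up for this example and is not taken from any existing code.

```python
# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are laid out in
# parallel, so each hiragana code point plus 0x60 is its katakana twin.
HIRAGANA_START = 0x3041
HIRAGANA_END = 0x3096
KANA_OFFSET = 0x60

# Translation table for str.translate: code point -> code point.
H2K_TABLE = {
    cp: cp + KANA_OFFSET
    for cp in range(HIRAGANA_START, HIRAGANA_END + 1)
}


def hiragana_to_katakana(text: str) -> str:
    """Replace every hiragana character with its katakana counterpart.

    Characters outside U+3041..U+3096 (kanji, Latin, katakana) are
    left untouched.
    """
    return text.translate(H2K_TABLE)


if __name__ == "__main__":
    # Both spellings of "wolf" end up as the same string.
    print(hiragana_to_katakana("おおかみ"))
    print(hiragana_to_katakana("オオカミ"))
```

After this mapping, both spellings of "wolf" compare equal, which is the behavior the request asks for. It also merges any pair of words that differ only by script.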

[Description edited from comments on T173650.]

Example mappings:

Event Timeline

Restricted Application added a subscriber: Aklapper.
TJones subscribed.

Since major search engines don't do this, I'd like feedback from various communities (at least Japanese Wikipedia and English Wikipedia) on whether they'd want this as an always-on feature or as an advanced search (T174064) keyword (e.g., kana:オオカミ).

If this is as straightforward as it seems from the sources, this would be easy to implement as an always-on character mapping in Elasticsearch.
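A minimal sketch of what such an always-on character mapping could look like, assuming Elasticsearch's `mapping` character filter and its `"key => value"` rule syntax. The Python below only builds the index settings; the names `kana_map` and `text_with_kana_map`, and the choice of the `standard` tokenizer with a `lowercase` filter, are placeholders for illustration and not the actual CirrusSearch configuration.

```python
import json


def build_h2k_char_filter() -> dict:
    """Build a `mapping` char filter that rewrites hiragana as katakana.

    One rule is generated per code point in U+3041..U+3096, using the
    fixed 0x60 offset between the hiragana and katakana blocks.
    """
    mappings = [
        f"{chr(cp)}=>{chr(cp + 0x60)}"
        for cp in range(0x3041, 0x3097)
    ]
    return {"type": "mapping", "mappings": mappings}


# Hypothetical index settings: the analyzer and filter names are made up.
settings = {
    "analysis": {
        "char_filter": {"kana_map": build_h2k_char_filter()},
        "analyzer": {
            "text_with_kana_map": {
                "type": "custom",
                "char_filter": ["kana_map"],  # runs before tokenization
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

# Serialize without escaping the kana so the rules stay readable.
settings_json = json.dumps(settings, ensure_ascii=False, indent=2)
```

Because a character filter runs before the tokenizer, the tokenizer only ever sees katakana where hiragana was typed, so the same analyzer must be used at both index time and query time for the two spellings to match.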

debt triaged this task as Medium priority. Sep 19 2017, 10:01 PM
debt subscribed.

We've reached out to a few lists to see if we can get feedback:

We recently got a suggestion via Phabricator[1] to automatically map
between hiragana and katakana when searching on English Wikipedia and other
wiki projects. As an always-on feature, this isn't difficult to implement,
but major commercial search engines (Google.jp, Bing, Yahoo Japan,
DuckDuckGo, Goo) don't do that. They give different results when searching
for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give
different *numbers* of results, seeming to indicate that it's not just
re-ordering the same results (say, so that results in the same script are
ranked higher).[2] I want to know what they know that I don't!

Does anyone have any thoughts on whether this would be useful (seems that
it would) and whether it would cause any problems (it must, or otherwise
all the other search engines would do it, right?).

Any idea why it might be different between a Japanese-language wiki and a
non-Japanese-language wiki? We often are more aggressive in matching
between characters that are not native to a given language--for example,
accents on Latin characters are generally ignored on English-language
wikis. So it might make sense to merge hiragana and katakana on
English-language wikis but not Japanese-language wikis.

Thanks very much for any suggestions or information!
—Trey

[1] https://phabricator.wikimedia.org/T176197
[2] Details of my tests at https://phabricator.wikimedia.org/T173650#3580309

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation

Wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2017-September/088843.html
Wikitech-ambassadors: https://lists.wikimedia.org/pipermail/wikitech-ambassadors/2017-September/001681.html
Discovery: https://lists.wikimedia.org/pipermail/discovery/2017-September/001576.html

@Miya and @Takot are native speakers of Japanese and may be able to share some wisdom on this.

Also posted to:

A few notes from other discussions...

Wikitech-l:

  • suggestion to make it an option in advanced search
  • bonus feature: inexact matches get weighted a bit less in results (compare resume, resumé, and résumé on EN WP)

Japanese Village Pump:

  • a comment that it wouldn't be helpful on Japanese-language wikis because...
    • it is generally limited to animal names
    • some words differ only by kana, e.g., あみん (musician) and アミン (amine), both "amin"
    • hiragana is already used for sort keys, so search in hiragana can match katakana articles by way of hiragana sort keys
  • a comment saying that it may be useful when you forget, and for people who aren't very familiar with the writing systems
  • a comment that some electronic dictionaries do this for headwords; this fits with some of the ideas I have for phonetic searching in general (T174705)
  • a comment pointing out that there are some results across kana in major search engines and examples where it doesn't work in Wikipedia.
  • a "support" vote, which says that the occasional kana mixup, like "amin" above, is fine.
    • (this one and the previous one were harder to understand in translation, so if anyone can take a look and clear up any misunderstandings, that would be helpful).

English Village Pump

  • there's a comment with more examples that are different in hiragana and katakana

I'll continue to update this comment as more feedback comes in.

I think it would be easy to do an always-on, everywhere implementation by defining a hiragana/katakana map in Elasticsearch and enabling it wherever it is wanted. Testing in Relforge would also be straightforward, with both automated testing for impact (re-running queries with kana and seeing how many change) and user testing with a re-indexed snapshot.
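The automated impact check could be sketched roughly as follows. This is a toy, in-memory stand-in and not Relforge: `fold_kana`, `changed_queries`, the two search functions, and the sample titles are all hypothetical, and a real run would compare results from two actual indexes.

```python
from typing import Callable, Iterable, List

# Hiragana U+3041..U+3096 sit exactly 0x60 below their katakana twins.
_H2K = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def fold_kana(text: str) -> str:
    """Rewrite hiragana as katakana; leave everything else alone."""
    return text.translate(_H2K)


def changed_queries(
    queries: Iterable[str],
    search_before: Callable[[str], List[str]],
    search_after: Callable[[str], List[str]],
) -> List[str]:
    """Return the queries whose result lists differ between two searches."""
    return [q for q in queries if search_before(q) != search_after(q)]


# Toy corpus standing in for an index snapshot (made-up titles).
TITLES = ["オオカミ", "おおかみ座", "Wolf"]


def search_exact(query: str) -> List[str]:
    # Baseline: plain substring match, no kana folding.
    return [t for t in TITLES if query in t]


def search_folded(query: str) -> List[str]:
    # Candidate: fold both the query and the titles before matching.
    folded = fold_kana(query)
    return [t for t in TITLES if folded in fold_kana(t)]


QUERIES = ["おおかみ", "wolf", "オオカミ"]
affected = changed_queries(QUERIES, search_exact, search_folded)
print(f"{len(affected)} of {len(QUERIES)} queries changed: {affected}")
```

In this toy run the two kana queries change and the Latin query does not. Note that the katakana query changes too, since folding the indexed text lets it reach hiragana titles.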

If we want a more nuanced implementation—esp. a new keyword—that's a lot more work.

(non-native Japanese user here)

I agree with the idea that we should turn on this feature for sites that don't use hiragana / katakana natively (i.e. everything except sites in Japanese, Ainu and maybe a few obscure languages). One is probably aware of the conventions with hiragana vs katakana if one searches in Japanese projects, but if a non-speaker of Japanese is copying and pasting a phrase in hiragana / katakana and searches for it in e.g. the English Wikipedia, it would make sense to match results in both sets of glyphs since the reader might not be fluent with the conventions. They might also be reading old (pre-1945) texts which typically would only use katakana + kanji!

There are examples of both current false negatives (animal names, Toys R Us) and potential false positives ("amin", Naruto), and both concern and support for the mapping.

It seems like the best approach, based on the feedback so far, would be to enable the mapping on RelForge for, say, English and Japanese Wikipedias and Wiktionaries, and have people try them and see how well it works—checking for both accuracy/completeness of the mapping and the quality of the results.

My expectation would be that for non-Japanese wikis it's going to have a small enough effect that any false positives won't be a huge problem. For Japanese wikis, it's much less clear.

I won't be able to work on this in the immediate future (next few weeks), but I hope to get to it this year. If anyone else can work on it, that'd be fine with me, but I'm happy to work on it!

Whew! What a ride. This turned out to be much more complicated than anticipated for the Japanese analysis. I found three tokenization bugs, one of which depends on context in unexpected ways and so made me question my data collection, which led to me re-running everything... Anyway, because of the bugs in the tokenization, I recommend not deploying this for Japanese.

Looking at the very small effect size on English projects and the apparent usefulness for non-Japanese speakers, I wonder if we should apply this to other non-CJK languages, too. The problem is that it requires unpacking the analyzers for languages that have monolithic analyzers, and some third-party analyzers cannot be unpacked.

My recommendations:

  • Enable Hiragana to Katakana (H2K) mapping for English, as requested.
  • Do not enable H2K mapping for Japanese: it wasn't part of the original request, the community was ambivalent, enabling it exposes a number of tokenization bugs, and it has a very large impact that may or may not be good.
  • For other languages: post to some of the Wikipedia and Wiktionary Village Pumps for (in search volume order) French, Russian, Italian, Swedish, Chinese, and Hebrew. If there is some enthusiasm for it, add it and test it as needed. I'll start with four posts to French and Russian WP/Wikt VPs.
  • File upstream bugs for tokenization problems.

More details on the analysis for English and Japanese, the bugs (1, 2, 3), and options other than my recommendations above are on MediaWiki.

I'll follow up with links to the steps above as they are taken.

Change 389525 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Allow Hiragana to Find Katakana and vice versa

https://gerrit.wikimedia.org/r/389525

Bugs filed:

  • ICU Tokenizer: U+0370 and above affect tokenization of characters after whitespace: issue 27290
  • Standard tokenizer incorrectly tokenizes hiragana: issue 27291
  • ICU Normalizer adds spaces before certain non-combining dakuten and handakuten: issue 27292

Change 389525 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Allow Hiragana to Find Katakana and vice versa

https://gerrit.wikimedia.org/r/389525

The code has been merged, but not deployed. I've created T179945 to re-index English-language wikis after the code is deployed, and added it to T147505.

@TJones - looks like the majority of the feedback on the language wikis is favorable, nice job with the postings and investigation!

Thanks, @Deb!

I've added posts on Italian Wikipedia & Wiktionary, and Swedish Wikipedia & Wiktionary.

Closing this out, as no more work is needed, other than the deployment (T179945). Thanks, @TJones!