
Search should normalize Niqqud diacritics in Hebrew characters
Closed, Resolved, Public

Description

Author: eitan.etz

Description:
In Hebrew there are special characters used for vowels, called "nikud". When a word on the wiki contains these characters, search finds it only if you enter the word with the "nikud", and almost everyone searches without it.


Version: unspecified
Severity: enhancement

Details

Reference
bz69361

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:35 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz69361.
bzimport added a subscriber: Unknown Object (MLST).

Thanks for taking the time to report this!

Could you please provide one exact example here with exact steps to reproduce, so anybody else could follow these steps? Thanks in advance!

eitan.etz wrote:

For example: if you search for the word "הגבר" in this book:
https://he.wikisource.org/wiki/%D7%90%D7%95%D7%93%D7%99%D7%A1%D7%99%D7%90%D7%94_%D7%A9%D7%9C_%D7%94%D7%95%D7%9E%D7%A8%D7%95%D7%A1_%28%D7%98%D7%A9%D7%A8%D7%A0%D7%99%D7%97%D7%95%D7%91%D7%A1%D7%A7%D7%99%29/%D7%A9%D7%99%D7%A8_%D7%A8%D7%90%D7%A9%D7%95%D7%9F
you will find nothing, because the word in this book is "הַגֶּבֶר". This is the same word, but most people will search without the marks on the letters. Thanks!
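
To make the request concrete, here is a minimal Python sketch of the kind of normalization being asked for: strip the Hebrew combining marks so that the pointed and unpointed spellings compare equal. It is only an illustration. `strip_niqqud` is a hypothetical helper, not anything in CirrusSearch or Elasticsearch, and treating every nonspacing mark in U+0591 to U+05C7 (niqqud plus cantillation marks) as ignorable is an assumption.

```python
import unicodedata

# Assumed range: Hebrew points and cantillation marks live in U+0591..U+05C7.
# Only nonspacing marks (category "Mn") are dropped, so punctuation in that
# range, such as maqaf (U+05BE), is kept.
HEBREW_MARKS_START = "\u0591"
HEBREW_MARKS_END = "\u05C7"


def strip_niqqud(text: str) -> str:
    """Return text with Hebrew combining marks removed (hypothetical helper)."""
    # NFD first, so precomposed presentation forms split into letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch
        for ch in decomposed
        if not (
            HEBREW_MARKS_START <= ch <= HEBREW_MARKS_END
            and unicodedata.category(ch) == "Mn"
        )
    )


# The word from the report, written with escapes to avoid encoding surprises:
# he + patah, gimel + segol + dagesh, bet + segol, resh.
POINTED = "\u05D4\u05B7\u05D2\u05B6\u05BC\u05D1\u05B6\u05E8"
PLAIN = "\u05D4\u05D2\u05D1\u05E8"

print(strip_niqqud(POINTED) == PLAIN)
```

With this kind of folding, the word as typed by most searchers and the word as it appears in the book reduce to the same string.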

Short version: this should be resolved when T167058 is closed.

This should be taken care of when the new Hebrew language analyzer (HebMorph) is deployed and the Hebrew wikis are re-indexed. HebMorph accounts for niqqud (as well as doing additional stemming). You can read waaaay too much about HebMorph in my write up, or for the very short version see T162741.

Re-indexing Hebrew wikis is tracked in T167058. We have to deploy the plugin first (T167057), which has been delayed a bit by upgrades to Elasticsearch, but more by the complexity of the plugin deployment, which is worse for HebMorph because it has an external dictionary file. That should all be improved by T158560.

In the meantime, you can try out a demo of the new language analysis:

  • http://he-wp-hebmorph-relforge.wmflabs.org/
  • The demo only has the index, so it shows results and snippets, but you can't follow any links to articles.
  • The demo will eventually be recycled, but it's up as I post this.

Generally, if searching for x doesn't find y, then searching for y doesn't find x either. So: searching for הַגֶּבֶר turns up lots of results for הגבר.

And if I add on a specific clause ( "אֲשֶׁר יִבְטַח בַּה' וְהָיָה ה' מִבְטַחוֹ") to make the specific results with niqqud show up in the results snippet, searching for הגבר finds הַגֶּבֶר.
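
The symmetry described here is what you would expect if the same folding is applied both when indexing and when querying. A toy Python sketch under that assumption follows. `ToyIndex` and `fold` are hypothetical names, and this says nothing about how HebMorph or CirrusSearch actually do it (HebMorph also does stemming, which is not modeled here).

```python
import unicodedata
from collections import defaultdict


def fold(token: str) -> str:
    """Drop Hebrew nonspacing marks (assumed range U+0591..U+05C7)."""
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", token)
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


class ToyIndex:
    """Whitespace-tokenized inverted index that folds tokens on both sides."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        # Index time: store each token under its folded form.
        for token in text.split():
            self.postings[fold(token)].add(doc_id)

    def search(self, query: str) -> set[str]:
        # Query time: fold the query the same way before the lookup.
        return set(self.postings.get(fold(query), set()))


pointed = "\u05D4\u05B7\u05D2\u05B6\u05BC\u05D1\u05B6\u05E8"  # with niqqud
plain = "\u05D4\u05D2\u05D1\u05E8"  # without niqqud

index = ToyIndex()
index.add("wikisource-page", pointed)
index.add("wikipedia-page", plain)

# Either spelling of the query reaches both documents.
print(index.search(pointed) == index.search(plain))
```

Because both sides go through the same `fold`, a query in either spelling lands on the same postings list, which is why checking one direction is usually enough.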

debt triaged this task as Medium priority. Jun 28 2017, 6:09 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt subscribed.

Moving to Up Next, as @TJones notes, this *should* be fixed with the deployment of T167058.

I believe this is fixed now. Searching Hebrew Wikipedia for הַגֶּבֶר gives results that include הגבר. The opposite is also true, though Wikisource is a better place to test it out, since niqqud are more common there.

debt claimed this task.

Cool, thanks for verifying, @TJones