Enable ICU folding on Hebrew wikis
Closed, Resolved (Public)

Description

Author: hhielscher

Description:
Search should ignore combining diacritical marks.
http://en.wikipedia.org/wiki/Combining_diacritical_mark

For example, searching for Александр Сергеевич Пушкин should return pages containing Алекса́ндр
Серге́евич Пу́шкин as well as pages containing Александр Сергеевич Пушкин.
Conversely, searching for Алекса́ндр Серге́евич Пу́шкин should also find
pages that contain only the unaccented Александр Сергеевич Пушкин.
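
The requested behavior can be sketched roughly as follows. This is only an illustration using Python's standard `unicodedata` module; the helper name `strip_combining_marks` is hypothetical and this is not how MediaWiki search is implemented.

```python
import unicodedata

def strip_combining_marks(text: str) -> str:
    """Remove combining marks that remain separate after NFC normalization."""
    composed = unicodedata.normalize("NFC", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in composed if not unicodedata.combining(ch))

# U+0301 is COMBINING ACUTE ACCENT, used here as an optional stress mark.
accented = "Алекса\u0301ндр Серге\u0301евич Пу\u0301шкин"
plain = "Александр Сергеевич Пушкин"

assert strip_combining_marks(accented) == plain
assert strip_combining_marks(plain) == plain
```

Working on the NFC form means precomposed letters such as й are left alone; only marks that stay separate, like the stress accent, are dropped.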


Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:17 PM
bzimport set Reference to bz1836.
bzimport added a subscriber: Unknown Object (MLST).

a.lukyanov wrote:

Yes, it would be very useful — for instance, for Russian words (which are
optionally accentuated with the "combining acute accent"), and also for Arabic
and Hebrew words (where vowels are optionally indicated with marks over/under
the consonant letters).

This is essential for Vietnamese, in which most words have accent marks. New
users expect the search system to strip the diacritical marks (and also
understand Đ/đ ↔ D/d), but when it doesn't, the user is led to believe that we
don't have the article they're looking for.

Perhaps the search function should ignore diacritics in article titles when the
user has entered a query that contains no diacritics. If the user has entered in
diacritics, the software should respect that. It would also be nice if there
were a MediaWiki message in which a list of diacritics could be customized per
wiki or locale, since different languages distinguish letters and diacritics
differently.

I've filed a separate Bug 5752 for the issue I described in Comment 2, since
article titles at vi: use precomposed characters, which should nonetheless be
converted to the base ASCII characters when searching.
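
As a rough sketch of why precomposed Vietnamese titles need more than mark stripping: most precomposed letters decompose into a base letter plus marks, but Đ/đ has no canonical decomposition and needs an explicit alias. The function name `fold_vietnamese` and the alias table are assumptions for illustration only.

```python
import unicodedata

# Đ/đ do not decompose under NFD, so they are mapped explicitly.
ALIASES = str.maketrans({"Đ": "D", "đ": "d"})

def fold_vietnamese(text: str) -> str:
    """Decompose precomposed letters, drop the marks, then apply aliases."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(ALIASES)

assert fold_vietnamese("Việt Nam") == "Viet Nam"
assert fold_vietnamese("Đà Nẵng") == "Da Nang"
```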

  • Bug 5752 has been marked as a duplicate of this bug.

Changing summary to include the issue discussed at Bug 5752, which Brion wants
to merge with this bug.

hhielscher wrote:

About precomposed characters: Mac OS X avoids the problem by using a special
kind of Unicode for filenames (UTF-8-MAC), where precomposed characters are
always converted to their decomposed variants.

hhielscher wrote:

see Unicode Standard Annex #15: Unicode Normalization Forms for details:
http://www.unicode.org/reports/tr15/

We know what Unicode is, thanks. :) MediaWiki already transforms all input
to NFC and includes a normalization conversion library built-in.
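
To illustrate the distinction being discussed, here is a small sketch using Python's standard `unicodedata` module (not MediaWiki's own normalization library). It shows that NFC and NFD only change how an accented letter is encoded; neither form removes the accent, so normalization alone does not make search diacritic-insensitive.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFC composes, NFD decomposes; both represent the same text.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# The accent survives normalization in either form.
assert unicodedata.normalize("NFC", decomposed) != "e"
assert unicodedata.normalize("NFD", precomposed) != "e"
```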

rainman wrote:

Fixed in Lucene Search 2. Diacritics are stripped, and Đ/đ has been set as an alias of D/d in Vietnamese. This also includes Hebrew pointing.

If you feel that stripping all diacritics is wrong for your language, reopen this bug.

reopened the task.

apparently, something was changed in the search machinery (actually, afaik, _a lot_ has changed), and now the search is diacritics-sensitive again.

REPRODUCTION:
1 - go to Hebrew Wikisource ( https://he.wikisource.org )
2 - search for "חַרְבָּא" (a word including some Hebrew diacritics)
3 - notice 57 results
4 - search for "חרבא" (same word sans the diacritics)
5 - notice 104 search results (it seems the results are mutually exclusive)

DESIRED RESULT:
both searches should yield the same result set; this result set should include any diacritic'ed form of the word, including the naked form, so it should include at least 161 results, and possibly more (where the word was perhaps mis-diacritic'ed).

peace.

Note that the search is almost non-functional in he.wikisource (this affects mainly Wikisource because of its heavy use of diacritics).

eranroz added a subscriber: TJones.

Trey did some interesting analysis of the effects of stripping diacritics in the search:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia

Based on Trey's conclusion, I think it may be handled as part of the reindexing of all wikis in the BM25 task ( T139575 ).

Currently we have only enabled ICU folding for en, fr and Greek wikis. ASCII folding has no effect on Hebrew, as it supports only Latin characters. The corresponding analysis from Trey is https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Upgrading_ASCII_Folding_to_ICU_Folding_for_French_and_English

Note that ICU folding is very new, and I don't understand why something would have broken recently. This is extremely confusing because we reindexed hewikisource last week, but I can't find anything in the code that relates to folding of Hebrew diacritics. In fact I think the regression dates back to when cirrus replaced Lucene Search 2.
I checked on he.wikipedia.beta.wmflabs.org, which has not been reindexed recently, and the Hebrew diacritics are not folded there.

I kept the original index settings for hewiki used before the reindex (P4621) in case someone wants to double check (maybe I overlooked something...)
If this is a recent regression I have no clue how this could have happened...

Anyway, I'd suggest activating ICU folding for Hebrew wikis; if everyone is OK with it, it should be pretty straightforward to do.
Note that we usually do not fold the input search query except in rare cases, e.g. for French:

  • searching for élément does not find element
  • searching for element finds élément

Applied to Hebrew:

  • searching for חַרְבָּא will not find חרבא
  • searching for חרבא will find חַרְבָּא
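
The asymmetric behavior in the two lists above can be modeled with a toy index that stores each token in both its original and its folded form, while the query is looked up as typed. This is only a sketch under that assumption; the names `fold`, `build_index` and `search` are hypothetical and this is not CirrusSearch's actual analysis chain.

```python
import unicodedata
from collections import defaultdict

def fold(token: str) -> str:
    """Decompose the token and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(docs: dict) -> dict:
    """Index every token under its original form and its folded form."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            token = unicodedata.normalize("NFC", token)
            index[token].add(doc_id)
            index[fold(token)].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Look the query up as typed (normalized to NFC, but not folded)."""
    return set(index.get(unicodedata.normalize("NFC", query), set()))

# Pointed and unpointed forms of the Hebrew word from the reproduction steps.
pointed = "\u05d7\u05b7\u05e8\u05b0\u05d1\u05b8\u05bc\u05d0"
unpointed = "\u05d7\u05e8\u05d1\u05d0"

index = build_index({1: "élément", 2: "element", 3: pointed, 4: unpointed})

assert search(index, "element") == {1, 2}   # plain query finds both forms
assert search(index, "élément") == {1}      # accented query stays exact
assert search(index, unpointed) == {3, 4}
assert search(index, pointed) == {3}
```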

I can set up a test index somewhere in labs if someone is willing to evaluate these new settings before applying them to production.

If it's as simple as setting a flag to turn ICU folding on and doing a reindex, then we can do this soon.
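
As a hedged sketch of what such a flag might translate to at the Elasticsearch level, here are index settings written as a Python dict. The `icu_folding` token filter comes from the Elasticsearch `analysis-icu` plugin; the analyzer name `text_folded` and the filter chain are assumptions for illustration, not the configuration CirrusSearch actually generates.

```python
import json

# Hypothetical analysis settings: a custom analyzer that lowercases tokens
# and then applies ICU folding (which removes diacritics, among other things).
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "text_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "icu_folding"],
                }
            }
        }
    }
}

print(json.dumps(settings, indent=2))
```

Analysis settings are applied when documents are indexed, which is why a reindex is needed before existing pages pick up the new folding.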

Deskana renamed this task from Strip combining diacritical marks and convert precomposed characters when searching to Enable ICU folding on Hebrew wikis. Dec 15 2016, 11:05 PM
Deskana moved this task from needs triage to Up Next on the Discovery-Search board.

OK, I'll create a few test indices (probably he.wikisource and he.wikipedia) today or early next week.
Due to the code freeze, the change may not land in production very soon, which should give us time to experiment and test whether these new settings have any major drawbacks for Hebrew.

Change 328170 had a related patch set uploaded (by DCausse):
Enable ICU folding for hebrew

https://gerrit.wikimedia.org/r/328170

Two indices are available for testing in labs:

NOTE: these wikis run in our relforge test instance, which means that you can only search: you can list search results, but you cannot actually click through to them.

Please let me know if you encounter any undesired behaviors related to diacritics.

@eranroz @Kipod Can you please look at @dcausse's comment above and test for us? If things are working as you'd expect, then knowing that helps us a lot. :-)

@Wikitiki89 Perhaps you could also help us test this, given the conversation we had on T132637? :-)

Yay it works! Searching for "חרבא" (without diacritics) also gives results for "חַרְבָּא" (i.e. with diacritics):
http://hewikisource-relforge.wmflabs.org/w/index.php?title=מיוחד:חיפוש&limit=500&offset=0&profile=default&search=חרבא

@dcausse thank you for the quick fix

@eranroz @Kipod Can you please look at @dcausse's comment above and test for us? If things are working as you'd expect, then knowing that helps us a lot. :-)

looks good to me. as per @dcausse's comment from dec. 15, searching for a word without diacritics finds all forms of it, and searching for a specifically diacritic'ed word finds only occurrences of the diacritic'ed word, as typed (and all of those!).
i think this is the best possible solution, and it's great that it's technically possible.

as to the comment

"In fact I think the regression dates back to when cirrus replaced Lucene Search 2."

i am almost certain that this is exactly so.

thank you so much for the prompt solution of the problem.
i will ask a friend on arwiki if icu folding should be applied there too.

peace.

Change 328170 merged by jenkins-bot:
Enable ICU folding for hebrew

https://gerrit.wikimedia.org/r/328170

@Wikitiki89 Perhaps you could also help us test this, given the conversation we had on T132637? :-)

It seems to work for the few random searches I just tested out. However, I would personally have preferred that searching for הרבה and הַרְבֵּה (without quotes) should both ignore diacritics, because searching with quotes can always be used to force exact diacritic matching. Another potentially useful possibility would be to have a search term with a diacritic return results that have at least that diacritic, but possibly more (for example, searching for הרְבה would find הַרְבֵּה and הִרְבָּה, but not הרבה or הָרַבָּה). This would all be useful for Arabic as well.
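
The "at least that diacritic" idea can be sketched as a per-letter subset test: the base letters must match, and every mark typed in the query must be present on the same letter in the candidate. This is only an illustration of the suggestion, with hypothetical helper names `clusters` and `matches_at_least`; it is not an existing search feature.

```python
import unicodedata

def clusters(word: str) -> list:
    """Split a word into (base letter, set of combining marks) pairs."""
    result = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            result[-1][1].add(ch)       # attach the mark to the previous base
        else:
            result.append((ch, set()))  # start a new base letter
    return result

def matches_at_least(query: str, candidate: str) -> bool:
    """True if the candidate has the same letters and at least the query's marks."""
    q, c = clusters(query), clusters(candidate)
    return len(q) == len(c) and all(
        q_base == c_base and q_marks <= c_marks
        for (q_base, q_marks), (c_base, c_marks) in zip(q, c)
    )

# The example words from the comment above, spelled out as code points.
query = "\u05d4\u05e8\u05b0\u05d1\u05d4"                          # sheva on resh only
harbe = "\u05d4\u05b7\u05e8\u05b0\u05d1\u05b5\u05bc\u05d4"        # has the sheva
hirba = "\u05d4\u05b4\u05e8\u05b0\u05d1\u05b8\u05bc\u05d4"        # has the sheva
bare = "\u05d4\u05e8\u05d1\u05d4"                                 # no marks at all
haraba = "\u05d4\u05b8\u05e8\u05b7\u05d1\u05b8\u05bc\u05d4"       # patah on resh

assert matches_at_least(query, harbe)
assert matches_at_least(query, hirba)
assert not matches_at_least(query, bare)
assert not matches_at_least(query, haraba)
```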

@Wikitiki89 I'm glad it's an improvement, at least. Since we enabled ICU folding, which was the scope of this task, I'm closing this task as resolved. Please do feel free to file tasks for any requests you have for further modifications, and they can be considered. Thanks for helping us out! :-)