
Search box needs some normalization for Arabic Family languages
Open, Medium, Public, Feature

Description

Some languages, such as Arabic, Persian, Urdu, and Kurdish, share common characters whose glyphs look similar but have different Unicode code points. For example:
for ک (Kaf)
ك Arabic U+0643
ڪ Urdu U+06AA
ﻙ Pushtu U+FED9
ﻚ Uyghur U+FEDA
ک Persian U+06A9
for ی (ya)
ی Persian U+06CC
ي Arabic U+064A
ى Urdu U+0649
ۍ Pushtu U+06CD
ې Uyghur U+06D0
for ه (heh)
ہ Pushtu U+06C1
ە Kurdish U+06D5
ه Persian U+0647
These characters have different Unicode code points and are typed with different keyboards.
Many users do not have access to a Persian or Urdu keyboard by default in their OS (for example Windows XP, older versions of Android, iOS). So when they search for an article, they cannot find it through the Wikipedia search box, even though it exists under a title written with the local characters.

For example, if you search fa.wikipedia for the article ويليام شكسپير (typed with the Arabic characters ي and ك), you cannot find it, because the Farsi article is titled ویلیام شکسپیر (written with the Persian characters ی and ک).

For Farsi, please make it possible for the search tool to treat:
U+064A or U+0649 or U+06CD or U+06D0 or U+06CC > U+06CC
U+0643 or U+06AA or U+FED9 or U+FEDA > U+06A9
U+06C1 or U+06D5 > U+0647
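
As an illustration only, the requested folding can be sketched in Python. The code points come from the list above; the names `FA_FOLD` and `normalize_fa` are hypothetical, and this is not how CirrusSearch implements normalization.

```python
# Minimal sketch of the Farsi folding requested above.
# FA_FOLD and normalize_fa are hypothetical names used for illustration.

# Each variant code point maps to the Persian code point it should be treated as.
FA_FOLD = {
    # ya variants -> U+06CC
    0x064A: 0x06CC,
    0x0649: 0x06CC,
    0x06CD: 0x06CC,
    0x06D0: 0x06CC,
    # kaf variants -> U+06A9
    0x0643: 0x06A9,
    0x06AA: 0x06A9,
    0xFED9: 0x06A9,
    0xFEDA: 0x06A9,
    # heh variants -> U+0647
    0x06C1: 0x0647,
    0x06D5: 0x0647,
}


def normalize_fa(text: str) -> str:
    """Replace each listed variant with its Persian target; leave other characters alone."""
    return text.translate(FA_FOLD)


if __name__ == "__main__":
    # Arabic kaf + Arabic ya, as typed on an Arabic keyboard.
    query = "\u0643\u064A"
    print([f"U+{ord(c):04X}" for c in normalize_fa(query)])
```

Applying the same function to both the query and the indexed titles would make the two spellings of the example article compare equal.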


Version: unspecified
Severity: enhancement

Details

Reference
bz70899

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:47 AM
bzimport set Reference to bz70899.
bzimport added a subscriber: Unknown Object (MLST).

Yes, we have the same problem on ckb Wikipedia. This would be useful.

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

(In reply to Andre Klapper from comment #4)

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

We need normalization for the search box at the top of each page.

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

I meant CirrusSearch

I'm having a hard time understanding the scope of this task. Could @TJones help? :-)

We need to specify the list of languages we are trying to do this for. The description mentions Persian, but alludes to Arabic, Urdu, Kurdish, Pushtu, and Uyghur, and the comments mention Sorani (ckb).

For each language, we'd need to figure out the exact mappings. (Persian is listed above, but I'd want to double check that the correspondences work in the other direction for each language, and there may be other glyphs that are relevant to other languages but not relevant to Persian, and so not listed here.) There may be more detail available in the GitHub repos listed but I haven't looked closely.

Then we have to figure out which language analyzers are being used for each language. Arabic, Persian, and Sorani have their own analyzers. The fallbacks (T147959), though very imperfect, are the status quo, so we'd have to see what's going on there.

For each analyzer being used (possibly including the default), we'd need to unpack the built-in ES analyzer so we can modify it. Doing this for French and others has given unexpected results—generally not bad, mostly improvements, with the few regressions being readily fixable. Figuring all that out requires testing per language, and I'd really want to be careful the first time we did it to an Arabic-script analyzer.

After the unpacking, actually setting up the mapping is very little work.
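
To give a rough idea of what that mapping step might look like, here is a sketch that builds Elasticsearch analysis settings containing a `mapping` character filter for the Persian correspondences in the description. The names `fa_char_fold`, `fa_folded`, and `build_analysis_settings` are made up for illustration, and the token filter list is a placeholder: a real unpacked analyzer would carry the language-specific filters, which are not shown here.

```python
import json

# Target character -> variants to fold into it (code points from the task description).
FOLDS = {
    "\u06CC": ["\u064A", "\u0649", "\u06CD", "\u06D0"],  # ya
    "\u06A9": ["\u0643", "\u06AA", "\uFED9", "\uFEDA"],  # kaf
    "\u0647": ["\u06C1", "\u06D5"],                      # heh
}


def build_analysis_settings(folds):
    """Build index analysis settings with a mapping char filter (hypothetical names)."""
    # The mapping char filter takes rules written as "source => target".
    rules = [f"{src} => {dst}" for dst, srcs in folds.items() for src in srcs]
    return {
        "analysis": {
            "char_filter": {
                "fa_char_fold": {"type": "mapping", "mappings": rules},
            },
            "analyzer": {
                "fa_folded": {
                    "type": "custom",
                    "char_filter": ["fa_char_fold"],
                    "tokenizer": "standard",
                    # Placeholder: the unpacked analyzer's own filters would go here.
                    "filter": ["lowercase"],
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_analysis_settings(FOLDS), indent=2))
```

Because the character filter runs before tokenization, the same folding applies at index time and at query time, which is why the affected wikis have to be re-indexed before the change is visible.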

Once that's done, the wikis in question need to be re-indexed. Arabic Wikipedia has already been done for BM25, and others may happen before we work on this, in which case the change going live could take a while—until the next time we re-index. (Though that's something we need to get better at being able to do, and doing it for the projects in a handful of languages is less effort than doing it for almost everything, as we are with BM25.)

Does that help?

TJones lowered the priority of this task from Medium to Low. Aug 27 2020, 8:04 PM
Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:12 AM
Restricted Application added a subscriber: Huji. Feb 4 2022, 11:12 AM
TJones raised the priority of this task from Low to Medium. Jul 28 2022, 3:26 PM