
וי (U+05D5 vav, U+05D9 yod) doesn't find ױ (U+05F1 Yiddish vav yod)
Closed, Resolved · Public · 2 Estimated Story Points · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

The search doesn't find anything.

What should have happened instead?:

The search should have found https://www.wikidata.org/wiki/Lexeme:L588703

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

This seems to affect all of the Yiddish ligatures:

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as Low priority. Apr 15 2024, 3:25 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
TJones set the point value for this task to 2.
TJones moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

In reading up on the ligatures, I found another ligature (yod-yod-patah ײַ) that has several variants, one using a ligature from above (double-yod + patah ײַ), one with separate characters (yod + yod + patah ייַ), and a less common variant with the patah in the middle (yod + patah + yod יַי). It looks like icu_normalizer already converts the single-character form (ײַ) to one using the double-yod ligature (ײַ).
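For anyone who wants to poke at the normalization behavior outside Elasticsearch, here is a minimal sketch using Python's standard `unicodedata` module. This is an assumption-laden stand-in: it is not ICU and not the `icu_normalizer` filter itself, it only illustrates that standard Unicode normalization decomposes the single-character form (U+FB1F) into double-yod + patah (U+05F2 U+05B7) while leaving the plain ligatures (U+05F0 to U+05F2) untouched. Code points are written as escapes so the visually identical forms stay distinguishable.

```python
import unicodedata

YOD_YOD_PATAH = "\uFB1F"           # single-character ligature
DOUBLE_YOD_PATAH = "\u05F2\u05B7"  # double-yod ligature + patah
VAV_YOD = "\u05F1"                 # Yiddish vav yod ligature


def codepoints(text: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# U+FB1F has a canonical decomposition and is excluded from recomposition,
# so even NFC rewrites it as double-yod + patah.
nfc_form = unicodedata.normalize("NFC", YOD_YOD_PATAH)
print(codepoints(nfc_form))

# The plain ligatures have no decomposition at all, so normalization
# (even NFKC) never splits them into vav + yod.
nfkc_form = unicodedata.normalize("NFKC", VAV_YOD)
print(codepoints(nfkc_form))
```

The second result is the heart of the bug: normalization alone never turns U+05F1 into U+05D5 U+05D9, so an explicit mapping is needed.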

Even using an insource regex query, I can't separate yod-yod-patah (ײַ) and double-yod + patah (ײַ) on-wiki—there may be another level of mapping happening in the browser or another layer of the Mediawiki software.. though I can type them separately here, so it's not the browser. Regex searches for either return 1392 results on Yiddish Wikipedia, while the more common decomposition yod + yod + patah (ייַ) gets 629 results, and the less common yod + patah + yod (יַי) gets 19. So all typable variants are in use.

I added a character filter mapping the ligatures to the component pieces, taking yod + yod + patah (ייַ) as the canonical decomposition of variants of yod-yod-patah (ײַ).
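A rough sketch of what such a mapping does, written in Python rather than as the actual CirrusSearch or Elasticsearch configuration. The table contents follow my reading of the description above; the names `YIDDISH_LIGATURE_MAP` and `decompose_yiddish_ligatures` are hypothetical and do not come from the patch. The replacement runs in a single pass with the longest source sequence tried first, which is an assumption about how a mapping character filter would behave.

```python
import re

# Hypothetical mapping table (illustrative, not copied from the patch).
YIDDISH_LIGATURE_MAP = {
    "\u05F0": "\u05D5\u05D5",                    # double vav -> vav + vav
    "\u05F1": "\u05D5\u05D9",                    # vav yod -> vav + yod
    "\u05F2": "\u05D9\u05D9",                    # double yod -> yod + yod
    "\uFB1F": "\u05D9\u05D9\u05B7",              # yod-yod-patah -> yod + yod + patah
    "\u05D9\u05B7\u05D9": "\u05D9\u05D9\u05B7",  # yod + patah + yod -> yod + yod + patah
}

# Try longer source sequences first so multi-character keys are not shadowed.
_LIGATURE_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(YIDDISH_LIGATURE_MAP, key=len, reverse=True))
)


def decompose_yiddish_ligatures(text: str) -> str:
    """Replace each mapped sequence in one left-to-right pass over the input."""
    return _LIGATURE_PATTERN.sub(lambda match: YIDDISH_LIGATURE_MAP[match.group(0)], text)


canonical = "\u05D9\u05D9\u05B7"  # yod + yod + patah
variants = ["\uFB1F", "\u05F2\u05B7", "\u05D9\u05D9\u05B7", "\u05D9\u05B7\u05D9"]
print(all(decompose_yiddish_ligatures(v) == canonical for v in variants))
```

Double-yod + patah (U+05F2 U+05B7) needs no entry of its own here: the U+05F2 entry expands the ligature and the following patah is left in place, which yields the same canonical sequence.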

I made the mapping global, since the difference is usually invisible to the reader or searcher, and they may have copied something from some other source without realizing it. In testing, examples of the Yiddish ligatures showed up in my small samples from Alemannic, German, Hebrew, and Russian Wikipedias.

Of course, there were lots of (visually identical) mergers in my Yiddish Wikipedia sample, as expected and hoped for. Hundreds for most of the variants, but only a (non-zero!) handful for the oddball yod + patah + yod (יַי), and zero for yod-yod-patah. I'm leaving the yod-yod-patah mapping in place, though, as a backstop just in case it ever shows up. (I can use it on the command line, so it's possible to get it to Elasticsearch in some circumstances—might as well do the right thing if we see it.)

Pretty much the same write-up is on MediaWiki.

Patch incoming shortly.

Change #1022030 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Enable Yiddish Ligature Mappings

https://gerrit.wikimedia.org/r/1022030

Change #1022030 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Enable Yiddish Ligature Mappings

https://gerrit.wikimedia.org/r/1022030