
Cannot search partial Javanese script titles
Open, Medium | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Searching Commons for the full book title "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀ꦭꦩꦶ" works: it finds the PDF.
  • Searching for a partial title such as "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀", "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁ", or "ꦥꦼꦥꦼꦛꦶꦏ꧀", or for any other part of the full title, does not work.

What happens?:

  • This was first reported 10 years ago in T46350, which was marked "won't fix" because it was filed during the migration from Lucene to CirrusSearch.
  • I identified that the problem comes from the scriptio continua nature of Javanese script (there is no word marker).
  • The CirrusSearch ticket T58505 was closed as resolved.

What should have happened instead?:
Commons search (and search on the other projects) should be able to find the file from parts of the full title.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

A bit background on the way the script is written:

  • Scriptio continua is a hassle to display on the web, because there is no obvious line-break opportunity. In projects such as Wikisource, if it is not handled properly, it breaks the page view of the transcribed documents (lines become too wide).
  • The way we handle the problem, including on certain keyboards, is to automatically insert a ZWS (zero-width space) after certain characters (e.g. comma, period, etc.); I can give you the full list. Line breaking then still works, except in very rare cases where none of those characters occur (so no ZWS is auto-inserted). [A ZWS does not always correspond to a Latin space.]
  • AFAIK ZWS is not supported in page titles (e.g. when I upload books with Javanese script titles that contain ZWS), so none of the titles in the jv wiki projects contain ZWS, and thus CirrusSearch cannot know where the word delimiters are.

Event Timeline

I have a bit of technical info on ZWSs.

There are pages with ZWSs in their titles, though the titles are redirects. I found one on Commons and two on English Wikipedia (and zero on Javanese Wikipedia). I had to use regexes to search, so results are incomplete.

I was also able to create a new page title with ZWSs:

Note that you usually can't see the ZWSs, but they are there in the URLs (%E2%80%8B).
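For anyone who wants to check a title for hidden ZWSs, here is a minimal Python sketch of where the %E2%80%8B comes from: it is the percent-encoded UTF-8 form of U+200B. The title used here is made up for illustration and is not the actual test page.

```python
from urllib.parse import quote, unquote

ZWS = "\u200b"  # ZERO WIDTH SPACE

# Hypothetical title, used only for illustration.
title = f"Zero{ZWS}Width"

encoded = quote(title)
print(encoded)  # Zero%E2%80%8BWidth

# Going the other way: decode a URL fragment and look for the invisible character.
decoded = unquote(encoded)
print(ZWS in decoded, decoded.count(ZWS))
```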

Cirrus does treat ZWSs as spaces (at least the standard and ICU tokenizers split on them). However, adding them to titles does mean that searches without them would fail. So, searching for zerowidthspacetest on mediawiki.org doesn't find my test page (searching with capitals works, but that's because we split on CamelCase in English-language contexts, which wouldn't help with Javanese).
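To make the mismatch concrete, here is a toy Python sketch. It is not how Cirrus or the standard/ICU tokenizers are implemented; `simple_tokens` and `camel_tokens` are hypothetical stand-ins, and the title is assumed rather than copied from the real test page.

```python
import re

ZWS = "\u200b"

def simple_tokens(text):
    """Toy tokenizer: split on whitespace and ZWS, then lowercase."""
    return [t.lower() for t in re.split(rf"[\s{ZWS}]+", text) if t]

def camel_tokens(text):
    """Toy tokenizer that additionally splits each chunk on CamelCase."""
    out = []
    for chunk in re.split(rf"[\s{ZWS}]+", text):
        out.extend(p.lower() for p in re.findall(r"[A-Z]?[a-z]+", chunk))
    return out

# Assumed title with a ZWS between the words.
title = f"Zero{ZWS}Width{ZWS}Space{ZWS}Test"

indexed = set(simple_tokens(title))
print(indexed & set(simple_tokens("zerowidthspacetest")))  # empty: no shared token
print(indexed & set(camel_tokens("ZeroWidthSpaceTest")))   # all four words match
```

The lowercase query stays one long token, so it shares nothing with the indexed title; only the CamelCase split recovers the words, and Javanese has no case to split on.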

I don't know what kinds of regularization and normalization happen during file upload, and I can imagine a well-intentioned automated process that removes ZWSs (though I think we agree that it should convert them to spaces).

If there is a list of punctuation marks where ZWSs are automatically inserted by Javanese-savvy systems, we can try to replicate that in the language analyzers in Cirrus. Since the punctuation is specifically Javanese, I would argue that we should enable it either everywhere, or at least on Javanese (obviously relevant) and English (often used as the default on multilingual/multi-script sites like Commons). I could imagine a global_punctuation filter that adds spaces after punctuation in any script where the standard or ICU tokenizer doesn't recognize them as punctuation. (I'm tempted to throw a \p{P} in there and call it a day, but those Unicode regex properties are never quite 100% what you expect them to be.)
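As a rough idea of what the "throw a \p{P} in there" version could look like, here is a Python sketch that adds a space after any character whose Unicode general category starts with P. The function name is hypothetical, and it is deliberately cruder than the filter described above: it does not check whether the tokenizer already handles a given mark.

```python
import unicodedata

ZWS = "\u200b"

def space_after_punctuation(text):
    """Insert a space after each punctuation character (category P*),
    unless whitespace or a ZWS already follows it."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if unicodedata.category(ch).startswith("P"):
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt and not nxt.isspace() and nxt != ZWS:
                out.append(" ")
    return "".join(out)

print(space_after_punctuation("a,b"))   # a, b
print(space_after_punctuation("3.14"))  # 3. 14
```

The second call shows why a blanket rule needs care: it also splits things like decimal numbers, so a real filter would probably have to be restricted to specific scripts or marks.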

Thank you for the insight.

I had this issue as well when I created my Javanese transliterator: where to insert ZWSs so that line breaks would work (semi-)naturally.
It's not always possible to put the ZWS at the end of a word (as with a Latin space), because of the rule that a syllable ending in a "virama" followed by a syllable starting with a consonant is written merged together. (If you are not familiar with how Indic-derived scripts work, think of it like French liaison, only in writing.) So for a phrase like "pethikan saking", the components are written as /pe thi kka nsa king/ ꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁ. Compare the form of /nsa/ when it is separated by a ZWS: ꦤ꧀​ꦱ, which looks totally different. Therefore putting a ZWS after a virama is generally frowned upon.

Luckily, there are at least 3 consonant endings that are not using virama, namely: -ng, -h, and -r. So for words like "saking", "omah", and "anyar" for example "ꦱꦏꦶꦁ", "ꦲꦺꦴꦩꦃ", "ꦲꦚꦂ", I put a ZWS after each "ꦁ", "ꦃ", and "ꦂ". (Although those three are merely syllable-ending, not strictly word endings, so adding them to a word like "angkringan" ꦲꦁ​ꦏꦿꦶꦁ​ꦔꦤ꧀​ wouldn't change anything visually, but would technically break them into 3 tokens: "ang", "kring", and "ngan".)
So it would be great if those three, plus some others such as "꧈​" (comma-like separator) and "꧉" (period-like separator), although they occur quite rarely in titles, could be added to the "global_punctuation" filter for the tokenizer.
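A minimal Python sketch of the rule described here: insert a ZWS after the three syllable-final signs and the two separators, and never after a virama. The code points are my reading of the characters mentioned (U+A981 cecak for -ng, U+A982 layar for -r, U+A983 wignyan for -h, U+A9C8 and U+A9C9 for the separators); the function names are hypothetical, and this is not what any existing keyboard or Cirrus filter does.

```python
import re

ZWS = "\u200b"

# Characters after which a ZWS is inserted (assumed list, see above).
BREAK_AFTER = "\ua981\ua982\ua983\ua9c8\ua9c9"

# Match one of those characters unless a ZWS already follows it.
_PATTERN = re.compile(f"([{BREAK_AFTER}])(?!{ZWS})")

def insert_zws(text):
    """Add a ZWS after each break character; applying it twice changes nothing."""
    return _PATTERN.sub(r"\1" + ZWS, text)

def tokens(text):
    """Split on ZWS after insertion, roughly as a ZWS-aware tokenizer would."""
    return [t for t in insert_zws(text).split(ZWS) if t]

# "angkringan": ha + cecak, ka + cakra + wulu + cecak, nga + na + pangkon
angkringan = "\ua9b2\ua981\ua98f\ua9bf\ua9b6\ua981\ua994\ua9a4\ua9c0"
print(len(tokens(angkringan)))  # 3
```

As far as I can tell, Unicode classifies the three syllable-final signs as combining marks rather than punctuation, so a rule based only on \p{P} would likely miss them and they would need to be listed explicitly.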

MPhamWMF triaged this task as Medium priority. Feb 27 2023, 4:28 PM
MPhamWMF moved this task from needs triage to Language Stuff on the Discovery-Search board.