Crimean Tatar Transliteration doesn't handle mixed script words
Closed, Declined (Public)

Description

Crimean Tatar transliteration doesn't handle mixed script words—should it?

I intentionally put a pre-filter on the Crimean Tatar transliteration so that only words made up entirely of letters in the Crimean Tatar alphabet we're transliterating from get transliterated. For example, the Crimean Tatar Latin alphabet has no W, so any Latin word with a W in it should not get transliterated, since it is very likely a foreign word or name. More specifically, when transliterating from Latin to Cyrillic, only words composed entirely of Crimean Tatar Latin characters are transliterated.
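As a rough illustration of that pre-filter (not the actual transliteration code), the check could look something like the Python sketch below. The alphabet strings and the function name `should_transliterate` are assumptions made for this example, as is the decision to accept "â".

```python
import unicodedata

# Assumed Crimean Tatar Latin alphabet. Upper and lower case are listed
# separately because dotted/dotless I do not round-trip through str.lower().
# Including "â" is an assumption made for this sketch.
CRH_LATIN_LOWER = "abcçdefgğhıijklmnñoöpqrsştuüvyzâ"
CRH_LATIN_UPPER = "ABCÇDEFGĞHIİJKLMNÑOÖPQRSŞTUÜVYZÂ"
ALLOWED = set(unicodedata.normalize("NFC", CRH_LATIN_LOWER + CRH_LATIN_UPPER))


def should_transliterate(word: str) -> bool:
    """Return True only if every character is in the assumed alphabet."""
    word = unicodedata.normalize("NFC", word)
    return bool(word) and all(ch in ALLOWED for ch in word)


if __name__ == "__main__":
    # "Washington" contains W, and the third word hides a Cyrillic о (U+043E).
    for w in ["Ümer", "Washington", "Krak\u043ev"]:
        print(w, should_transliterate(w))
```

With a filter like this, a single wrong-script letter is enough to keep an otherwise ordinary word from being transliterated.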

However, some letters in Cyrillic and Latin look more or less the same, depending on the font (e.g., Cyrillic а, е, о vs. Latin a, e, o), and so they can be accidentally mixed in the article text. When this happens, the word containing the mix of Cyrillic and Latin letters is blocked from being transliterated (in either direction).

Examples:

  • Lаtin Cyrilliс Kиpил (some letters are not in the script you would think they are in: the "а" in the first word and the final "с" in the second are Cyrillic, and the "K" and "p" in the third are Latin).
  • Real-world examples on crh wiki include Ümer İpçi. The page appears to be in Latin text, and searching the page for Latin e gives me 282 hits (including text in my English UI), but searching for Cyrillic е still gives 152 hits, including one in the article subject's name in the first sentence of the second paragraph.
  • It happens in article titles, too: Sоllar, Sadоvоye, Krakоv—all the o's in the article titles are in Cyrillic.
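A minimal sketch of how such mixed-script words might be spotted in page text is shown below. It is not the prototype tool mentioned in this task; the function names are hypothetical, and classifying a character by the prefix of its Unicode name is a simplification.

```python
import re
import unicodedata

# Runs of letters only (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def script_of(ch: str) -> str:
    """Rough script lookup based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return "Other"


def mixed_script_words(text: str) -> list:
    """Return the words that contain both Latin and Cyrillic letters."""
    found = []
    for word in WORD_RE.findall(text):
        scripts = {script_of(ch) for ch in word}
        if {"Latin", "Cyrillic"} <= scripts:
            found.append(word)
    return found


if __name__ == "__main__":
    # The first word contains a Cyrillic о (U+043E) among Latin letters.
    print(mixed_script_words("Krak\u043ev Sadovoye"))
```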

Simply leaving the Cyrillic letter in place is possible, but can cause errors. Latin e is sometimes properly transliterated as Cyrillic е, and sometimes as э. Letting a wrong-script Cyrillic е remain in the transliteration could cause mistakes.

Converting letters to the "right" script and then transliterating is possible, but not always easy, since words can be close to 50/50 Latin/Cyrillic. More complex algorithms are possible, because most characters can't be confused (e.g., Kиpил is probably intended to be Cyrillic because и and л are unambiguously Cyrillic). And there are always going to be exceptions, like KoЯn.
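One way to picture the "unambiguous characters" idea is the sketch below: letters that have no look-alike in the other script vote for the intended script, and look-alikes are then mapped to it. The look-alike table is a small illustrative subset, and the names `guess_script` and `fix_word` are hypothetical. Words with no evidence, or with conflicting evidence, are left alone.

```python
import unicodedata

# Illustrative (incomplete) table of look-alike letters. Cyrillic letters are
# written as escapes so the two rows cannot be confused visually.
LATIN_LOOKALIKES = "aceopxyABCEHKMOPTX"
CYRILLIC_LOOKALIKES = (
    "\u0430\u0441\u0435\u043e\u0440\u0445\u0443"
    "\u0410\u0412\u0421\u0415\u041d\u041a\u041c\u041e\u0420\u0422\u0425"
)
LAT_TO_CYR = dict(zip(LATIN_LOOKALIKES, CYRILLIC_LOOKALIKES))
CYR_TO_LAT = dict(zip(CYRILLIC_LOOKALIKES, LATIN_LOOKALIKES))


def script_of(ch: str) -> str:
    """Rough script lookup based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return "Other"


def guess_script(word: str):
    """Guess the intended script from letters that have no look-alike."""
    evidence = {
        script_of(ch)
        for ch in word
        if ch.isalpha() and ch not in LAT_TO_CYR and ch not in CYR_TO_LAT
    }
    evidence.discard("Other")
    # Exactly one script among the unambiguous letters: trust it.
    # Zero or two (as in a word mixing Я and n): give up.
    return evidence.pop() if len(evidence) == 1 else None


def fix_word(word: str) -> str:
    """Map look-alike letters to the guessed script; otherwise leave as is."""
    target = guess_script(word)
    if target == "Cyrillic":
        return "".join(LAT_TO_CYR.get(ch, ch) for ch in word)
    if target == "Latin":
        return "".join(CYR_TO_LAT.get(ch, ch) for ch in word)
    return word


if __name__ == "__main__":
    # Latin K and p inside an otherwise Cyrillic word.
    print(fix_word("K\u0438p\u0438\u043b"))
    # Cyrillic о inside an otherwise Latin word.
    print(fix_word("Krak\u043ev"))
```

A word made up entirely of look-alike letters gives this heuristic nothing to work with, which is where a proportion-based guess or a manual fix would still be needed.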

Possible approaches include:

  • Leave it as is, because this is a problem for the CRH WP community to clean up. However, this could confuse readers who don't understand why words that look like they are in Latin don't transliterate. This approach could also be supplemented with a tool to identify mixed-script words. (I found some examples using a prototype tool that isn't yet up to the task.)
  • Transliterate mixed-script words with the existing rules, and wrong-script letters will generally be preserved (but could also interact unexpectedly with other letters); this could also confuse readers because two instances of a word (one mixed script and one not) could get transliterated differently.
  • Try to guess the intended script of mixed-script words (by proportion of characters of each type or presence of specific characters) and map the wrong ones to the right script before transliterating. This is more complicated, but would try to give the reader what they expect; unexpected errors are to be expected.

I will post this to the CRH WP Village Pump to get more feedback there.

Event Timeline

Added a link to this task in an update to the Village Pump announcement that transliteration has been enabled.

Mixed-script words appeared by mistake in some articles during the early years of our Wikipedia. I hope that there will not be any new pages with mixed-script words, so we can leave the transliterator as it is and fix the problem in our articles.
If you can help with fixing this automatically, that would be very kind of you.

I've been testing my mixed-script detection/correction tool on crh.wikipedia in my non-work time and making lots of corrections, and I will continue to do so from time to time. If that's good enough, then the transliteration tool itself doesn't need to address it.