
Crimean Tatar/crh transliteration should not block on "km²"
Closed, Resolved, Public


I added a "feature" to the Crimean Tatar transliteration that blocked the transliteration of tokens that had non–Crimean Tatar characters in them (like the name "Stewart"—there's no transliteration for "w", so it would transliterate as "Стеwарт", with the w still in the middle of the word). Similar situations exist for Cyrillic-to-Latin transliteration for Russian words with non–Crimean Tatar Cyrillic characters.

This was intended to minimize the need to use language converter–blocking -{markup}-, but it backfired here, and it's probably too hard to block in the general case, so it should be removed.

Event Timeline

Change 424738 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] CRH Transliteration Pattern Matching Fixes

If words like ‘Stewart’ are used anywhere in Crimean Tatar Wikipedia, they need to be either excluded from transliteration manually or transliterated phonetically (also manually, by the editors). Сатурдай Нигхт'с Алригхт фор Фигхтинг, We Аре тхе Джхампионс, Лове Ыс а Джриме, etc. are also badly incorrect transliterations, so if there’s a Стеwарт in there, it doesn’t make much difference (if this article is what the task is about). It would be like reading those words letter for letter in English.

@stjn—somehow I missed your comment. Sorry for the late reply.

I originally tried to implement something that would not try to transliterate words that couldn't be Crimean Tatar because they had letters in them that don't occur in either the Crimean Tatar Latin or Cyrillic alphabets. It automatically prevented transliteration of "Stewart" because of the w, but also blocked "km²", because of the ². It was intended to make things easier for editors, but there are too many combinations of non-letters to do a good job, so I've removed that logic in the patch above.
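For illustration, the removed heuristic can be sketched roughly as below. This is a hypothetical Python reconstruction, not the MediaWiki code from the patch: the alphabet sets, the `crh_lower` helper, and the `is_blocked` name are all assumptions made for the example. It shows why a character-based check cannot tell "Stewart" (which editors would want blocked) apart from "km²" (which they would not), and why it does nothing for a word like "Saturday".

```python
# Hypothetical sketch of the removed "block tokens with foreign characters"
# heuristic. The alphabets below are assumptions for illustration only.
CRH_LATIN = set("abcçdefgğhıijklmnñoöpqrsştuüvyzâ")
CRH_CYRILLIC = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")
ALLOWED = CRH_LATIN | CRH_CYRILLIC | set("'-")


def crh_lower(token: str) -> str:
    """Lowercase with Turkic dotted/dotless i handling (İ -> i, I -> ı)."""
    return token.replace("İ", "i").replace("I", "ı").lower()


def is_blocked(token: str) -> bool:
    """Return True if any character falls outside the assumed CRH alphabets."""
    return any(ch not in ALLOWED for ch in crh_lower(token))


for word in ["Stewart", "km²", "Qırım", "Saturday"]:
    print(word, is_blocked(word))
```

Under these assumptions "Stewart" and "km²" are both blocked, while "Saturday" passes the check and would still be transliterated letter for letter.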

We do something similar to automatically block transliteration of Roman numerals to Cyrillic, but I'm beginning to think that it is too complicated and we should just rely on editors to explicitly block them manually.
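To give a sense of why automatic Roman numeral detection is fiddly, here is a minimal Python sketch. It is not the pattern MediaWiki uses; the regular expression and the `looks_like_roman_numeral` name are assumptions for illustration. Even a strict pattern accepts tokens that may not be numerals at all in context, such as a single capital "I" or the string "MIX".

```python
import re

# Strict pattern for Roman numerals from 1 to 3999 (uppercase only).
# Hypothetical; the real converter's rules may differ.
ROMAN_RE = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def looks_like_roman_numeral(token: str) -> bool:
    """Return True if the whole token is a well-formed Roman numeral."""
    # The pattern also matches the empty string, so reject that explicitly.
    return bool(token) and ROMAN_RE.fullmatch(token) is not None


for token in ["XIX", "MIX", "I", "IIII", "Qırım"]:
    print(token, looks_like_roman_numeral(token))
```

A pattern like this has no way of knowing whether "I" or "MIX" is meant as a number, which is the kind of ambiguity that makes explicit markup by editors attractive.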

Yes, that is probably the preferred option for all non-standard text that should not be transliterated. Having names copy-pasted in Latin converted to Cyrillic is bad whether or not they contain ‘w’ or some other unsupported symbol, because Cyrillic readers won’t read ‘Сатурдай’ as /ˈsætədeɪ/ either. The simplest and wisest decision is to transliterate such text as-is regardless of the letters it contains, and do an outreach about the fact that you need to put text like that in special tags so that it would be readable for other writing systems.

do an outreach about the fact that you need to put text like that in special tags so that it would be readable for other writing systems.

I did post to the Crimean Tatar Village Pump about the transliteration feature and the special markup, and Don Alessandro translated it there. It's probably not enough to get everyone familiar with the process, but it's a start. The markup documentation is not available in Crimean Tatar or any other Turkic language, and the Russian translation is only 39% complete. Fortunately the most basic version of -{ }- is pretty easy to understand.
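As a rough illustration of what the most basic form of -{ }- does, the Python sketch below passes everything outside the markers through a transliteration function and copies the text inside them verbatim. It is a simplified model, not the language converter's implementation: `convert_with_protection` is a hypothetical name, `str.upper` stands in for a real transliterator, and the flags and variant-specific forms of the markup are not handled.

```python
import re
from typing import Callable

# Matches the basic -{ ... }- form only (non-greedy, no nesting or flags).
PROTECTED_RE = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def convert_with_protection(text: str, transliterate: Callable[[str], str]) -> str:
    """Transliterate text, leaving the contents of -{ }- spans untouched."""
    out = []
    pos = 0
    for match in PROTECTED_RE.finditer(text):
        out.append(transliterate(text[pos:match.start()]))  # convert outside
        out.append(match.group(1))                          # keep inside as-is
        pos = match.end()
    out.append(transliterate(text[pos:]))
    return "".join(out)


# str.upper is only a stand-in for a real Latin-to-Cyrillic converter.
print(convert_with_protection("area in -{km²}- units", str.upper))
```

With the stand-in converter, the text around the markers is changed while "km²" comes through unchanged and the markers themselves are dropped from the output.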

Now 100% complete. Sadly, I don’t know Crimean Tatar.

Whoa, dude! That's awesome! Thanks! Going by English Wikipedia, that will benefit some Crimean Tatar speakers, and Russian speakers, too, of course. I'm always impressed with how many people in the wiki communities are willing to jump in and do stuff that needs doing, even though I know that volunteering time is sort of the cornerstone of the communities in the first place. Sincere thanks!

Change 424738 merged by jenkins-bot:
[mediawiki/core@master] CRH Transliteration Pattern Matching Fixes

Working on the live CRH Wikipedia.