
Crimean Tatar/crh transliteration odds and ends
Closed, Resolved, Public

Description

A few items came up in the review of T188321 that should be addressed, but are separate (and smaller) issues than the big fixes made there:

  • refactor \b in regexes into a $wordBoundary variable so that it is easy to do something smarter and more location-aware in the future (once we figure out what that is)
  • add some new exceptions that came up from last-minute review of examples in Tatar transliteration, plus some more proper names
  • possibly figure out what to do about roman numerals. The last patch leaves roman numerals untransliterated unless they are one letter long and followed by a period (that is, unless they look like an initial). Possibilities include:
    • stop trying to be clever and stop special-casing roman numerals entirely, letting editors explicitly -{mark them}- as not to be transliterated
    • only automatically block roman numerals that are two letters or longer, which greatly cuts down on false positives
    • stick with the current system

(I'm happy with any of the roman numeral options; we just have to decide which one we want.)
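
To make the options concrete, here is a rough Python sketch (not the MediaWiki PHP code) of the current rule and the two-letters-or-longer variant. The function names and the simple `[IVXLCDM]+` check are assumptions for illustration; the actual patch may detect roman numerals differently.

```python
import re

# Assumption: a "roman numeral" is any run of these capital letters.
# No check is made that the numeral is well-formed.
ROMAN = re.compile(r"[IVXLCDM]+")


def is_roman(token: str) -> bool:
    """True if the whole token consists of roman numeral letters."""
    return ROMAN.fullmatch(token) is not None


def skip_current_rule(token: str, followed_by_period: bool) -> bool:
    """Current behaviour as described: block transliteration of a roman
    numeral unless it is one letter long and followed by a period
    (in which case it looks like an initial)."""
    if not is_roman(token):
        return False
    looks_like_initial = len(token) == 1 and followed_by_period
    return not looks_like_initial


def skip_two_plus_rule(token: str) -> bool:
    """Alternative: only block roman numerals of two or more letters."""
    return is_roman(token) and len(token) >= 2
```

Under the current rule a lone `V` is blocked while `V.` is treated as an initial. A string such as `MIX` is blocked under both rules in this sketch, which is the kind of false positive that explicit -{marking}- by editors would avoid.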

Event Timeline

New general exceptions to add:

// ц-related
'motsart' => 'моцарт', 'metsenat' => 'меценат', 'brutsel' => 'бруцел', 'pritsep' => 'прицеп',
'donitsetti' => 'доницетти', 'yatsenük' => 'яценюк', 'epitsentr' => 'эпицентр', 'plats' => 'плац',
'bogorodits' => 'богородиц', 'pretsedent' => 'прецедент', 'spets' => 'спец', 'troits' => 'троиц',
'kontratsep' => 'контрацеп', 'şprits' => 'шприц', 'dratsena' => 'драцена', 'pretses' => 'прецес',
'mitsel' => 'мицел', 'platsen' => 'плацен', 'kotsüb' => 'коцюб', 'datsük' => 'дацюк',

// more proper names
'koreiz' => 'кореиз', 'boris' => 'борис'

An additional suffix exception to add:

'itsın' => 'ицын'
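
A minimal Python sketch of how exceptions like these might be applied ahead of the regular letter-by-letter mapping. It assumes general exceptions are anchored at the start of a word and suffix exceptions at the end, and that matching is on lowercase text only. The task does not spell out how the real code anchors them, and `apply_exceptions` is a hypothetical name.

```python
import re

WORD_BOUNDARY = r"\b"  # stand-in for the $wordBoundary variable

# A few entries copied from the lists above.
GENERAL_EXCEPTIONS = {
    "motsart": "моцарт",
    "metsenat": "меценат",
    "boris": "борис",
}
SUFFIX_EXCEPTIONS = {
    "itsın": "ицын",
}


def apply_exceptions(text: str) -> str:
    """Replace known exception stems and suffixes; everything else is
    left in Latin script for the regular mapping (not shown here)."""
    # Assumption: general exceptions match at the start of a word.
    for latin, cyrillic in GENERAL_EXCEPTIONS.items():
        text = re.sub(WORD_BOUNDARY + re.escape(latin), cyrillic, text)
    # Assumption: suffix exceptions match at the end of a word.
    for latin, cyrillic in SUFFIX_EXCEPTIONS.items():
        text = re.sub(re.escape(latin) + WORD_BOUNDARY, cyrillic, text)
    return text
```

For example, `apply_exceptions("metsenatlar")` rewrites only the stem and leaves `lar` for the regular mapping, while a word that merely contains `motsart` in the middle is left untouched.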

Change 430928 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] [WIP] Crimean Tatar/crh transliteration odds and ends

https://gerrit.wikimedia.org/r/430928

Unless there are any thoughts on roman numerals, I'm going to change the patch from [WIP] to ready-for-review. I'm okay with punting the roman numeral discussion to later to get this out the door now.

I've dropped the roman numeral stuff from this particular task since there was no discussion on it and the boundary refactoring and new exceptions can go ahead without it.
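
For reference, the boundary refactoring amounts to something like the following Python sketch (the real change is in the PHP converter; `WORD_BOUNDARY` and `whole_word` are illustrative names, not the actual identifiers). Patterns are built from one shared boundary definition instead of hard-coding `\b` in each regex.

```python
import re

# One place to define what counts as a word boundary. Swapping this for
# a smarter zero-width assertion later would update every pattern below.
WORD_BOUNDARY = r"\b"


def whole_word(word: str, boundary: str = WORD_BOUNDARY) -> "re.Pattern[str]":
    """Compile a pattern that matches `word` only as a whole word."""
    return re.compile(boundary + re.escape(word) + boundary)


# Behaves exactly like a hard-coded \b for now.
BORIS = whole_word("boris")
KOREIZ = whole_word("koreiz")
```

Any replacement for the default would need to stay zero-width so that it does not consume the characters around the word.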

Change 430928 merged by jenkins-bot:
[mediawiki/core@master] Crimean Tatar/crh transliteration odds and ends

https://gerrit.wikimedia.org/r/430928

The change is merged, so the new exceptions should be live next week. I'll close this ticket when I can verify that they are live.

Checked today and the changes are live.

Vvjjkkii renamed this task from Crimean Tatar/crh transliteration odds and ends to zodaaaaaaa. (Jul 1 2018, 1:12 AM)
Vvjjkkii reopened this task as Open.
Vvjjkkii removed TJones as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot raised the priority of this task from High to Needs Triage. (Jul 3 2018, 2:07 AM)