Maniphest T193764

Crimean Tatar/crh transliteration odds and ends
Closed, Resolved (Public)

Assigned To

Authored By

	TJones
	May 3 2018, 4:50 PM

Description

A few items came up in the review of T188321 that should be addressed, but they are separate from (and smaller than) the big fixes made there:

refactor \b in regexes into a $wordBoundary variable so that it is easy to switch to something smarter and more location-aware in the future (once we figure out what that is)
add some new exceptions that came up from last-minute review of examples in Tatar transliteration, plus some more proper names
possibly figure out what to do about roman numerals. The last patch leaves roman numerals untransliterated unless they are a single letter followed by a period (that is, unless they look like an initial). Possibilities include:
- ~~stop trying to be clever and ignore roman numerals entirely, letting editors explicitly -{mark them}- as not to be transliterated~~
- ~~only automatically block roman numerals that are two letters or longer, which really cuts down on false positives~~
- ~~stick with the current system~~

~~(I'm happy with any of the roman numeral options—we just have to decide which one is the one we want.)~~
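As a rough illustration of the first and third items, here is a minimal Python sketch (not the MediaWiki PHP code). It assumes the boundary is kept in a single variable, here called `WORD_BOUNDARY` as a stand-in for `$wordBoundary`, and that "the current system" means: treat any run of roman numeral letters as a numeral unless it is a single letter followed by a period. The function name `roman_numeral_spans` and the pattern are hypothetical and only approximate the behaviour described above.

```python
import re

# Hypothetical stand-in for the $wordBoundary variable from the task.
# Defining it once means a smarter, location-aware boundary could later
# replace r"\b" without editing every pattern that uses it.
WORD_BOUNDARY = r"\b"

# Candidate roman numeral: a run of numeral letters between boundaries,
# optionally followed by a period. This does not validate the numeral
# (for example "IIII" or "MIX" would match); it is only a sketch.
ROMAN_CANDIDATE = re.compile(
    WORD_BOUNDARY + r"[IVXLCDM]+" + WORD_BOUNDARY + r"\.?"
)


def roman_numeral_spans(text):
    """Return (start, end) spans that should be left untransliterated."""
    spans = []
    for match in ROMAN_CANDIDATE.finditer(text):
        token = match.group(0)
        letters = token.rstrip(".")
        # A single letter followed by a period looks like an initial,
        # so it is not treated as a roman numeral.
        if len(letters) == 1 and token.endswith("."):
            continue
        spans.append((match.start(), match.start() + len(letters)))
    return spans
```

Under these assumptions, `roman_numeral_spans("XIX asır")` protects the first three characters, while `roman_numeral_spans("I. Gasprinskiy")` protects nothing. The second struck-out option (only block numerals of two letters or more) would amount to changing `+` to `{2,}` in the pattern.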

Details

	Subject	Repo	Branch	Lines +/-
	Crimean Tatar/crh transliteration odds and ends	mediawiki/core	master	+191 -163


Related Objects

Mentioned Here: T188321: CRH Transliteration pattern matching fixes

Event Timeline

TJones created this task. May 3 2018, 4:50 PM

Restricted Application added a subscriber: Aklapper. May 3 2018, 4:50 PM

@cscott and @DonAlessandro: any more thoughts on roman numerals?

New general exceptions to add:

// ц-related
'motsart' => 'моцарт', 'metsenat' => 'меценат', 'brutsel' => 'бруцел', 'pritsep' => 'прицеп', 'donitsetti' => 'доницетти', 'yatsenük' => 'яценюк', 'epitsentr' => 'эпицентр', 'plats' => 'плац', 'bogorodits' => 'богородиц', 'pretsedent' => 'прецедент', 'spets' => 'спец', 'troits' => 'троиц', 'kontratsep' => 'контрацеп', 'şprits' => 'шприц', 'dratsena' => 'драцена', 'pretses' => 'прецес', 'mitsel' => 'мицел', 'platsen' => 'плацен', 'kotsüb' => 'коцюб', 'datsük' => 'дацюк',

// more proper names
'koreiz' => 'кореиз', 'boris' => 'борис'

An additional suffix exception to add:

'itsın' => 'ицын'
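To show how exception tables like these could be applied, here is a minimal Python sketch (not the MediaWiki PHP code). It assumes general exceptions are matched at the start of a word and suffix exceptions at the end of a word, which this task does not state. The names `WORD_BOUNDARY`, `build_rules`, and `apply_exceptions` are hypothetical, and only a few of the entries above are included. Case variants are not handled.

```python
import re

# Hypothetical stand-in for the $wordBoundary variable from the task.
WORD_BOUNDARY = r"\b"

# A few of the Latin -> Cyrillic exceptions listed above.
GENERAL_EXCEPTIONS = {
    "motsart": "моцарт",
    "metsenat": "меценат",
    "boris": "борис",
}
SUFFIX_EXCEPTIONS = {
    "itsın": "ицын",
}


def build_rules(general, suffixes, boundary=WORD_BOUNDARY):
    """Compile exception tables into (pattern, replacement) pairs."""
    rules = []
    for latin, cyrillic in general.items():
        # Assumption: general entries are stems anchored at a word start.
        rules.append((re.compile(boundary + re.escape(latin)), cyrillic))
    for latin, cyrillic in suffixes.items():
        # Assumption: suffix entries are anchored at a word end.
        rules.append((re.compile(re.escape(latin) + boundary), cyrillic))
    return rules


def apply_exceptions(text, rules):
    """Apply every exception rule in order; other text is left as-is."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


RULES = build_rules(GENERAL_EXCEPTIONS, SUFFIX_EXCEPTIONS)
```

With these rules, `apply_exceptions("motsart ve boris", RULES)` returns `"моцарт ve борис"`. The untouched Latin text would be handled by the regular transliteration pass, which is outside this sketch.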

Change 430928 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] [WIP] Crimean Tatar/crh transliteration odds and ends

https://gerrit.wikimedia.org/r/430928

gerritbot added a project: Patch-For-Review. May 4 2018, 3:34 PM

Unless there are any thoughts on roman numerals, I'm going to change the patch from [WIP] to ready-for-review. I'm okay with punting the roman numeral discussion to later to get this out the door now.

I've dropped the roman numeral stuff from this particular task since there was no discussion on it and the boundary refactoring and new exceptions can go ahead without it.

TJones updated the task description. May 9 2018, 6:00 PM

Change 430928 merged by jenkins-bot:
[mediawiki/core@master] Crimean Tatar/crh transliteration odds and ends

https://gerrit.wikimedia.org/r/430928

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-05-29 (1.32.0-wmf.6)). May 23 2018, 4:00 PM

The change is merged, so the new exceptions should be live next week. I'll close this ticket when I can verify that they are live.

Checked today and the changes are live.

• Vvjjkkii renamed this task from Crimean Tatar/crh transliteration odds and ends to zodaaaaaaa. Jul 1 2018, 1:12 AM

• Vvjjkkii reopened this task as Open.

• Vvjjkkii removed TJones as the assignee of this task.

• Vvjjkkii triaged this task as High priority.