
native language name for Northern Tatar (code 'sty', written in Cyrillic) uses an incorrect initial Latin small letter C instead of the Cyrillic small letter ES (also check English)
Closed, Resolved · Public · BUG REPORT

Description

Steps to Reproduce:

  • {{#language:sty}} (get the native language name for Northern Tatar, whose ISO 639 code is 'sty')
  • See test cases and click on the column for native names to see how it is sorted.

Actual Results:

  • "cебертатар", where the initial is an incorrect Latin 'c' U+0063 (transcriptable to a Cyrillic letter 'к' or 'х')
  • The language name currently sorts incorrectly with other names written in the Latin script, and cannot be found by plain text search with its normal Cyrillic orthography.

Expected Results:

  • "себертатар", where the initial is the correct Cyrillic 'с' U+0441 (transcriptable to a Latin letter 's'), or "Себертатар" (if we want capitalization of the initial 'С' U+0421)
  • The language name will sort correctly with other names written in the Cyrillic script and is found in plain text search.
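The difference between the actual and expected strings is invisible in most fonts, so it helps to inspect the code points. The following is a minimal Python sketch, not part of the original report: the helper name `scripts_in` is hypothetical, and taking the first word of each Unicode character name is only a rough stand-in for the real Unicode Script property.

```python
import unicodedata

# "ебертатар": the part of the name that is Cyrillic in both variants.
TAIL = "\u0435\u0431\u0435\u0440\u0442\u0430\u0442\u0430\u0440"

broken = "\u0063" + TAIL  # starts with LATIN SMALL LETTER C (the bug)
fixed = "\u0441" + TAIL   # starts with CYRILLIC SMALL LETTER ES (expected)


def scripts_in(word):
    """Return the set of script prefixes found in the letters of `word`.

    Approximation: uses the first word of each character's Unicode name
    (e.g. "LATIN", "CYRILLIC") instead of the real Script property.
    """
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}


for label, word in (("broken", broken), ("fixed", fixed)):
    first = word[0]
    print(label, f"U+{ord(first):04X}", unicodedata.name(first), sorted(scripts_in(word)))
```

A name that reports more than one script, as the broken variant does here, is a candidate for this kind of homoglyph typo.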

Event Timeline

Verdy_p created this task. Mar 7 2020, 5:31 PM
Restricted Application added subscribers: Liuxinyu970226, Aklapper. Mar 7 2020, 5:31 PM
Verdy_p updated the task description. Mar 7 2020, 5:33 PM
Reedy added a subscriber: Reedy. Mar 7 2020, 7:35 PM

This seems to have been this way since T186359: Add Siberian Tatar (sty) to Names.php, so over 2 years at this point

Change 577804 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Fix native sty name from cебертатар to себертатар

https://gerrit.wikimedia.org/r/577804

Reedy added a comment. Mar 7 2020, 7:52 PM

Curiously, cебертатар gets more google results than себертатар

There's no reason why only the first letter should be Latin and not the remaining letters "e", "p", "t", "a" of this word (which are actually all Cyrillic like the rest of the language).

Google results most probably index content generated by MediaWiki from its bogus translation data, and its indexing agents can handle these possible confusions (including cases where the language is written with a Latin approximation when there's no way to type Cyrillic, but in that case the whole word should be Latin).

And I seriously doubt that the Latin letters C and P would be used to transliterate the Cyrillic letters ES and ER, given that they are very different phonetically (this could still happen for A, BE, E, TE without creating real confusion); if you can correctly type the Cyrillic letters A, BE, E, ER, TE with an existing Cyrillic keyboard layout, you should be able to type the Cyrillic letter ES correctly with the same layout.

The most probable origin is that this word was initially written correctly, entirely in Cyrillic but with a leading uppercase letter; then later someone wanted to change its capitalization and overwrote only the initial with the incorrect Latin letter.

Also, if I google it now, I actually see more results with the Cyrillic initial (in serious articles, news, blogs, talks...) than with the Latin initial (where it appears only in technical lists autogenerated from some data that must come initially from MediaWiki, but not within plain sentences!). Serious linguistic data sources all use the Cyrillic initial and don't mix in Latin!

And now, a few minutes after this fix was accepted and documented, I see numerous wikis that used the incorrect Latin letter, plus a few other sites, correcting it in their lists of language codes. It is clear that their source was MediaWiki and that the error in MediaWiki was propagated elsewhere.
Google has also made more extensive searches and now finds many more references to the Cyrillic-only orthography, and this is growing, while the number of results with the Latin letter is now shrinking quite rapidly (Google must also have updated its own indexing thresholds for result significance; their bots are aware of what happens here!).

Ebe123 added a subscriber: Ebe123. Mar 8 2020, 1:08 AM

@Verdy_p, the fix is not yet accepted; it still needs a "+2" from someone allowed to while I've just given a "+1" as I don't have that access. Once the patch is truly accepted, it will take a few weeks for the change to go to all the Wikimedia wikis.

Change 577804 merged by jenkins-bot:
[mediawiki/core@master] Fix native sty name from cебертатар to себертатар

https://gerrit.wikimedia.org/r/577804

Still not deployed. This native name still sorts incorrectly between "Cymraeg" and "dansk", among the names written in the Latin script.
Note that the customized sort order in [https://commons.wikimedia.org/wiki/Module:Multilingual_description/sort] fixes it (but only for Commons): the effect is not visible to humans in the automatic sort order for languages (including language navboxes that use this module) on Commons, but this bug affects all other templates and other wikis that do not fix it and use a basic sort (there are countless examples, including the homepage of wikimedia.org).
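The sorting symptom can be reproduced outside any wiki. This is a minimal Python sketch under stated assumptions: the sample list of names is illustrative only, and a case-insensitive code-point sort (`str.casefold`) stands in for whatever "basic sort" a given template uses; it is not the Commons module's algorithm.

```python
# "ебертатар": the Cyrillic remainder of the name.
TAIL = "\u0435\u0431\u0435\u0440\u0442\u0430\u0442\u0430\u0440"
broken = "c" + TAIL       # Latin initial, U+0063
fixed = "\u0441" + TAIL   # Cyrillic initial, U+0441

# "русский", used here only as another Cyrillic-script name for comparison.
russian = "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"
sample = ["dansk", "Cymraeg", "Deutsch", russian]


def basic_sort(names):
    """Case-insensitive code-point sort (an assumed stand-in for 'basic sort')."""
    return sorted(names, key=str.casefold)


print(basic_sort(sample + [broken]))  # broken name lands among the Latin names
print(basic_sort(sample + [fixed]))   # fixed name lands among the Cyrillic names
```

With the Latin initial, the name compares as "c…" and falls between "Cymraeg" and "dansk"; with the Cyrillic initial it sorts after every Latin-script name, next to the other Cyrillic entries.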

The typo is usually not visible with modern fonts, but is visible in some historic or handwritten styles (that may distinguish Cyrillic and Latin letter forms or sizes, or would not correctly join Latin letters with Cyrillic letters). It's exactly the same if you confuse the Latin/Greek/Cyrillic letters A/ALPHA/A, B/BETA/VE, C/SIGMA/ES, E/ETA/IE, HEN, I/IOTA/BIELORUSS-I, LEL, M/MU/EM, N/NU/EN, O/OMICRON/O, P/RHO/ER, T/TAU/TE, Y/UPSILON, Z/ZETA//E, as occurred with the "ASCII hacks" before Unicode, causing a lot of incorrect interpretations and readings.

And without the fix, the displayed name is still incorrect (some wikis don't rely on #language only and provide templates or modules to fix incorrect or unsupported names that are not built into the "#language:" parser function). This still causes havoc: various sites still display the language name incorrectly, and even pretend that this is the correct name, so a lot of people just copy-paste this name without checking it, and other non-wiki sites or external data and texts are being created with this incorrect orthography (whose sole source is Wikimedia... which invented a word that was exposed too broadly: a single error made by a single human in Wikimedia radically changed this language).
I don't know why this is still not applied in MediaWiki and deployed on all Wikimedia wikis (with some technical release note to inform users that this changed, in case there's some mysterious dependency on the broken name: if all that is affected is how home pages or categories are named, it's quite simple to rename these pages to match the new name, and these are not difficult to locate; after renaming there will remain a redirecting page, whose incoming links are easy to list and fix).
For other occurrences (e.g. in plain text) a search engine can easily spot most of them, like what is done when other pages are moved or their titles need to be changed for disambiguation, so this is not exceptional and not a serious technical limitation: there's no need for admin rights if these pages are not protected; but if you deploy the change, you should post a notice to the administrators' bulletin board of the affected wikis (I've also found that some wikis already implement redirecting pages to point the broken title to the title with standard orthography written in Cyrillic only).
This is really like other typos: obvious typos can be fixed without a lot of talk (there are typos everywhere in all wikis, and a constant flow of fixes which never requires a long discussion; many of these are even automated by bots that were not ordered to stop this work; these approved bots maintain a public word list which can be customized at any time if an imperative rule has to be canceled due to ambiguities; but here there's no ambiguity at all).

Reedy added a comment. May 22 2020, 8:07 PM

Still not deployed. This native name still sorts incorrectly between "Cymraeg" and "dansk" in Latin-written scripts.

It was only merged yesterday. Of course it isn't.

Nikerabbit closed this task as Resolved. Jun 22 2020, 1:01 PM
Nikerabbit claimed this task.
Nikerabbit removed a project: Language-Team.
Verdy_p added a comment. Edited Jun 29 2020, 2:16 PM

This is fixed only in the native name; still there's no English name assigned, which is still displayed as "себертатар" instead of "Northern Tatar".

The English names for several languages also include an LRM mark, as they just replicate the native name (which is also incorrect):

  • [es-formal] = "español (formal)&lrm;" (no LRM needed even in native Spanish!) – should be: "Spanish (formal)" in English
  • [hu-formal] = "magyar (formal)&lrm;" (no LRM needed even in native Hungarian!) – should be: "Hungarian (formal)" in English
  • [nl-informal] = "Nederlands (informeel)&lrm;" (no LRM needed even in native Dutch!) – should be: "Dutch (informal)" in English
  • [vo] = "Volapük" – should probably be "Volapuk" in English (without the combining diaeresis)
  • [vro] = [fiu-vro]= "Võro" – should probably be "Voro" in English (without the combining tilde)
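The trailing `&lrm;` in the entries above is an HTML entity for an invisible character, so it is easy to miss by eye. The following is a minimal Python sketch for spotting such characters, assuming the names are available as strings that may still contain HTML entities; the function name `invisible_marks` is hypothetical and this is not how the Commons test page does it.

```python
import html
import unicodedata


def invisible_marks(name):
    """List the format characters (Unicode category Cf) found in `name`.

    HTML entities such as "&lrm;" are decoded first, so both the entity
    form and the raw character are detected.
    """
    text = html.unescape(name)
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for ch in text
        if unicodedata.category(ch) == "Cf"
    ]


for candidate in ("español (formal)&lrm;", "Spanish (formal)"):
    print(repr(candidate), invisible_marks(candidate))
```

The `&lrm;` entity decodes to U+200E LEFT-TO-RIGHT MARK, which is what the first call reports; the plain English name yields an empty list.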

As a general rule, the English names of all languages should be plain ASCII only (of course, this does not apply to other translations or native names)...
This is also checked on the same test page: https://commons.wikimedia.org/wiki/Module_talk:Multilingual_description/sort/testcases (where you can see the red cells in the last column) using the following basic regular expression:

/^[A-Z][ '()%-/0-9A-Za-z]*['()%-/0-9A-Za-z]$/

The reason for that is that the English names of languages are used in contexts where only ASCII is expected (spaces, parentheses, hyphens, single quotes or decimal digits are still possible).
The test page also HTML-encodes the spaces to make sure they are not duplicated in the middle (but this is not dramatic and not signaled as an error).
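For readers who want to try the check locally, here is a rough Python equivalent of the expression quoted above. It rests on an assumption that is not stated in the thread: that the expression is a Lua pattern (as used in Scribunto modules), where `%-` is an escaped hyphen rather than the start of a `%` to `/` range. The function name `is_plain_ascii_name` is hypothetical.

```python
import re

# Python translation of the quoted pattern, assuming Lua semantics for "%-":
#   first character: an uppercase ASCII letter
#   middle:          space, ', (, ), -, /, digits, ASCII letters
#   last character:  same set, but no space
ASCII_NAME = re.compile(r"[A-Z][ '()\-/0-9A-Za-z]*['()\-/0-9A-Za-z]")


def is_plain_ascii_name(name):
    """Return True if `name` passes the ASCII-only English-name check."""
    # fullmatch plays the role of the ^ and $ anchors of the original.
    return ASCII_NAME.fullmatch(name) is not None


for candidate in ("Northern Tatar", "Spanish (formal)", "Volap\u00fck", "Spanish (formal)\u200e"):
    print(repr(candidate), is_plain_ascii_name(candidate))
```

Note that, as written, the expression requires at least two characters and rejects a trailing space, so names ending in an invisible mark fail it even though they look fine on screen.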

Verdy_p renamed this task from native language name for Northern Tatar (code 'sty', written in Cyrillic) uses an incorrect initial Latin small letter C instead of the Cyrillic small letter ES to native language name for Northern Tatar (code 'sty', written in Cyrillic) uses an incorrect initial Latin small letter C instead of the Cyrillic small letter ES (also check English). Jun 29 2020, 2:22 PM
Verdy_p reopened this task as Open.
Reedy added a comment. Jun 29 2020, 2:47 PM

This is fixed only in the native name; still there's no English name assigned, which is still displayed as "себертатар" instead of "Northern Tatar".

Why would there be? You didn't report that there wasn't one, nor request one to be added

Verdy_p added a comment. Edited Jun 29 2020, 2:59 PM

Because the fix was partial and not reviewed correctly (with all the coverage needed). I had also already indicated the test page (on Commons) where you can see all this (and where the bug was initially detected since the start)

Reedy added a comment. Jun 29 2020, 3:11 PM

Because the fix was partial and not reviewed correctly. I had also already indicated the test page (on Commons) where you can see all this (and where the bug was initially detected since the start)

Where in your original report does it say that there isn't an English name assigned?

Sure, you've changed it now, but it wasn't there originally.

Also, listing a test page on Commons isn't very helpful. Again, you only talk about native names, not the English ones. There was no mention that missing English names need fixing either.

We're not mind readers.

You're also listing other things on this bug that may be bugs, but have no relation to this issue and should be filed as separate issues.

Reedy updated the task description. Jun 29 2020, 3:11 PM
Aklapper closed this task as Resolved. Jun 29 2020, 3:14 PM

Please do not expand the scope of tasks and stay on-topic. Resetting status again.

But the native names for "formal"/"informal" variants are still incorrect with LRM.
A test page is a proof, and helpful to explain what is expected. That test page has full coverage of all locales supported in MediaWiki (at least the version currently deployed on Commons).