incorrect English names for languages (they display the native names only)
Open, Medium, Public

Description

The English names for several languages include an LRM mark because they just replicate the native name (which is itself incorrect in the first three cases), or they simply render the native name (autonym):

  • [es-formal] = "español (formal)‎" (no LRM needed even in native Spanish!) – must be: "Spanish (formal)" in English
  • [hu-formal] = "magyar (formal)‎" (no LRM needed even in native Hungarian!) – must be: "Hungarian (formal)" in English
  • [nl-informal] = "Nederlands (informeel)‎" (no LRM needed even in native Dutch!) – must be: "Dutch (informal)" in English
  • [gsw] = "Alemannisch" – should be "Alemannic" in English
  • [sty] = "себертатар" – must be "Northern Tatar" in English
  • [vo] = "Volapük" – should probably be "Volapuk" in English (without the combining diaeresis)
  • [vro] = [fiu-vro]= "Võro" – should probably be "Voro" in English (without the combining tilde)

The following test page also HTML-encodes the spaces to make sure they are not duplicated in the middle (this is not dramatic and is not signaled as an error):

https://commons.wikimedia.org/wiki/Module_talk:Multilingual_description/sort/testcases

As a general rule, the English names of all languages should be plain ASCII only (of course, this does not apply to other translations or native names).
This is also checked on the same test page (where you can see the red cells in the last column) using the following basic regular expression:

/^[A-Z][ '()%-/0-9A-Za-z]*['()%-/0-9A-Za-z]$/

The reason is that the English names of languages are used in contexts where only ASCII is expected. Spaces, parentheses, hyphens, single quotes, and slashes are still possible; applications are generally aware of whether these ASCII punctuation marks or spaces have to be replaced. Decimal digits may occur in the names of some variants, such as the year of an orthographic reform, but they generally don't cause problems.
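As a rough illustration of that check, here is a minimal Python sketch that applies the regular expression quoted above to a few of the names from this task. The helper name `is_plain_ascii_name` and the sample dictionary are only illustrative; this is not how the test page itself is implemented.

```python
import re

# Pattern quoted from the task description. Inside the character class,
# "%-/" is a range covering % & ' ( ) * + , - . /
ASCII_NAME = re.compile(r"^[A-Z][ '()%-/0-9A-Za-z]*['()%-/0-9A-Za-z]$")


def is_plain_ascii_name(name: str) -> bool:
    """Return True if the whole name is accepted by the pattern."""
    return ASCII_NAME.fullmatch(name) is not None


# Hypothetical sample: current values and suggested values from this task.
samples = {
    "es-formal (current)": "espa\u00f1ol (formal)\u200e",
    "es-formal (suggested)": "Spanish (formal)",
    "gsw (suggested)": "Alemannic",
    "vo (current)": "Volap\u00fck",
}

for label, name in samples.items():
    print(label, ascii(name), is_plain_ascii_name(name))
```

Note that the pattern requires at least two characters and an uppercase ASCII first letter, so a lowercase name such as "võro" fails on two counts.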

Yellow cells on the test page just signal cases where the autonym and the English name are identical (not necessarily an error, but it may indicate a missing translation, either in English or in the native name; some of these cases are OK like "Esperanto", whose autonym is correctly capitalized for that language).

Note that the LRM/RLM marks should not be used at all in any language:

  • For the few languages that display two native names in different scripts (when we don't specify the script variant), the solution is to write the LTR name first, then the "/", then the RTL name.
  • For correctly formatting lists of languages (showing their autonyms), the solution is to use Bidi isolation (the "bdi" element in HTML, or the equivalent "unicode-bidi: isolate" in CSS) for each item in the multilingual list. LRM/RLM are deprecated (they are not isolates, but deprecated overrides). See T252568.
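To make the isolation approach concrete, here is a small Python sketch that renders a list of autonyms with each item wrapped in its own "bdi" element. The function name `autonym_list_html` and the sample data are hypothetical; MediaWiki's own rendering code is not shown in this task.

```python
from html import escape


def autonym_list_html(autonyms: dict[str, str]) -> str:
    """Render autonyms as an HTML list, isolating each one in <bdi>."""
    items = []
    for code, name in autonyms.items():
        # Each name is isolated, so its direction cannot leak into
        # neighbouring items or punctuation.
        items.append(f'<li><bdi lang="{escape(code)}">{escape(name)}</bdi></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Hypothetical sample mixing LTR and RTL autonyms, with no LRM/RLM marks.
print(autonym_list_html({
    "de-formal": "Deutsch (Sie-Form)",
    "he": "\u05e2\u05d1\u05e8\u05d9\u05ea",
    "es-formal": "espa\u00f1ol (formal)",
}))
```

With this markup the names themselves carry no directional marks; the isolation is done entirely by the surrounding element.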

Isolates are the recommendation in the second version of the UBA (published many years ago), which was made to replace and deprecate all overrides (including the "bdo" HTML element and the RLM/LRM controls that were the only solution in the first version of the UBA and in HTML4).


In all cases, any trailing RLM or LRM without any known character after it is wrong: their use should be limited to very specific characters whose weak or strong directionality, or whose mirroring, one wants to change in a context of use within a text with known language/script (for example, to change the strong directionality of Latin letters or digits in an Arabic text).

Such use of Bidi overrides is very exceptional and only needed inside very specific names (like some brands/trademarks using these characters as if they were normal Arabic letters), or for uncommon notations of numbers when an Arabic text wants to present these numbers with a strong RTL direction instead of their default LTR direction, which is opposed to the normal direction of reading. Note that even Arabic and Persian digits are LTR, as they are written starting from the most significant digit on the left, with the other digits then following in backward reading order.

A specific context allows borrowing Hebrew letters in Latin texts and treating them as if they were Latin letters with strong LTR direction: LRM is then useful before that Hebrew letter only (it is found in some Latin names borrowing a Hebrew Aleph, but it is not needed for maths, where there's an Aleph mathematical symbol which is already LTR).
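The directional classes mentioned here can be inspected with Python's standard `unicodedata` module. The following is only a sketch for looking up the Bidi class of individual characters; the helper name `bidi_class` and the choice of sample characters are mine.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the Unicode Bidi class of a single character (e.g. L, R, AN)."""
    return unicodedata.bidirectional(ch)


# Latin letter, Hebrew alef, the mathematical alef symbol, digits in
# three digit sets, and the two directional marks themselves.
samples = ["A", "\u05d0", "\u2135", "1", "\u0661", "\u06f1", "\u200e", "\u200f"]

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class(ch)}")
```

Comparing U+05D0 (the Hebrew letter) with U+2135 (the mathematical symbol) shows why the mark is only relevant for the borrowed letter, not for the symbol.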

The other case for using Bidi overrides is historic texts in which a script was not written in its current modern direction (e.g. boustrophedon, or old Greek and Coptic written RTL). For such cases, "bdo" is still the best solution to embed a full line, and there's still no real need for RLM/LRM on just a single character except to force its mirroring (e.g. an arrow).


Request for patch of LocalNamesEn.php per comment T256649#7160228 below:

Event Timeline

Verdy_p updated the task description.

Aren't these supposed to be translated to English from CLDR data (except variants like "formal" and "informal"), whereas only autonyms are in languages/Name.php?
Or is there another source (in messages imported from translatewiki.net)?

Anyway I don't see the rationale for including a LRM marker (U+200E) in every entry of

(mediawiki/core)/master/languages/data/Names.php

whose value is terminated by a closing parenthesis.

All these should clearly be removed: you don't know which character will follow, so forcing the LTR direction is clearly invalid when these language names are used in RTL lists, where it corrupts the list order. It is also incoherent when there are language names using mixed scripts with distinct directions (Latin/Arabic) or just RTL (including Arabic itself, even though it has no parentheses).

In all cases, we need to isolate ALL autonyms inside "bdi" elements to restore a functional and readable list with correct ordering. And if "bdi" is used to encapsulate them, these final LRM/RLM marks lose ALL their effect; they are clearly spurious pollution.

This affects:

  • 'be-tarask' => "беларуская (тарашкевіца)\u{200E}", # Belarusian in Taraskievica orthography
  • 'be-x-old' => "беларуская (тарашкевіца)\u{200E}", # (be-tarask compat)
  • 'crh-latn' => "qırımtatarca (Latin)\u{200E}", # Crimean Tatar (Latin)
  • 'crh-cyrl' => "къырымтатарджа (Кирилл)\u{200E}", # Crimean Tatar (Cyrillic)
  • 'de-formal' => "Deutsch (Sie-Form)\u{200E}", # German - formal address ("Sie")
  • 'es-formal' => "español (formal)\u{200E}", # Spanish formal address
  • 'gan-hans' => "赣语(简体)\u{200E}", # Gan (Simplified Han)
  • 'gan-hant' => "贛語(繁體)\u{200E}", # Gan (Traditional Han)
  • 'hu-formal' => "magyar (formal)\u{200E}", # Hungarian formal address
  • 'kk-arab' => "قازاقشا (تٴوتە)\u{200F}", # Kazakh Arabic ↳ (terminated by RLM instead of LRM... another incoherence!)
  • 'kk-cyrl' => "қазақша (кирил)\u{200E}", # Kazakh Cyrillic
  • 'kk-latn' => "qazaqşa (latın)\u{200E}", # Kazakh Latin
  • 'kk-cn' => "قازاقشا (جۇنگو)\u{200F}", # Kazakh (China) ↳ (terminated by RLM instead of LRM... another incoherence!)
  • 'kk-kz' => "қазақша (Қазақстан)\u{200E}", # Kazakh (Kazakhstan)
  • 'kk-tr' => "qazaqşa (Türkïya)\u{200E}", # Kazakh (Turkey)
  • 'ku-latn' => "kurdî (latînî)\u{200E}", # Northern Kurdish (Latin script)
  • 'ku-arab' => "كوردي (عەرەبی)\u{200F}", # Northern Kurdish (Arabic script) (falls back to ckb) ↳ (terminated by RLM instead of LRM... another incoherence!)
  • 'nl-informal' => "Nederlands (informeel)\u{200E}", # Dutch (informal address ("je"))
  • 'nrm' => 'Nouormand', # Norman (invalid code; 'nrf' in ISO 639 since 2014) ↳ (why are you mapping only "nrm" but still not "nrf"?)
  • 'sr-ec' => "српски (ћирилица)\u{200E}", # Serbian Cyrillic ekavian ↳ (why do you keep it, why not adding "sr-cyrl"?)
  • 'sr-el' => "srpski (latinica)\u{200E}", # Serbian Latin ekavian ↳ (why do you keep it, why not adding "sr-latn"?)
  • 'zh-cn' => "中文(中国大陆)\u{200E}", # Chinese (PRC)
  • 'zh-hans' => "中文(简体)\u{200E}", # Mandarin Chinese (Simplified Chinese script) (cmn-hans)
  • 'zh-hant' => "中文(繁體)\u{200E}", # Mandarin Chinese (Traditional Chinese script) (cmn-hant)
  • 'zh-hk' => "中文(香港)\u{200E}", # Chinese (Hong Kong)
  • 'zh-mo' => "中文(澳門)\u{200E}", # Chinese (Macau)
  • 'zh-my' => "中文(马来西亚)\u{200E}", # Chinese (Malaysia)
  • 'zh-sg' => "中文(新加坡)\u{200E}", # Chinese (Singapore)
  • 'zh-tw' => "中文(台灣)\u{200E}", # Chinese (Taiwan)
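Entries like the ones listed above could be found mechanically. Here is a minimal Python sketch, assuming the names have already been loaded into a dictionary; the three-entry `names` sample and the helper names are hypothetical, and parsing Names.php itself is out of scope.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def trailing_mark(name: str):
    """Return 'LRM' or 'RLM' if the name ends with that mark, else None."""
    if name.endswith(LRM):
        return "LRM"
    if name.endswith(RLM):
        return "RLM"
    return None


def strip_trailing_marks(name: str) -> str:
    """Remove any run of trailing LRM/RLM marks from the name."""
    return name.rstrip(LRM + RLM)


# Hypothetical excerpt of the data, copied from the list above.
names = {
    "de-formal": "Deutsch (Sie-Form)\u200e",
    "kk-arab": "\u0642\u0627\u0632\u0627\u0642\u0634\u0627 (\u062a\u0674\u0648\u062a\u06d5)\u200f",
    "nrm": "Nouormand",
}

for code, name in names.items():
    print(code, trailing_mark(name), ascii(strip_trailing_marks(name)))
```

Only marks at the very end of the value are touched, so a mark that was deliberately placed before a specific character inside a name would be left alone.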

All this looks like "quirks" created a long time ago, before support for "bdi" elements was added in now very old versions of MediaWiki. All these "overrides" are really harmful: today they cause more problems than they attempted to solve. UBA v2 (standardizing the "isolates") was published many years ago in a now old version of Unicode, and almost all browsers implement it.

The legacy UBA v1 (with overrides only) is hardly used any longer, and it proved to be very broken and very hard to process correctly in a multilingual templated context. Overrides are only kept for legacy static content that is already isolated in its own document, where all characters have their direction resolved locally.

Strong Bidi-overrides (RLM/LRM+character, or longer contents inside "bdo") should only be used with known characters after them.

I came here because of the LRM mark. I once authored T244787, which addressed some of the same issues that are addressed here. That, as you can see, was declined.

Another oddity that we have come across is {{#language:he|am}} which is terminated with U+FEFF zero width no-break space.

Also, is there a reason that some language names use U+0027 apostrophe (O'odham is one) and other language names use U+2019 right single quotation mark (Cànan Hawai’i is one – but shouldn't the ʻokina be U+02BB modifier letter turned comma?)

Esc3300 added subscribers: Amire80, jhsoby, Esc3300.

Can we add the English names in the description above to LocalNames? @Amire80 @jhsoby

Esc3300 set Due Date to Jun 25 2021, 12:00 AM.

@Esc3300: Please don't set Due Dates for no clear reason. Thanks.

@Aklapper: Next step for defining local names would be to get langcom approval. Per T284276, this would generally happen within 2 weeks. So the due date for this is June 25. Is there a better way to define these dates? I can leave a note in the comments just before or after setting the dates.

T284276 started as the entire process for new language codes at Wikidata results in a suboptimal user experience for volunteer contributors who make a somewhat trivial request.

@Esc3300: If this is a hard deadline that you plan to follow up on, then please feel free to set it. (Sorry, that was not clear to me beforehand.) Thanks! :)

@Aklapper : ideally a nice workflow routine would move it to the next step: add a patch-welcome tag and set a new target date. I guess I (or someone else) will have to do it manually.

About the request: the following are currently consistent with enwiki/Wikidata label. I don't think these should be changed.

The names of the following seem somewhat inconsistent between languages. I'd fix them separately.

Per enwiki/Wikidata, "Siberian Tatar" seems to be the name for:

Suggested changes for the following seem ok:

To summarize: I'd only add the last four to LocalNamesEn.php

Esc3300 triaged this task as Medium priority. Jun 21 2021, 7:28 AM
Esc3300 set Due Date to Jun 28 2021, 12:00 AM.
Esc3300 added projects: patch-welcome, Wikidata.
Esc3300 updated the task description.
Esc3300 removed subscribers: jhsoby, Amire80.

I updated the task accordingly and set the one week for the patch as "due date".

If this is a hard deadline that you plan to follow up on, then please feel free to set it. (Sorry, that was not clear to me beforehand.) Thanks! :)

@Esc3300: This is not a hard deadline, as pointed out in T284276#7160239. Please do not set Due Dates if you don't plan to work on stuff. Thanks.

Change 701662 had a related patch set uploaded (by Mbch331; author: Mbch331):

[mediawiki/extensions/cldr@master] Missing translations for several (in)formal languagee codes

https://gerrit.wikimedia.org/r/701662

As the four languages are also incorrectly labeled on WD, can WB-WMDE review the patch?

Change 701662 merged by jenkins-bot:

[mediawiki/extensions/cldr@master] Add missing translations for several (in)formal languagee codes

https://gerrit.wikimedia.org/r/701662

It works for the four above. Shall we close this as done?

Not obvious to me that anything has changed. At en.wiki

  • {{#language:sty|en}} still returns: себертатар
  • {{#language:es-formal|en}} still returns: español (formal)
  • {{#language:hu-formal|en}} still returns: magyar (formal)
  • {{#language:nl-informal|en}} still returns: Nederlands (informeel)

Close when done; not before.

It worked on Wikidata. Is it OK now, or is some other feature needed?

In the week or so since my post (T256649#7246452), the four codes mentioned have been fixed at en.wiki.

At en.wiki, a subset of those mentioned in the OP:

  • [gsw] = "Alemannisch" – should be "Alemannic" in English
  • [sty] = "себертатар" – must be "Northern Tatar" in English
  • [vo] = "Volapük" – should probably be "Volapuk" in English (without the combining diaeresis)
  • [vro] = [fiu-vro]= "Võro" – should probably be "Voro" in English (without the combining tilde)

are different from OP's suggestions:

  • [gsw] = Swiss German
  • [sty] = Siberian Tatar – note that OP says 'must be "Northern Tatar"' (emphasis added)
  • [vo] = Volapük – not changed counter to OP's suggestion
  • [fiu-vro] = võro – why lowercase when:
    • [vro] = Võro (also not changed counter to OP's suggestion)? but see T256649#7160228

Except for the Võro / võro capitalization discrepancy, I have no objections to the name choices that differ from OP's suggestions.

Another oddity that we have come across is {{#language:he|am}} which is terminated with U+FEFF zero width no-break space.

Also, is there a reason that some language names use U+0027 apostrophe (O'odham is one) and other language names use U+2019 right single quotation mark (Cànan Hawai’i is one – but shouldn't the ʻokina be U+02BB modifier letter turned comma?)

The above have not been answered. In that post I neglected to mention that Cànan Hawai’i ← {{#language:haw|gd}} and O'odham ← {{#language:ood|en}}. There are about 570 language-code / target-language-code pairs that render the language name with U+2019 right single quotation mark and about 1770 pairs that render the language name with U+0027 apostrophe (no doubt many of these in both groups are fall-backs to some other target language). At en.wiki, only nqo renders a language name with U+2019 right single quotation mark: N’Ko ← {{#language:nqo|en}}. For code nqo, the ISO 639 custodians use U+0027 apostrophe for 'N'ko' (ISO 639-3 and ISO 639-2 English) and for 'n'ko' (ISO 639-2 French). IANA, in the language-subtag-registry file, supports both 'N'ko' (U+0027 apostrophe) and 'N’Ko' (U+2019 right single quotation mark).

Still, shouldn't the U+2019 right single quotation mark be replaced with the U+0027 apostrophe in cases where U+2019 does not have special meaning (glottal stops or whatever)? And where U+2019 right single quotation mark is used in place of special characters like ʻokina (Hawaiʻian and other languages), shouldn't U+2019 right single quotation mark be replaced with U+02BB modifier letter turned comma or other appropriate character?
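To audit which apostrophe-like character a given name uses, something along these lines could be run over the rendered names. This is only a sketch: the set of characters checked is limited to the three mentioned in this comment, and the function name is illustrative.

```python
import unicodedata

# The three characters discussed above: apostrophe, right single
# quotation mark, and modifier letter turned comma (the okina).
APOSTROPHE_LIKE = {"\u0027", "\u2019", "\u02bb"}


def apostrophe_like(name: str) -> list[str]:
    """List every apostrophe-like character in the name, in order."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for ch in name
        if ch in APOSTROPHE_LIKE
    ]


# Names quoted in this comment, written with explicit escapes.
for name in ["O'odham", "N\u2019Ko", "C\u00e0nan Hawai\u2019i"]:
    print(ascii(name), apostrophe_like(name))
```

Grouping the results by code point would give counts comparable to the approximate figures quoted above, without deciding which character is the right one for each language.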

The ticket is about English names of languages and autonyms (as per its title).

After the ticket lingered for considerable time, I reviewed it at T256649#7160228 and we did some fixes based on this review.

[fiu-vro]= "võro" (in English) instead of [fiu-vro]= "Võro" seems to have come up after the ticket was created (see https://phabricator.wikimedia.org/rMWdbb1a9ef64604cedfd85b7c04a6b4699cad5c3c2). If you think the legacy code "fiu-vro" should have a different name, you might want to create a ticket for it.

Other things might better be handled separately as well.

It works for the four above. Shall we close this as done?

Not obvious to me that anything has changed. At en.wiki

  • {{#language:sty|en}} still returns: себертатар
  • {{#language:es-formal|en}} still returns: español (formal)
  • {{#language:hu-formal|en}} still returns: magyar (formal)
  • {{#language:nl-informal|en}} still returns: Nederlands (informeel)

If the edit was to remove the superfluous LRM mark, you won't notice the change visually in any page written in those languages, but only in pages written in RTL scripts (Arabic, Hebrew, etc.). You may still see it if you use the HTML inspector in a browser's console to scan selected text elements and enumerate the codepoints present in them, or if you use a special view mode with "visible controls" that renders the position of Bidi controls as special symbols.
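Outside a browser, the same kind of inspection can be sketched in Python by reporting every format character (Unicode general category Cf) in a copied string. The function name `invisible_controls` is an assumption for illustration, and the sample strings are taken from this task.

```python
import unicodedata


def invisible_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, description) for each format character (category Cf)."""
    found = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            found.append((index, f"U+{ord(ch):04X} {unicodedata.name(ch)}"))
    return found


# A name with a trailing LRM, and a bare U+FEFF as reported earlier
# in this thread.
print(invisible_controls("magyar (formal)\u200e"))
print(invisible_controls("\ufeff"))
```

An empty result means no format characters were found, which is what one would expect for a name that relies on isolation instead of directional marks.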

Also, labels that are in Wikidata are not relevant to the default language names used by the #language parser function, which come from MediaWiki internal data or from CLDR.

Not clear what you are talking about. Sometime between the time that I wrote T256649#7246452 and today, someone has fixed something so that at en.wiki the language names now render in English as they should:

  • {{#language:sty|en}} returns: Siberian Tatar
  • {{#language:es-formal|en}} returns: Spanish (formal address)
  • {{#language:hu-formal|en}} returns: Hungarian (formal address)
  • {{#language:nl-informal|en}} returns: Dutch (informal address)

That is what I wanted to see fixed, and now that it has been, thank you to whoever did the fixing. I don't think that I have ever written about Wikidata in this discussion because that is a place that I prefer to avoid.

Possibly fixed by rMW02e1a2f83e99: Make LanguageNameUtils more lenient with input which changed handling of non-standards compliant language codes and/or updated language names.