
Local language name should be translatable in translatewiki.net
Open, Needs Triage, Public

Description

The endonym of each supported language is part of its definition, but the translation of the name into other languages (exonym) can be treated as data.

The files for exonyms not provided by CLDR are in the LocalNames folder: https://phabricator.wikimedia.org/diffusion/ECLD/browse/master/LocalNames/

Currently we have to modify the repository manually, like T162406: Add Scots names for languages, or else people can contribute upstream to CLDR with the survey tool (the English-language exonym needs to be added to the English locale first and then it can be translated to any of the ~200 supported languages).

Event Timeline

Maybe we can work on this during the hackathon? I’ve heard a few interested people will be there :)

Do you want to focus on the exonyms in languages which are supported by MediaWiki core (or at least translatewiki.net) but not in CLDR?

Nemo_bis renamed this task from Local language name should be translatable in translatewiki to Local language name should be translatable in translatewiki.net. May 3 2024, 8:05 AM
Nemo_bis updated the task description.

Do you want to focus on the exonyms in languages which are supported by MediaWiki core (or at least translatewiki.net) but not in CLDR?

Kind of, I think. I was just looking through this with @Nikki, and the extension also overrides some of CLDR’s names in LocalNames; for instance, CLDR has 'gan' => 'Gan Chinese' which LocalNames overrides to 'gan' => 'Gan' (nan and wuu likewise lose “Chinese”), or 'und' => 'Unknown language' is being overridden to 'und' => 'undetermined language' (should be lower case). We’d like to make these overrides translatable in translatewiki.net – however, we don’t want translators wasting their time translating the hundreds of language names that are in CLDR and don’t have any issues.

Proposal: We add to i18n/ all the language codes that aren’t in upstream CLDR yet, as well as all the ones that are being overridden in English (assuming that we want those overrides to be translated), but not the ones where LocalNamesEn doesn’t have an entry or is equal to CLDR (assuming that we don’t want those translated either).
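
A minimal sketch of that selection rule, assuming the English CLDR names and the English LocalNames entries are available as plain dictionaries. The function name is hypothetical, and the sample names other than gan and und are illustrative only:

```python
def codes_for_i18n(cldr_en: dict[str, str], local_en: dict[str, str]) -> set[str]:
    """Return the language codes whose English name should become a message.

    A code is selected when LocalNamesEn has an entry for it and that entry
    is either absent from CLDR or differs from the CLDR name.
    """
    selected = set()
    for code, local_name in local_en.items():
        cldr_name = cldr_en.get(code)
        if cldr_name is None or cldr_name != local_name:
            selected.add(code)
    return selected


# Illustrative data; only the gan and und values are taken from the discussion.
cldr_en = {
    "gan": "Gan Chinese",
    "und": "Unknown language",
    "nyo": "Nyoro",
    "de": "German",
}
local_en = {
    "gan": "Gan",                      # genuine override
    "und": "undetermined language",    # genuine override
    "nyo": "Nyoro",                    # identical to CLDR, so not selected
    "rhg-rohg": "Rohingya (Hanifi)",   # not in CLDR at all
}

print(sorted(codes_for_i18n(cldr_en, local_en)))
```

Codes that exist only in CLDR (de in the sample) never enter the result, which matches the goal of not asking translators to redo names that CLDR already covers.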

Hm, two complications.

  1. How do we get the LocalNames translations into upstream CLDR? We already have some cases where a language name is both in LocalNames and in CldrNames (currently nyo and vmw in English), where the override isn’t necessary in principle (the English names are identical in both files), but we have some translations that aren’t in CLDR (e.g. LocalNamesNo has 'nyo' => 'nyoro' and 'vmw' => 'makhuwa', but only the former is also in CldrNamesNo). When we make it easier to translate language names, we’re going to have many more cases like that. It would be nice to remove our overrides after a language is added in upstream CLDR, but I don’t even know what the license conditions of CLDR are.
  2. We have some LocalNames files that don’t seem to exist in CLDR at all, e.g. we have LocalNamesMnw (mnw is the Mon language) but no CldrNamesMnw. In those cases, I guess we want to allow translations of all language codes (even 'en'), even when we don’t want to allow those language codes to be translated into most other languages… I think it might be best to leave these cases to PHP files (updated via Gerrit), i.e. the status quo, and not move them to translatewiki.net at all.

In https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2023/Translation/Translatable_language_names last year, I suggested adding all the languages and then importing the translations from CLDR so people don't have to retranslate them. If we did that, it could also update translations when they change in CLDR (maybe only if the name in Translatewiki matches the previous CLDR name, if we want to avoid overwriting names).

@Nikki wrote in comment T231755#9766263:

In https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2023/Translation/Translatable_language_names last year, I suggested adding all the languages and then importing the translations from CLDR so people don't have to retranslate them. If we did that, it could also update translations when they change in CLDR (maybe only if the name in Translatewiki matches the previous CLDR name, if we want to avoid overwriting names).

That seems like a reasonable starting point. It would allow overriding existing translations from CLDR, but encourage people to submit them to upstream to avoid them being overwritten by every update from CLDR.

Hmmm, I think that could work, yeah. If we always overwrite the translations with the CLDR data each time we update to a new CLDR version, we could put something like this in qqq.json for the affected language codes:

Note: This language name is translated in the CLDR database. Any changes you make here will be overwritten the next time the CLDR data is updated; please also submit your changes to the CLDR database. [TODO links to useful documentation]

And one benefit of this would be that people can concurrently submit new translations to CLDR and on translatewiki.net, and the translations would go live on Wikimedia pretty quickly (within a week), and then in the longer term they would come from CLDR.

And in cases where people create a translation only on translatewiki.net, but later CLDR publishes a different translation, they’ll hopefully see the change in their watchlist.

I think license-wise, uploading the CLDR data to translatewiki.net should be fine; the Unicode License v3¹ (archived) only requires that “this copyright and permission notice appear in associated Documentation”, which seems pretty generous (we can include it in the message group documentation or something like that, and then that’s the associated Documentation).

¹: cldr.git’s LICENSE is an older version btw (and somewhat more restrictive in fact), we should probably update that ^^

Hm, two complications.

  1. How do we get the LocalNames translations into upstream CLDR?

As far as I know, we don't actively contribute the translations to upstream CLDR at the moment.

According to https://github.com/unicode-org/cldr/blob/main/docs/requesting_changes.md people can contribute data by making tickets in their bug tracker or by adding it using the CLDR Survey Tool. I think @Reedy is the person to contact about getting access to the Survey Tool.

(It would be nice to have more documentation about how Wikimedians can contribute to CLDR in general, e.g. there are various languages that MediaWiki supports that CLDR doesn't have locales for and it would be nice if we could help speakers of those languages provide the data that CLDR wants... but that's a bit off-topic for this ticket)

Change #1026805 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/cldr@master] Update LICENSE to Unicode License v3

https://gerrit.wikimedia.org/r/1026805

Change #1026805 merged by jenkins-bot:

[mediawiki/extensions/cldr@master] Update LICENSE to Unicode License v3

https://gerrit.wikimedia.org/r/1026805

FWIW, I think with the new suggested approach, the two complications I mentioned above aren’t really relevant anymore. Or at least, I’m happy to go ahead with it for now ^^

For upstreaming translation to CLDR, see relevant discussion at T151269: Add English names for languages which don't yet have one. @Nemo_bis is known as a point of contact for that.

the English-language exonym needs to be added to the English locale first

For languages listed in ISO 639-3, we can simply fill them from IANA registry. The relevant task is T168799: Integrate IANA language registry with language-data and MediaWiki (let MediaWiki "knows" all languages with ISO 639-1/2/3 codes).

This leaves some language codes with a script/country/region code. In the long term we should not maintain each variant either; instead, they should be derived programmatically from the base language name and the name of the country/script.

We might want to have two messages for some language codes, actually – one with the CLDR language name, and one with our override. For example:

en.json
{
	"cldr-languagename-gan": "Gan Chinese",
	"cldr-languagename-gan-override": "Gan"
}
qqq.json
{
	"cldr-languagename-gan": "CLDR language name of the 'gan' language code. This language name is translated in the CLDR database. Any changes you make here will be overwritten the next time the CLDR data is updated; please also submit your changes to the CLDR database. [TODO links to useful documentation]",
	"cldr-languagename-gan-override": "MediaWiki override of the 'gan' language code, replacing {{msg-mw|cldr-languagename-gan}}. We override this language code because [TODO]. Please consider also submitting a CLDR translation of {{msg-mw|cldr-languagename-gan}} if it does not exist yet."
}

We would only define these -override messages for the messages where we actually need an override; we don’t want people to use this to make changes to CLDR that would better be updated upstream (if I understood @Nikerabbit correctly).

I slapped together a script to put the CldrNames and LocalNames into i18n JSON files: P61836

I still need to look into the output; there are generally more -override messages than I expected, and some language codes (e.g. be-tarask) seem to only contain -overrides, which is curious.

So is the idea that CLDR will populate both the languagename-xx and languagename-xx-override messages, and translators in Translatewiki will have access to changing/adding translations to the languagename-xx-override messages only?

So is the idea that CLDR will populate both the languagename-xx and languagename-xx-override messages, and translators in Translatewiki will have access to changing/adding translations to the languagename-xx-override messages only?

As I understand it now, they would be able to add or change languagename-xx messages too, but the message documentation would show a warning for it. (But I think there’s a legitimate use case for it – sending a translation to translatewiki and CLDR at the same time, and then having it in production via translatewiki without having to wait for the next CLDR release.)

some language codes (e.g. be-tarask) seem to only contain -overrides, which is curious.

AFAICT this actually only affects two language codes. be-tarask exists in CLDR (core/common/main/be_TARASK.xml), but I think rebuild.php doesn’t turn it into CldrNamesBe_tarask.php (might not be intentional?); sco doesn’t exist in CLDR AFAICT.

I think I might be conflating several use cases of “overrides” messages here… let me think about this a bit more.

As I understand it now, they would be able to add or change languagename-xx messages too, but the message documentation would show a warning for it. (But I think there’s a legitimate use case for it – sending a translation to translatewiki and CLDR at the same time, and then having it in production via translatewiki without having to wait for the next CLDR release.)

I think a way to solve this would be to split the language names i18n folder into two:

  • i18n/language_names_cldr/en.json
  • i18n/language_names_overrides/en.json

The script could populate both of them, but only the files in i18n/language_names_overrides get added to TWN. Does that make sense?

Change #1026922 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/cldr@master] Load CLDR localization for 'be-tarask'

https://gerrit.wikimedia.org/r/1026922

I think a way to solve this would be to split the language names i18n folder into two:

I don’t fully understand this yet – I don’t see this as a problem that needs solving :D what would be the benefit of having language_names_cldr if we’re not putting it on TWN? Is it just that JSON is a nicer data format than PHP?

I think I might be conflating several use cases of “overrides” messages here… let me think about this a bit more.

Trying to tease these apart, I’ve found three different cases so far:

  1. genuine overrides where we want to replace the CLDR name, like und (make it lowercase) or gan (remove “Chinese”)
  2. cases where the language the name is in isn’t in CLDR (only sco / Scots, I think – the above change takes care of be-tarask)
  3. cases where the language the name describes isn’t in CLDR (e.g. rhg-rohg, Rohingya in Hanifi Rohingya script)

The first and third case are technically pretty similar – in both cases we’ll want a message in en.json and translations in lots of languages. But I think it might make sense to put them in different messages, with the CLDR names “between” them in precedence – something like: languagename-CODE-override takes precedence over languagename-CODE which takes precedence over languagename-CODE-local. When support for a language is added to MediaWiki, we define languagename-CODE-local and let people translate that; if the language is later added to CLDR, then languagename-CODE starts to exist and languagename-CODE-local becomes obsolete.
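
A rough sketch of that precedence order, assuming the messages for one interface language are held in a flat dictionary keyed as described above. The function name and the sample values are hypothetical:

```python
from typing import Optional


def resolve_language_name(code: str, messages: dict[str, str]) -> Optional[str]:
    """Pick the name for a language code using override > CLDR > local."""
    for suffix in ("-override", "", "-local"):
        name = messages.get(f"languagename-{code}{suffix}")
        if name is not None:
            return name
    return None


# Illustrative English messages.
messages = {
    "languagename-gan": "Gan Chinese",             # CLDR name
    "languagename-gan-override": "Gan",            # MediaWiki override wins
    "languagename-de": "German",                   # CLDR name only
    "languagename-rhg-rohg-local": "Rohingya (Hanifi script)",  # not in CLDR
}

for code in ("gan", "de", "rhg-rohg", "xx"):
    print(code, "->", resolve_language_name(code, messages))
```

If rhg-rohg were later added to CLDR, a languagename-rhg-rohg message would start to exist and would shadow the -local one without any further change to the lookup.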

The second case (sco) feels more different, and I’m tempted to leave that in PHP files, at least for now. (If we only want these language names in sco and no other language, it doesn’t really make sense to put them on TWN, does it?)

Just curious:

As far as I remember, we already recommend not concatenating words without a hyphen in new message keys, including prefixes (i.e., language-name-* instead of languagename-*)?

  2. cases where the language the name is in isn’t in CLDR (only sco / Scots, I think – the above change takes care of be-tarask)

I was wrong – the languages with LocalNames but no CldrNames are: gom-latn, hts, mnw, sco, sh, sjd, sje. (As far as I can tell, they’re all completely absent from CLDR’s common/main/*.xml, i.e. they’re not like be-tarask where we just weren’t loading them.)

As far as I remember, we already recommend not concatenating words without a hyphen in new message keys, including prefixes (i.e., language-name-* instead of languagename-*)?

I don’t know about that, but I’m happy to go with language-name too :)

Note: hts is not currently supported in MediaWiki core, translatewiki or language-data/ULS.

Change #1026922 merged by jenkins-bot:

[mediawiki/extensions/cldr@master] Load CLDR localization for 'be-tarask'

https://gerrit.wikimedia.org/r/1026922

As I understand it now, they would be able to add or change languagename-xx messages too, but the message documentation would show a warning for it. (But I think there’s a legitimate use case for it – sending a translation to translatewiki and CLDR at the same time, and then having it in production via translatewiki without having to wait for the next CLDR release.)

I think a way to solve this would be to split the language names i18n folder into two:

  • i18n/language_names_cldr/en.json
  • i18n/language_names_overrides/en.json

The script could populate both of them, but only the files in i18n/language_names_overrides get added to TWN. Does that make sense?

I didn’t fully understand this proposal yesterday and talking to @jhsoby now clarified it :)

Under this proposal, the CLDR names go into both of those JSON files. So the “overrides” file is really the combination of CLDR + any overrides we have. The “CLDR” file doesn’t need to go on TWN, and we don’t even use it at runtime; the “overrides” file is what we use at runtime. I think this proposal can cover all the three kinds of “overrides” I proposed in T231755#9767672; case 3 corresponds to language names that are only in the “overrides” files but not in the “CLDR” files at all.

When we’re upgrading to a new CLDR version, we would write the language names there to the “overrides” file iff the language name there matches the previous version (which we can get from the “CLDR” JSON); then we would overwrite the “CLDR” JSON with the new CLDR data.

What’s still tricky: when the language name in the “overrides” file isn’t the same as the previous CLDR language name, but the new CLDR language name is also different from the previous one (i.e. we were already overriding CLDR, but now CLDR changed “under” us), @jhsoby says it would be great to mark the translation in TWN as fuzzy. (Sounds fine to me, but I don’t know much about fuzzy ^^) But there’s no way to do that yet.
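
The update rule from the last two paragraphs could be sketched as follows. This is only an illustration under the assumption that all three names exist for the code in question; the names Update and merge_name are hypothetical, and the fuzzy flag is merely computed here, since there is no mechanism yet to pass it on to TWN:

```python
from typing import NamedTuple


class Update(NamedTuple):
    message: str   # value to write to the "overrides" file
    fuzzy: bool    # True if a translator should re-check the value


def merge_name(old_cldr: str, new_cldr: str, message: str) -> Update:
    """Three-way merge of one language name during a CLDR upgrade."""
    if message == old_cldr:
        # The message was never customized, so it simply follows CLDR.
        return Update(new_cldr, False)
    # The message is an override: keep it, and flag it only when CLDR
    # changed underneath it to something that differs from the override too.
    fuzzy = new_cldr != old_cldr and new_cldr != message
    return Update(message, fuzzy)


print(merge_name("A", "B", "A"))  # untouched message follows CLDR
print(merge_name("A", "A", "C"))  # override kept, CLDR unchanged
print(merge_name("A", "B", "C"))  # override kept, but CLDR changed under it
```

After running this for every name, the “CLDR” JSON would be overwritten with the new CLDR data so that it can serve as old_cldr in the next upgrade.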

Note: hts is not currently supported in MediaWiki core, translatewiki or language-data/ULS.

This one was added in an attempt to fix T303379 (but as I pointed out in T303379#7965895, it would have to be added somewhere else to make it work like they wanted it to)

I made a new version of the script which dumps the combined CLDR and local names into the i18n JSON files: P61867

(The separate CLDR-only JSON should be emitted by rebuild.php instead, that’s still TBD.)

This also revealed that there are some language codes that have language names only in some non-English languages; no has 99 “extra” language names, and sh, nds and frr have 65 each. (Details in this Gist, or even more details in this other gist, though the additional details in the second one might not actually be useful, to be honest.) We should probably add those language codes to LocalNamesEn first – I assume TWN will be unhappy if there are messages missing from en.json.
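
A check of this kind can be sketched as a set difference per language, assuming the combined names are loaded into a dictionary keyed by interface language. The function name and the sample data are hypothetical and do not reproduce the actual counts:

```python
def codes_missing_from_english(
    names_by_language: dict[str, dict[str, str]],
) -> dict[str, set[str]]:
    """For each non-English language, list codes that have no English name."""
    english_codes = set(names_by_language.get("en", {}))
    missing = {}
    for lang, names in names_by_language.items():
        if lang == "en":
            continue
        extra = set(names) - english_codes
        if extra:
            missing[lang] = extra
    return missing


# Illustrative data only.
names_by_language = {
    "en": {"de": "German", "nyo": "Nyoro"},
    "no": {"de": "tysk", "nyo": "nyoro", "vmw": "makhuwa"},
    "nds": {"de": "Düütsch"},
}

for lang, codes in codes_missing_from_english(names_by_language).items():
    print(lang, len(codes), sorted(codes))
```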

As Wikimedia-Hackathon-2024 approaches the final showcase, I want to summarize the current plan (which we didn’t get to implement yet):

  • We want to add missing names to LocalNamesEn first, see T231755#9772765.
  • We want to add interface messages for language names; they will be populated from CldrNames and LocalNames combined (i.e., LocalNames with fallback to CldrNames). These messages will all be translatable on TWN. (The script at P61867 goes in this direction but isn’t done yet.)
  • \MediaWiki\Extension\CLDR\LanguageNames::loadLanguage() will be made to load language names from interface messages. The PHP files will probably not be used at runtime, and the LocalNames files may mostly go away. (Note that the CldrNames files contain other data in addition to $languageNames – they also have $currencyNames, $currencySymbols, $countryNames, and $timeUnits, and a very small handful of LocalNames files also have these variables. I don’t think we’re planning to move those to messages, so this data may stay in PHP.)
  • rebuild.php and CLDRParser will be modified to:
    • load the old CLDR data from JSON files that are committed with the repository;
    • load the new CLDR data from the extracted zip file;
    • load the language name messages;
    • for each language name: if the message is identical to the old CLDR name, and the new CLDR name is different, change the message to the new CLDR name;
    • write all those messages back to i18n files, and then write the new CLDR data to those JSON files.
  • We can test the new rebuild.php version with an upgrade to CLDR 45, which was recently released and not yet integrated into the extension. (Hopefully there aren’t a lot of other disruptive changes in CLDR v45.)
  • Later, we also want to have some way for rebuild.php to mark translations as fuzzy if the old CLDR name, new CLDR name, and message language name are all different; however, this might not be included in the first version.

Also, we already made several improvements to the cldr extension during the hackathon: I attached two Gerrit changes to this task (update license, load be-tarask), and several other unrelated changes were merged (fix dash vs. hyphen for cpx, update LocalNamesMnw, more LocalNamesEn) or at least uploaded (fix mni script, add LocalNamesMni) during this weekend by various people.

  • load the language name messages;
  • for each language name: if the message is identical to the old CLDR name, and the new CLDR name is different, change the message to the new CLDR name;

In addition, the rebuild script would create messages (language names) that do not exist yet from new versions of the CLDR data (this is what rebuild currently does, and it should be preserved). This includes:

  • additional names of existing languages in more languages (e.g. the name of "English" in Hausa)
  • names of new languages (e.g. the name of "Dagbani")
  • names in entirely new languages that had no CLDR support before (i.e. new CLDR files). This will include:
    • languages that MediaWiki supports (which are currently loaded as normal)
    • languages that MediaWiki currently does not support. We do not load them at the moment, so when a new language is added to core, it will not be included in CLDR until the next rebuild. We may instead load them as well. Core MediaWiki uses an export threshold of 13%, but extensions use 0%, i.e. localizations of extensions are always exported even if the languages are not supported by MediaWiki core. If we add a message file in a language supported by translatewiki, it will similarly be imported to translatewiki.
  • Also, for languages supported in CLDR, a qqq message can be added to direct users to submit changes upstream to CLDR. Instead of plain wikitext, we could create two templates, {{CLDR language}} and {{CLDR language overrided}}, in translatewiki, so they will be easier to maintain.
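
The "create messages that do not exist yet" step might look roughly like this. It is a sketch that assumes messages and CLDR names are both dictionaries keyed by interface language; the function name is hypothetical, the key prefix is a parameter because the final key format was still under discussion, and the sample values are illustrative:

```python
def add_missing_messages(
    messages: dict[str, dict[str, str]],
    new_cldr: dict[str, dict[str, str]],
    prefix: str = "languagename-",
) -> dict[str, list[str]]:
    """Add messages for names that appear in new CLDR data but not in i18n.

    Existing messages are never modified here. Returns the keys that were
    added, grouped by interface language.
    """
    added: dict[str, list[str]] = {}
    for lang, names in new_cldr.items():
        # setdefault also covers interface languages that had no file before.
        existing = messages.setdefault(lang, {})
        for code, name in names.items():
            key = f"{prefix}{code}"
            if key not in existing:
                existing[key] = name
                added.setdefault(lang, []).append(key)
    return added


# Illustrative data only.
messages = {"en": {"languagename-en": "English"}}
new_cldr = {
    "en": {"en": "English", "dag": "Dagbani"},  # name of a new language
    "ha": {"en": "Turanci"},                    # name in a new interface language
}

print(add_missing_messages(messages, new_cldr))
```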