
Decide on modelling for a type for natural languages (for ZMonoLingualString, etc.)
Closed, Resolved, Public

Description

The language on MonoLingualString shouldn't be a Z6/String, but a proper type. This will allow us to type check the field and to support better entry of the field.

What would be the right type? Given how we are using these strings, here are a few candidates:

  1. MediaWiki interface language
  2. Wikipedia project code
  3. MediaWiki content language / Wikipedia project language
  4. Wikidata Lexeme language
  5. Wikidata label language
  6. BCP 47 language (or some other appropriate external standard)
  7. A new language object for Wikilambda

It is likely that these different definitions of language will all appear at one point or another.

Right now we are using the language codes for representing labels for types, keys, etc. Given that, these language codes must actually align with how MediaWiki treats languages. So we probably should go for interpretation 1 given above.

Given that, MonoLingualString and MultiLingualString need to be understood as being monolingual or multilingual strings based on the MediaWiki interface language. A more obvious name might make sense, but since we don't have any of the other interpretations yet, we can perhaps avoid renaming them to something even more clunky (also, remember, the name will be editable in the wiki anyway; this is about the name of the internal PHP shadow class).
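To illustrate what a typed language field buys over a bare string, here is a minimal Python sketch. It is not the WikiLambda implementation: the class `InterfaceLanguage`, the set `KNOWN_INTERFACE_LANGUAGES`, and the handful of codes in it are assumptions made up for this example, standing in for whatever list of interface languages MediaWiki actually provides.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the set of MediaWiki interface language codes.
# The real list would come from MediaWiki, not from a hard-coded set.
KNOWN_INTERFACE_LANGUAGES = frozenset({"en", "de", "fr"})


@dataclass(frozen=True)
class InterfaceLanguage:
    """A language identified by a (hypothetical) interface language code."""

    code: str

    def __post_init__(self):
        # Reject codes that are not in the known set.
        if self.code not in KNOWN_INTERFACE_LANGUAGES:
            raise ValueError(f"unknown interface language code: {self.code!r}")


@dataclass(frozen=True)
class MonoLingualString:
    """A string in exactly one language; the language is typed, not a str."""

    language: InterfaceLanguage
    text: str

    def __post_init__(self):
        # This is the type check that a plain string field cannot offer.
        if not isinstance(self.language, InterfaceLanguage):
            raise TypeError("language must be an InterfaceLanguage, not a bare string")


label = MonoLingualString(InterfaceLanguage("en"), "natural language")
```

With this shape, both a misspelled code and a bare string in the language field are rejected at construction time, and an editing interface could offer the known set as a picker instead of a free-text box.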

Event Timeline

For a reference on the (exceptionally large) number of ways that Wikidata is currently handling languages, see Lea's draft table here:
https://www.wikidata.org/wiki/User:Lea_Lacroix_(WMDE)/List_of_lists_of_languages

We talked about this in the weekly task triage/prioritisation meeting. I discussed a few points; principally:

  • We should probably try to use BCP 47 as the main standard;
  • We need to know the MediaWiki "code" for our content so it can integrate into the UX and content request models of MW;
  • MW has a mechanism to convert back and forth from BCP 47 to MW "code";
  • MW has (from CLDR) labels in ~every "code" for each MW "code";
  • We will want to represent content that BCP 47 doesn't have a mechanism for (e.g. "languages" that are considered groups of languages, or not yet languages, by upstream); and
  • We will want to represent content that MediaWiki doesn't (e.g. support for en-IN vs. en-US vs. en will almost certainly be needed by our communities wanting to provide more credible content for different locales/experiences).

Consequently, I proposed that we will want a ZLanguage object type that has a key for the BCP 47 code and one for the MW code, with the ability to inherit/redirect between them where no direct mapping exists; this should have the CLDR / MW content transparently shown to readers as if it were real on-wiki content, but shouldn't be editable (edits to be made through the CLDR committee process, standard publications, and thence via git). Potentially this could be an entire class of objects, LObjects or whatever, given the special UX handling we'd want to enforce.
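As a rough sketch of the proposed shape, the following Python models a language object with one key for the BCP 47 code and one for the MW code, plus a redirect that is followed when no direct mapping exists. Everything named here is an assumption for illustration only: the field names, the `fallback` key, the identifiers such as `lang-en-IN`, and the function `resolve_mw_code` are hypothetical and do not come from WikiLambda.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ZLanguage:
    """Hypothetical language object with separate BCP 47 and MW code keys."""

    ident: str                      # hypothetical object identifier
    bcp47: Optional[str] = None     # None if BCP 47 has no code for it
    mw_code: Optional[str] = None   # None if MediaWiki has no code for it
    fallback: Optional[str] = None  # ident of the language to inherit from


def resolve_mw_code(ident: str, registry: Dict[str, ZLanguage]) -> Optional[str]:
    """Follow fallback links until a language with an MW code is found."""
    seen = set()
    current = registry.get(ident)
    while current is not None and current.ident not in seen:
        if current.mw_code is not None:
            return current.mw_code
        seen.add(current.ident)  # guard against redirect cycles
        current = registry.get(current.fallback) if current.fallback else None
    return None


# Illustrative registry: en-IN has a BCP 47 code but, per the discussion
# above, no MW code of its own, so it redirects to en.
REGISTRY = {
    "lang-en": ZLanguage("lang-en", bcp47="en", mw_code="en"),
    "lang-en-IN": ZLanguage("lang-en-IN", bcp47="en-IN", fallback="lang-en"),
}

mw_code_for_en_in = resolve_mw_code("lang-en-IN", REGISTRY)
```

Keeping both codes optional lets the same object type cover languages that only one of the two systems knows about, while the redirect gives MediaWiki something it can render for locales such as en-IN.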

DVrandecic added a subscriber: Lea_Lacroix_WMDE.

Thanks @ArthurPSmith for the link to @Lea_Lacroix_WMDE's writeup, which is super useful. Thanks @Jdforrester-WMF for the considerations. I'll come up with two or three possible designs and then we can discuss them.

Change 666416 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/WikiLambda@master] Provide SQL for Natural language type (Z60)

https://gerrit.wikimedia.org/r/666416

Change 666416 merged by jenkins-bot:
[mediawiki/extensions/WikiLambda@master] Provide built-in JSON for Natural language type (Z60)

https://gerrit.wikimedia.org/r/666416

Jdforrester-WMF renamed this task from "Have a type for the language on MonoLingualString" to "Decide on modelling for a type for natural languages (for ZMonoLingualString, etc.)". Apr 1 2021, 11:20 PM