
Creating a new lexeme always asks for spelling variant for languages without an ISO 639-1 code
Open, Needs Triage, Public

Description

In Special:NewLexeme, entering a language that has an ISO 639-1 code (e.g. English (en) or Portuguese (pt)) works as expected. But if one enters a language that only has an ISO 639-3 code (e.g. Cape Verdean Creole (kea) or Brazilian Portuguese), the "spelling variant" field is revealed and must be filled in manually.

@Nikki told me that this behavior is related to T209282.

Event Timeline

Lydia_Pintscher subscribed.

Yes, we do need the ISO code, so this is the intended behaviour for the languages where we can't get a code another way.

@Lydia_Pintscher I'm reopening this because it could get the code in another way, it just doesn't.

It currently uses P218, which is only two-letter codes. However, MediaWiki uses IETF language tags (with a few exceptions that people are trying to fix), which are a mixture of two-letter codes (ISO 639-1), three-letter codes (ISO 639-3) and two/three-letter codes followed by additional subtags (e.g. region, script). The two-letter codes are barely a third of our list, so it fails to match the majority of them.

The data is there though - we have a property for ISO 639-3 codes (P220) and for IETF language tags (P305).

Also, in my experience, the way it currently behaves is extremely confusing for users and has led to multiple people thinking we don't support a language when we actually do (as long as you select the language twice), so I would really like to see this fixed.

Lucas_Werkmeister_WMDE renamed this task from Creating a new lexeme always asks for spelling variant for languages without an ISO 936-1 code to Creating a new lexeme always asks for spelling variant for languages without an ISO 639-1 code.Jun 16 2021, 1:31 PM
Lucas_Werkmeister_WMDE updated the task description.

It looks like our Brazilian Portuguese item doesn’t actually have an ISO 639-3 code statement? The only language codes I see are IETF language tag (pt-BR) and POSIX locale identifier (pt_BR).

So a property dedicated to this would need to be proposed.

@Denny @daniel Can you chime in on this? (Becoming more pressing because of the upcoming work on the rework of Special:NewLexeme and @Mahir256's comment at T298142#7585471)

For the record, I don't remember what specifically went into the decision to rely on ISO-639-1. I have tried to gather some information around the topic. Here are some thoughts and observations:

  • Allowing both ISO-639-1 and ISO-639-3 would mean we'd end up with multiple identifiers for the same language. French could use either "fr" or "fra", and if we allowed ISO-639-2b as well, even "fre". That way, we may end up with multiple terms in the same language, using different codes. We'd need to manage a list of aliases for this to work properly.
  • Using P9753 seems like a workable solution, though it kind of feels like cheating to me... since it refers to Wikidata itself. Care should be taken to make sure these codes are consistent, unambiguous, and do not change. They should also be consistent with existing standards like BCP47. Also, perhaps this should be merged with P424 (Wikimedia language code)?
  • The NewLexeme form is very confusing to me... how would I enter something like de-x-Q2031873 to represent "German with spelling per the 1901 conventions, before the 1996 reform"? Why can't I enter a variant for a known language? To specify that the lemma is in Serbian with Cyrillic spelling, would I say that the language is Cyrillic Serbian (Q21161942)? But that isn't a language... The language should be Serbian (Q9299), and the variant should be selected based on Q21161942. Perhaps the language item should have a property that can refer to possible spelling variants, so they can be offered in the form? That would be nice.
  • The Wikidata data model does not specify which codes can be used in language tags. The conceptual model says "a short string for identifying languages, based on the language preference setting of logged in Wikipedia users. (This might be more similar to BCP 47 but is not necessarily the same either; it is more fine-grained than a GlobalSiteIdentifier) ".
  • The lexeme data model says "Note: the script of the Lemma items are indicated by the language script code, which should be a valid IETF language tag, although the current design is likely more restricted than the full spec.". IETF language tag is the same as BCP47 / P305.
  • If I had to design this again, I'd use just Q-Ids internally, and map to language code when generating HTML, RDF, etc.
  • HTML5 and XML require the lang attribute to be BCP47/RFC5646. The specs say "The lang attribute (in no namespace) specifies the primary language for the element's contents and for any of the element's attributes that contain text. Its value must be a valid BCP 47 language tag, or the empty string." and "The values of the attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages." respectively.
  • RDF Turtle requires language tags to be BCP 47: "Literals are composed of a lexical form and an optional language tag [BCP47] or datatype IRI.".
  • As far as I can determine from browsing the spec, BCP 47 is a superset of ISO-639-1. It includes many codes from ISO-639-2 and ISO-639-3, but only if there wasn't an ISO-639-1 code for it, to avoid ambiguity (see section 2.2.1 item 6).
  • P305 is used in Wikidata to refer to BCP 47 language codes.

Given all of the above, I would recommend the following to determine the language code for a given item: check P9753 (explicit wikidata code), then fall back to P305 (BCP 47), then fall back to P218 (ISO-639-1). Do not use P220 (ISO-639-3), since that might introduce ambiguity. The result should be compliant with RFC5646 (i.e. a code from the BCP 47 list or a code that is compatible with the extension mechanism described in the RFC).

I would also suggest always showing both form fields: one for the language, and one for the spelling. The spelling field can be pre-filled if there is a single known spelling for the language (identified by some property on the language's item). A drop-down could offer all spellings based on such a property. But additional spellings should be allowed, and the code determined per the above fallback mechanism.

If all else fails, we could still use mis-x-Qxxxx to generate a language code for any item, but that may be problematic if a language code is later introduced for that item. All terms would have to be re-tagged.
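As an illustration (not actual Wikibase behaviour), a last-resort private-use tag of this shape could be built as below. BCP 47 reserves the `x-` singleton for private use, and `mis` is the ISO 639-3 code for "uncoded languages"; the item ID used here is arbitrary:

```python
def private_use_tag(item_id: str) -> str:
    """Build a BCP 47 private-use tag like 'mis-x-Q12345' for an item.

    'mis' (uncoded languages) keeps the tag well-formed; the x- subtag
    carries the Wikidata item ID. BCP 47 limits each subtag to 8
    characters, so very long item IDs would not fit.
    """
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"not an item ID: {item_id}")
    if len(item_id) > 8:
        raise ValueError(f"item ID too long for a private-use subtag: {item_id}")
    return f"mis-x-{item_id}"

print(private_use_tag("Q33578"))  # mis-x-Q33578
```

The re-tagging risk mentioned above follows directly: once such a tag is stored on terms, assigning the item a real code later means rewriting every stored tag.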

Thank you, Daniel!
So what do people think about Daniel's proposal? Would it work and be an improvement in your opinion? Reminder we are talking about this: "check P9753 (explicit wikidata code), then fall back to P305 (BCP 47), then fall back to P218 (ISO-639-1)"

I think it should always ask for a spelling variant. Many languages use multiple scripts and have multiple codes, and picking one automatically makes it seem like there is only one "preferred" variant. One should have the option to add a "tg" representation of a Persian lexeme first, since Tajik is a dialect of Persian with a Cyrillic orthography; yet if a Tajik writer/typer wishes to add a lexeme this way, they have to change the code manually after creating the lexeme. This may be a contributing factor to why Persian has yet to be successfully merged the way other multi-script languages have been for lexemes. Having to select a variant for single-script languages is mildly annoying at worst, and generally still requires less typing than adding a multi-script lexeme.

Also, in my experience, the way it currently behaves is extremely confusing for users and has led to multiple people thinking we don't support a language when we actually do (as long as you select the language twice), so I would really like to see this fixed.

See for example T317193 where someone is asking for nso (which is a standard MediaWiki language) because the way it behaves makes them think it's not supported:

Sepedi (nso) lexemes can be added through a counterintuitive workaround by abusing the 'spelling variant of the Lemma' box and typing nso.

So what do people think about Daniel's proposal? Would it work and be an improvement in your opinion? Reminder we are talking about this: "check P9753 (explicit wikidata code), then fall back to P305 (BCP 47), then fall back to P218 (ISO-639-1)"

I think it's unnecessarily complicated:

  • P9753 has only been used on 23 items and all of the values are either BCP 47 codes or can't be used on Special:NewLexeme.
  • All ISO 639-1 codes are BCP 47 codes and all items with P218 have the same value for P305.

That just leaves P305.

My suggestion would be:

  • change the config to use P305 instead of P218
  • make sure it performs a case-insensitive comparison of the P305 value and the allowed language codes (BCP 47 is case-insensitive)
  • P9753 has only been used on 23 items and all of the values are either BCP 47 codes or can't be used on Special:NewLexeme.

In my opinion this means we need to populate the property.

  • P9753 has only been used on 23 items and all of the values are either BCP 47 codes or can't be used on Special:NewLexeme.

In my opinion this means we need to populate the property.

Even fully populated, there would be almost zero benefit to including it. Of the 602 currently supported codes for lexemes, only 20 are not BCP 47 codes. Most of those already shouldn't be used and we're trying to replace the rest.

Regardless of what the "perfect" solution should be, can it please at least be changed from P218 to P305?

Pros:

  • One-line change (here)
  • Existing codes continue to work
  • Easier for editors to use: they won't have to search for the same language twice when the code has three letters
  • More logical behaviour, people don't expect it to depend on how many letters the language code has
  • More than double the number of supported language items immediately
  • Has potential to support almost 50 times more language items
  • I will stop being annoyed by it

Cons:

  • Someone will have to change the gigantic InitialiseSettings.php file...?

Some data:

Here is a query for language codes currently used for lemmas and whether they match P218 (ISO 639-1) (the current setting) or P305 (IETF) (what I'm proposing we use) on the linked language item.

Current statistics from that query:

ISO 639-1 | IETF  | Number of languages | Number of lemmas
true      | true  | 171                 | 1110969
false     | true  | 239                 | 31133
false     | false | 1113                | 84278

IETF language tags include all of ISO 639-1, so all items with P218 have the same value for P305 (query).

There are 185 items with ISO 639-1 codes in Wikidata. ISO 639-1 hasn't changed in 20 years and MediaWiki already supports all of them except ae, ak, oj, lu and nr.

There are 8544 items with IETF language tags in Wikidata.

496 of the 1113 in the bottom row are ones of the form mis-x-QID where the language does have an IETF language tag but it's not yet supported by MediaWiki/Wikibase (i.e. they will move to the middle row as more language codes are added to MediaWiki/Wikibase).