Specify the use of extended language codes in Lexemes
Open, High, Public


Since Lexemes represent parts of natural language, we will need codes that allow us to distinguish between different variants of a language in a fine-grained way. This will be needed for Lexeme lemmas in particular, but also for the representations of Forms, to capture region, period, dialect, etc. Note that a Lexeme can have any number of lemmas, but each lemma of a Lexeme must have a different language variant associated with it.

The ISO language codes used in Items will not be sufficient. There are several interlocking issues to consider:

Internal Representation of Language Variants

Since Wikibase Terms use (ISO) language codes to represent a term's language, it seems appropriate to continue doing this. The representation should follow RFC 5646 as suggested by this W3C article.

On the other hand, since with Wikidata we have a community maintained universal vocabulary, it makes sense to make use of Items (or item IDs) to identify the language variation.

RFC 5646 gives us considerable freedom to do this. One option would be:

  • if the selected item has a language code associated using P424, use that (and hope it's unique?)
  • else, use the Lexeme language's code (it SHOULD have one) and extend it by adding "-x-" and the Item ID. For example "de-x-Q1205".
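The option above can be sketched as follows. This is a minimal illustration, not Wikibase code: the function `variant_code`, its parameters, and the dictionary used to look up P424 values are all hypothetical stand-ins.

```python
from typing import Mapping, Optional


def variant_code(
    item_id: str,
    lexeme_language_code: str,
    p424_codes: Mapping[str, str],
) -> str:
    """Derive a language variant code for the Item selected as the variant.

    p424_codes is a hypothetical lookup from Item ID to the Item's P424
    value; a real implementation would read this from the Item itself.
    """
    code: Optional[str] = p424_codes.get(item_id)
    if code is not None:
        # The Item has its own language code: use it as-is.
        return code
    # Otherwise extend the Lexeme language's code with a private-use subtag.
    return f"{lexeme_language_code}-x-{item_id}"


# Assumed sample data: Q188 carries the code "de", Q1205 carries none.
sample_codes = {"Q188": "de"}
print(variant_code("Q188", "de", sample_codes))   # de
print(variant_code("Q1205", "de", sample_codes))  # de-x-Q1205
```

Note that the sketch does nothing about uniqueness of P424 values or about the length of the Item ID.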

However, this may fail because Item IDs may have more than 8 characters, and RFC 5646 allows at most 8 characters per subtag. The relevant production for language tag extensions according to RFC 5646 is singleton 1*("-" (2*8alphanum)) in ABNF. In PCRE that would be roughly [0-9A-Za-z](-[0-9A-Za-z]{2,8})+. Private-use subtags following "x" are subject to the same 8-character limit.
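The length limit can be checked with a few lines of Python. This sketch assumes the private-use production of RFC 5646, "x" 1*("-" (1*8alphanum)), since the proposed codes use "-x-"; the helper name `private_use_fits` is made up for illustration.

```python
import re

# Private-use part of a tag per RFC 5646: "x" 1*("-" (1*8alphanum)).
PRIVATE_USE = re.compile(r"x(-[A-Za-z0-9]{1,8})+", re.IGNORECASE)


def private_use_fits(item_id: str) -> bool:
    """Return True if "x-<item_id>" is a well-formed private-use sequence."""
    return PRIVATE_USE.fullmatch(f"x-{item_id}") is not None


print(private_use_fits("Q1205"))      # True: 5 characters
print(private_use_fits("Q1234567"))   # True: exactly 8 characters
print(private_use_fits("Q12345678"))  # False: 9 characters
```

Counting the leading "Q", only IDs with up to seven digits fit into a single subtag.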

Of course, Wikimedia could apply to IANA for a "q" singleton to be registered for Wikidata, so we could use "de-q-1205". But we would still run into issues with the length of the decimal Item ID. Base 48 could help, but would be ugly. Or the ID could be split, as in Q1234-5678. But that may cause confusion with structured entity IDs, which also use dashes as separators, as in L234243-F5.
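To make the two workarounds concrete, here is a rough sketch of re-encoding and of splitting the numeric part of an Item ID. It uses base 36 rather than the base mentioned above, on the assumption that subtags are case-insensitive alphanumerics and so offer only 36 distinct symbols; both function names are hypothetical.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number: int) -> str:
    """Encode a non-negative integer using digits 0-9 and letters a-z."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    out = []
    while number:
        number, remainder = divmod(number, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def split_decimal(number: int, width: int = 8) -> str:
    """Split the decimal ID into dash-separated chunks of at most `width` digits."""
    text = str(number)
    return "-".join(text[i:i + width] for i in range(0, len(text), width))


print(to_base36(123456789))        # 21i3v9
print(split_decimal(12345678, 4))  # 1234-5678
```

With 8 base-36 characters, IDs up to 36**8 - 1 (about 2.8 trillion) fit into one subtag, at the cost of codes that no longer show the Item ID in readable form.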

Representation of Language Variants in Output Formats

Wikibase exposes the language codes of terms as native features of an output format in three places:

  • in the lang attribute of HTML output
  • in the xml:lang attributes of RDF/XML output
  • in the @-suffixes on RDF literals in Turtle/N-Triples output

Question: should we expose the full code here, including the x-extension, or should the "special" bits be stripped for improved interoperability? What is the status of support for such codes in third-party tools?
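If the "special" bits were to be stripped, the operation itself would be simple. The following is only a sketch of that one option, with a hypothetical helper name; it relies on the fact that in RFC 5646 a single-character "x" subtag always starts the private-use part of a tag.

```python
def strip_private_use(code: str) -> str:
    """Drop a trailing private-use sequence ("-x-...") from a language code."""
    lowered = code.lower()
    if lowered.startswith("x-"):
        # Entirely private-use tag: nothing standard is left to keep.
        return code
    index = lowered.find("-x-")
    return code if index == -1 else code[:index]


print(strip_private_use("de-x-Q1205"))  # de
print(strip_private_use("en-GB"))       # en-GB
```

Stripping loses the distinction between variants, so two lemmas of the same Lexeme could end up with the same exposed code.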

Choosing a Language Variant in the UI

How should input for an extended language code function?

  • Does the user select an Item? Do we allow any Item, or just ones that have a value for P424 ("Wikimedia language code")?
  • If the Item has no code assigned, do we combine it with the Lexeme's language's code? Can we guarantee that this is defined?
  • Should the user instead select a language (code), and then optionally pick an Item in addition to that?
  • If the selected item has the base code itself assigned, do we still include the Item ID in the code? That is, do we use "de-x-Q188", or reduce it to just "de"?

Event Timeline

Is this ticket related to T195740? I am not sure, so I have not linked the two tickets, but if you think so, please link them.