
Investigate and decide the representation of languages and variants in Lexeme entities
Open, Medium, Public

Description

We need to specify:

  • the "language" of a Lexeme
  • the "language" of each Term of a multi-variant lemma (and Term representation, and Sense gloss)

There are essentially two options for representing a language or variant:

  • an ISO code, or an IETF language tag
  • a Q-Item reference (possibly mapped to an ISO code or IETF tag somehow)

There are several kinds of variants, mainly:

  • spellings
  • scripts
  • dialects

Of these combinations:

  • some have valid IETF tags
  • others have a valid fall-back tag
  • others don't have valid tags (or would have internal tags)

There are several aspects that need to be considered:

  • (A) input of language/variant for a term
  • (B) definition of applicable IETF tag/update of tag for a language variant
  • (C) output

(C) For representing lemmas (and representations and glosses) in HTML and RDF, we need IETF tags or ISO language codes. We can put ISO codes in Statements on items about languages, but then these statements must not change. We can make up ISO codes from Q-Ids, e.g. qqd-Q7832478, but that's not very useful, since consumers don't understand them.
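
As a rough illustration of point (C), the sketch below shows where a language tag ends up in each output format: the HTML lang attribute and the language tag of an RDF literal (Turtle syntax). The function names and the sample lemma are made up for this example and are not taken from Wikibase.

```python
from html import escape


def lemma_to_html(text: str, lang_tag: str) -> str:
    """Wrap a lemma in a span whose lang attribute carries the IETF tag."""
    return f'<span lang="{escape(lang_tag, quote=True)}">{escape(text)}</span>'


def lemma_to_turtle_literal(text: str, lang_tag: str) -> str:
    """Format a lemma as a language-tagged RDF literal in Turtle syntax."""
    # Escape backslashes first, then double quotes, so the literal stays valid.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang_tag}'


print(lemma_to_html("colour", "en-GB"))            # <span lang="en-GB">colour</span>
print(lemma_to_turtle_literal("colour", "en-GB"))  # "colour"@en-GB
```

Both outputs only work for consumers if the tag is one they recognise, which is why a made-up code is of limited use here.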

Currently, monolingual text uses:

  • (A) input in the form of code or selection from list of names
  • (B) definition in a special internal table, mostly based on the existing interface languages, with updates through a somewhat time-consuming and opaque procedure. Only a few codes have been added this way.
  • (C) storage in the form of codes

Units (for quantities) might initially have followed a model similar to monolingual text; this was later simplified to a model using QIDs.

See also https://lists.wikimedia.org/pipermail/wikidata-tech/2016-November/001070.html and following.

Event Timeline

The French Wiktionary is currently using the fixed list approach for languages, administered by the community. We use ISO codes to represent languages almost everywhere (e.g. the title of a language section is {{langue|fr}}, which is displayed as "French"). We do not have identifiers for dialects and we usually use a geographic term. Not sure if those should be as controlled as languages.

I expanded the description above. Here are a few clarifications and an explanation of a possible solution.

  1. The other day we looked into the way that the dialect of Finnish place names should be specified (discussion, WP article on some).
  2. It seems that a few may have ISO codes/IETF tags, while others don't. With the way monolingual text codes currently work, I think one would first have to ask the relevant IETF-appointed administrator for these tags to define a variant tag and then wait for WMDE to add it. Alternatively, if we leave the interface open as we do for new items, one could just make up a code on the fly. The latter isn't really desirable.
  3. To ensure that the variant is identified, one would need to define it through the QID of an item, since not all variants have IETF tags, or have IETF tags known to Wikidata.
  4. Whenever possible, output should use IETF tags. These could be defined on the item of the language.
  5. If no IETF tag is defined, the item of the language could also offer a fallback (just "fi" instead of "fi-hämäläismurteet"). Possibly, a fallback could also be "fi-x-QID".
  6. Once in a while, we could request new IETF tags for variants that need them.
  7. To simplify input by users (and bots/tools), it should be possible to input IETF tags (e.g. "de") instead of a QID.
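
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the LanguageItem class, its field names, and the QIDs in the sample data are hypothetical placeholders, not an existing Wikibase API or real language items. Returning "und" (the BCP 47 tag for an undetermined language) when an item defines nothing is also my assumption; the list above does not cover that case.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageItem:
    """Hypothetical view of a language/variant item."""
    qid: str
    ietf_tag: Optional[str] = None      # point 4: tag defined on the item
    fallback_tag: Optional[str] = None  # point 5: fallback offered by the item


# Placeholder data; these QIDs are invented for the example.
ITEMS = {
    "Q900000": LanguageItem("Q900000", ietf_tag="fi"),
    "Q900001": LanguageItem("Q900001", fallback_tag="fi"),  # a dialect without its own tag
    "Q900002": LanguageItem("Q900002"),                     # nothing defined at all
}

QID_RE = re.compile(r"Q[1-9][0-9]*")


def output_tag(item: LanguageItem, private_use: bool = False) -> str:
    """Choose the tag for output: own tag, else fallback, else 'und'."""
    if item.ietf_tag:
        return item.ietf_tag
    if item.fallback_tag:
        if private_use:
            # The "fi-x-QID" idea from point 5.
            return f"{item.fallback_tag}-x-{item.qid}"
        return item.fallback_tag
    return "und"  # assumption, see lead-in


def resolve_input(value: str) -> LanguageItem:
    """Point 7: accept either a QID or an IETF tag as user input."""
    value = value.strip()
    if QID_RE.fullmatch(value):
        return ITEMS[value]  # raises KeyError for an unknown QID
    matches = [
        item for item in ITEMS.values()
        if item.ietf_tag and item.ietf_tag.lower() == value.lower()
    ]
    if len(matches) != 1:
        raise LookupError(f"no unique item for tag {value!r}")
    return matches[0]


print(output_tag(resolve_input("fi")))                  # fi
print(output_tag(ITEMS["Q900001"]))                     # fi
print(output_tag(ITEMS["Q900001"], private_use=True))   # fi-x-Q900001
print(output_tag(ITEMS["Q900002"]))                     # und
```

One caveat with the "fi-x-QID" form: BCP 47 limits each private-use subtag to eight characters, so a QID with more than seven digits would not fit as a single subtag.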