We need to specify:
- the "language" of a Lexeme
- the "language" of each Term of a multi-variant lemma (and of each Term representation and Sense gloss)
There are essentially two options for representing a language or variant:
- an ISO code, or an IETF language tag
- a Q-Item reference (possibly mapped to an ISO code or IETF tag somehow)
There are several kinds of variants, mainly:
- spellings
- scripts
- dialects
Of these variants and their combinations:
- some have valid IETF tags
- others have a valid fall-back tag
- others don't have valid tags (or would have internal tags)
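The three cases above can be sketched as a lookup that first tries the requested tag, then falls back by dropping subtags from the right, and finally reports that no tag applies. This is only a minimal sketch: the set `KNOWN_TAGS` and the function `resolve_tag` are hypothetical names chosen for illustration, and the truncation is a simplified form of the usual lookup-by-truncation idea, not an existing implementation.

```python
# Hypothetical set of tags the system already knows; illustrative only,
# not an actual Wikibase table.
KNOWN_TAGS = {"de", "de-1996", "en", "en-GB", "sr-Cyrl", "sr-Latn"}


def resolve_tag(requested, known=KNOWN_TAGS):
    """Return (tag, kind), where kind is "exact", "fallback" or "none"."""
    # Language tags are case-insensitive, so compare in lower case.
    by_lower = {tag.lower(): tag for tag in known}
    subtags = requested.split("-")
    # Drop subtags from the right until a known tag is found.
    for end in range(len(subtags), 0, -1):
        candidate = "-".join(subtags[:end]).lower()
        if candidate in by_lower:
            kind = "exact" if end == len(subtags) else "fallback"
            return by_lower[candidate], kind
    # Nothing matched: an internal (private) tag would be needed.
    return None, "none"


print(resolve_tag("sr-Latn"))     # exact match
print(resolve_tag("de-CH-1996"))  # falls back to "de"
print(resolve_tag("nds-x-foo"))   # no known tag
```

A fall-back tag loses information (here the region and the spelling variant), which is one reason the third case may still need internal tags.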
There are several aspects that need to be considered:
- (A) input of language/variant for a term
- (B) definition of applicable IETF tag/update of tag for a language variant
- (C) output
(C) For representing lemmas (and representations and glosses) in HTML and RDF, we need IETF tags or ISO language codes. We can put ISO codes into Statements on items about languages, but then these Statements must not change. We can make up codes from Q-Ids, e.g. qqd-Q7832478, but that's not very useful, since consumers don't understand them.
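The two output options in (C) could be combined roughly as follows: use a code taken from a Statement where one exists, and otherwise make one up from the Q-Id. This is a sketch under assumptions: `ITEM_TO_TAG` stands in for codes read from Statements, `tag_for_item` is a hypothetical helper, and the `qqd` prefix is simply copied from the example above.

```python
import re

# Hypothetical stand-in for codes read from Statements on language items.
ITEM_TO_TAG = {"Q1860": "en", "Q188": "de"}


def tag_for_item(qid, mapping=ITEM_TO_TAG, prefix="qqd"):
    """Return the mapped code for a language item, or a made-up one."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Q-Id: {qid!r}")
    # Prefer a code from the mapping; otherwise build one from the Q-Id,
    # following the "qqd-Q7832478" pattern (an assumption, not a standard).
    return mapping.get(qid, f"{prefix}-{qid}")


print(tag_for_item("Q188"))      # mapped code
print(tag_for_item("Q7832478"))  # made-up code
```

The made-up form is stable even if Statements change, but as noted above, consumers of the HTML or RDF output cannot interpret it.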
Currently, monolingual text uses:
- (A) input in the form of code or selection from list of names
- (B) definition in a special (internal) table, mostly based on existing interface languages, and update through a somewhat time-consuming and opaque procedure. Accordingly, only a few codes have been added.
- (C) storage in the form of codes
Units (for quantities) might initially have followed a model similar to monolingual text; this was then simplified to a model using QIDs.
See also https://lists.wikimedia.org/pipermail/wikidata-tech/2016-November/001070.html and following.