Decide whether a Lexeme's lemma can have multiple representations for the same language code.
Closed, Resolved, Public

Description

We decided to have multi-variant lemmas on Lexemes, see T151582. That is, we support multiple representations (spellings, scripts) of the lemma.

This again raises two options:

  • allow only one representation per language code (in PHP, that would be a TermList; in JSON, this would be a simple object, using language codes as the keys and terms as the values)
  • allow any number of representations per language code (in PHP, that would be an AliasGroupList; in JSON, this would be an object with language codes as the keys but lists of terms as the values)
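To make the two shapes concrete, here is a rough sketch in Python. The lemma data and the helper name to_multi_valued are made up for illustration, and the structures follow the simplified description above rather than any actual Wikibase serialization.

```python
import json

# Option 1: one representation per language code.
# (Illustrative data only; the language codes and spellings are assumptions.)
single_valued = {
    "en-gb": "colour",
    "en-us": "color",
}

# Option 2: any number of representations per language code.
multi_valued = {
    "en": ["colour", "color"],
}


def to_multi_valued(lemmas):
    """Wrap each single term in a one-element list.

    Hypothetical helper: it shows that every single-valued structure
    can be expressed in the multi-valued shape without losing data.
    """
    return {code: [term] for code, term in lemmas.items()}


print(json.dumps(single_valued, indent=2))
print(json.dumps(multi_valued, indent=2))
print(json.dumps(to_multi_valued(single_valued), indent=2))
```

The reverse conversion is not possible in general, since a list with more than one term cannot be reduced to a single string without dropping or merging values.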

The advantage of one-per-language is that it is easier to use: we can apply the same language fallback we use for Item labels, and get a single string. The disadvantage is that we may have to invent language codes to cover regional differences, dialects, and changes over time. We may want to use Item qids instead of ISO codes to overcome this, but we would have to map these to ISO codes at least for use in HTML and RDF. We could also go with a hybrid approach, ISO language codes suffixed by qids, e.g. de-au.Q131964. The suffixes could just be stripped for use in HTML and RDF, but we'd need a rather complex widget for picking and editing the language code.
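Stripping the suffix in the hybrid approach could look roughly like the following Python sketch. It assumes the suffix is always a dot followed by a qid, as in the de-au.Q131964 example; the function name strip_qid_suffix and the exact pattern are illustrative, not a decided format.

```python
import re

# Assumed suffix format: a dot, then "Q" and a positive integer,
# at the very end of the code (e.g. "de-au.Q131964").
_QID_SUFFIX = re.compile(r"\.Q[1-9][0-9]*$")


def strip_qid_suffix(code):
    """Return the plain language code for use in HTML and RDF.

    Codes without a qid suffix are returned unchanged.
    """
    return _QID_SUFFIX.sub("", code)


print(strip_qid_suffix("de-au.Q131964"))
print(strip_qid_suffix("de"))
```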

Alternatively, we may allow any number of representations with the same language code. This is what the Lemon model does: it allows a set of arbitrary representations, with no restrictions on the language markers. This adds complexity for consumers that need a single value: even after finding the correct group by applying language fallback, they would have to pick one member of the group at random, or concatenate them. The advantage of this approach is that we can rely on a closed set of language codes, for which we can assume support by clients.
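The extra work for consumers that need a single value might look like the sketch below (Python). The fallback chain is passed in as a plain list of language codes, which is a simplification of the real language fallback, and pick_lemma and its parameters are hypothetical names. Instead of picking a member at random, the sketch takes the first one so that the result is deterministic.

```python
def pick_lemma(lemmas, fallback_chain, concatenate=False, separator=" / "):
    """Reduce a multi-valued lemma structure to a single string.

    lemmas: dict mapping language codes to lists of terms.
    fallback_chain: language codes to try, in order of preference.
    Returns None if no code in the chain has any representation.
    """
    for code in fallback_chain:
        group = lemmas.get(code, [])
        if group:
            if concatenate:
                return separator.join(group)
            # Arbitrary choice: the first member of the group.
            return group[0]
    return None


# Illustrative data only.
example_lemmas = {"en": ["colour", "color"]}

print(pick_lemma(example_lemmas, ["de", "en"]))
print(pick_lemma(example_lemmas, ["de", "en"], concatenate=True))
```

With only one representation per language code, the inner choice disappears and the fallback loop alone yields the string.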

See also:
T151626: Investigate and decide the representation of languages and variants in Lexeme entities

NOTE: This needs to be decided for the canonical representation of Lexemes before going live. The multi-value JSON representation is forward-compatible, while the single-value JSON structure is not.

Event Timeline

Is a dialect a language for the purposes of this discussion?

Lydia_Pintscher claimed this task.
Lydia_Pintscher moved this task from Backlog to Done on the Wikidata-Former-Sprint-Board board.
Lydia_Pintscher subscribed.

Decision: Yes in the future, but for now we only allow one representation per language code.

@ChristianKl: Yes. Sorry for answering only now. The previous reply didn't get sent, it seems.