We decided to have multi-variant lemmas on Lexemes, see T151582. That is, we support multiple representations (spellings, scripts) of the lemma.
This again raises two options:
- allow only one representation per language (in PHP, that would be a TermList; in JSON, this would be a simple object, using language codes as the keys and terms as the values)
- allow any number of representations per language (in PHP, that would be an AliasGroupList; in JSON, this would be an object with language codes as the keys but lists of terms as the values)
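To make the difference concrete, here is a rough sketch of the two JSON shapes, written as Python dicts. The lemma values and the language codes used as keys are made up for illustration; they are not taken from an actual Lexeme.

```python
import json

# Option 1: one representation per language code (TermList-like shape).
# Each language code maps to exactly one term.
lemmas_single = {
    "en-gb": "colour",  # illustrative values only
    "en-us": "color",
}

# Option 2: any number of representations per language code
# (AliasGroupList-like shape). Each language code maps to a list of terms.
lemmas_multi = {
    "en": ["colour", "color"],
}

print(json.dumps(lemmas_single, indent=2))
print(json.dumps(lemmas_multi, indent=2))
```

Note that the first shape needs two distinct codes to hold both spellings, while the second can keep them under one code.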
The advantage of one-per-language is that it is easier to use: we can apply the same language fallback we use for Item labels, and get a single string. The disadvantage is that we may have to invent language codes to cover regional differences, dialects, and changes over time. We may want to use Item qids instead of ISO codes to overcome this, but we would still have to map these to ISO codes, at least for use in HTML and RDF. We could also go with a hybrid approach: ISO language codes suffixed by qids, e.g. de-au.Q131964. The suffixes could simply be stripped for use in HTML and RDF, but we'd need a rather complex widget for picking and editing the language code.
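As a minimal sketch of what this could look like for a consumer, assuming the hybrid code is always "ISO code, dot, qid" as in the example above: the function names `strip_qid_suffix` and `pick_lemma` are hypothetical, and the fallback chain is passed in by the caller rather than taken from any existing fallback implementation.

```python
from typing import Optional


def strip_qid_suffix(code: str) -> str:
    """Drop a trailing '.Q<digits>' suffix, if there is one."""
    base, sep, suffix = code.rpartition(".")
    if sep and suffix[:1] == "Q" and suffix[1:].isdigit():
        return base
    return code


def pick_lemma(lemmas: dict[str, str], fallback_chain: list[str]) -> Optional[str]:
    """Return the lemma for the first code in the chain that has one.

    With one representation per language, the result is a single string.
    """
    for code in fallback_chain:
        if code in lemmas:
            return lemmas[code]
    return None


# Hybrid code from the example above, reduced to the part usable in HTML/RDF.
html_lang = strip_qid_suffix("de-au.Q131964")

# Illustrative lemma data and fallback chain (assumptions, not real data).
lemma = pick_lemma({"en-gb": "colour", "en-us": "color"}, ["de", "en-gb", "en-us"])
```

Codes without a suffix pass through `strip_qid_suffix` unchanged, so the same call could be applied to every key regardless of which scheme produced it.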
Alternatively, we may allow any number of representations with the same language code. This is what the Lemon model does: it allows a set of arbitrary representations, with no restrictions on the language markers. This adds complexity for consumers that need a single value: even after finding the correct group by applying language fallback, they would have to pick one member of the group at random, or concatenate them. The advantage of this approach is that we can rely on a closed set of language codes, for which we can assume support by clients.
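A rough sketch of the extra step a consumer would face with this shape, under the assumption that language fallback is a simple ordered list of codes. The function name `single_value` and the "/" separator are hypothetical choices for illustration; taking the first member stands in for "pick one at random".

```python
from typing import Optional


def single_value(
    lemmas: dict[str, list[str]],
    fallback_chain: list[str],
    concatenate: bool = True,
    separator: str = "/",
) -> Optional[str]:
    """Reduce a multi-representation lemma to one string.

    First find the group via language fallback, then either join all
    members of the group or arbitrarily take the first one.
    """
    for code in fallback_chain:
        group = lemmas.get(code)
        if group:
            return separator.join(group) if concatenate else group[0]
    return None


# Illustrative data: two spellings under the same language code.
lemmas = {"en": ["colour", "color"]}

joined = single_value(lemmas, ["de", "en"])
first = single_value(lemmas, ["de", "en"], concatenate=False)
```

Neither result is obviously the right one to display, which is the complexity this option pushes onto consumers.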