In order to store Wikibase terms, such as labels and aliases, in Elastic, we need to find a good way to represent multi-lingual values.
That representation has to support language fallback: if de-ch falls back to de, then searching for de-ch:Haus should also find de:Haus (and possibly also vice versa).
Note that the term representation in Elastic is intended not merely as a search index, but also for retrieving all labels/descriptions for a given subject.
Language fallback support can be achieved using index expansion (indexing de:Haus also as de-ch:Haus) or query expansion (a search for de-ch:Haus turns into a search for de-ch:Haus or de:Haus). Index expansion requires more space, and query expansion requires more time.
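The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration: the FALLBACKS table and the function names are hypothetical, and a real index expansion would presumably skip languages that already have their own term.

```python
# Hypothetical fallback chains; real ones would come from the wiki's language configuration.
FALLBACKS = {"de-ch": ["de"], "de": [], "en": []}


def expand_query(lang, text):
    """Query expansion: search the requested language, then each fallback language."""
    return [(code, text) for code in [lang] + FALLBACKS.get(lang, [])]


def expand_index(lang, text):
    """Index expansion: store the term under its own language and under every
    language whose fallback chain contains it."""
    inheritors = [code for code, chain in FALLBACKS.items() if lang in chain]
    return [(code, text) for code in [lang] + inheritors]


print(expand_query("de-ch", "Haus"))  # [('de-ch', 'Haus'), ('de', 'Haus')]
print(expand_index("de", "Haus"))     # [('de', 'Haus'), ('de-ch', 'Haus')]
```

Query expansion does its extra work on every search, while index expansion does it once per write and pays for it in stored copies.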
A compromise could be a multi-value "all languages" field in addition to the per-language fields. This would make it possible to implement language fallback programmatically, without greatly increasing storage size and schema complexity.
For instance: if there is only an English label, all languages fall back to English, and we have 100 languages configured, index expansion would store the English label 100 times. The all-languages approach would store it twice.
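The arithmetic of that example can be checked with a small sketch. The document shapes, the field names "per_language" and "all", and the placeholder language codes are assumptions made for illustration, not a proposed schema.

```python
def index_expanded_labels(labels, languages, fallbacks):
    """Index expansion: every configured language gets a copy of the label it resolves to."""
    stored = {}
    for lang in languages:
        for code in [lang] + fallbacks.get(lang, []):
            if code in labels:
                stored[lang] = labels[code]
                break
    return stored


def all_languages_labels(labels):
    """All-languages approach: one per-language value plus one entry in a shared multi-value field."""
    return {
        "per_language": dict(labels),
        "all": [{"language": lang, "value": text} for lang, text in labels.items()],
    }


# 100 configured languages: English plus 99 placeholder codes that fall back to it.
languages = ["en"] + [f"lang-{i}" for i in range(1, 100)]
fallbacks = {lang: ["en"] for lang in languages if lang != "en"}
labels = {"en": "house"}

expanded = index_expanded_labels(labels, languages, fallbacks)
compact = all_languages_labels(labels)
print(len(expanded))                                       # 100
print(len(compact["per_language"]) + len(compact["all"]))  # 2
```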
However, all-languages needs two queries (one for the exact match, and one against the all-languages field), and the second can potentially return a large result set to process. Simple query expansion, by comparison, also rarely needs more than two queries. The main advantage of all-languages is that it provides a cheap way to get all labels in all languages.
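A minimal in-memory sketch of the two-query lookup follows. The documents, the "labels_all" field that stores (language, value) pairs, and the fallback chain are all hypothetical; in Elastic the two list comprehensions would be two separate queries.

```python
FALLBACKS = {"de-ch": ["de", "en"]}  # hypothetical fallback chain

# Hypothetical documents: per-language label fields plus a multi-value "all" field.
DOCS = [
    {"id": "Q1", "labels": {"de": "Haus"}, "labels_all": [("de", "Haus")]},
    {"id": "Q2", "labels": {"de-ch": "Haus"}, "labels_all": [("de-ch", "Haus")]},
    {"id": "Q3", "labels": {"fr": "Haus"}, "labels_all": [("fr", "Haus")]},
]


def search(lang, text):
    """Return (entity id, matched language) pairs, exact matches first."""
    # Query 1: exact match on the per-language field.
    exact = [(doc["id"], lang) for doc in DOCS if doc["labels"].get(lang) == text]
    found = {entity_id for entity_id, _ in exact}

    # Query 2: match in any language. This candidate set can be large.
    candidates = [
        doc for doc in DOCS
        if doc["id"] not in found and any(value == text for _, value in doc["labels_all"])
    ]

    # Language fallback is applied programmatically to the candidates.
    via_fallback = []
    for doc in candidates:
        for code in FALLBACKS.get(lang, []):
            if (code, text) in doc["labels_all"]:
                via_fallback.append((doc["id"], code))
                break
    return exact + via_fallback


print(search("de-ch", "Haus"))  # [('Q2', 'de-ch'), ('Q1', 'de')]
```

Q3 is a candidate of the second query but is discarded afterwards, because fr is not in the fallback chain of de-ch. That discarded work is the cost of the large result set.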
Use case 1: Find entities of a specific type that have a label or alias that fits some input as a completion match (prefix match) in a given language or one of the associated fallback languages. With the result, provide the description of the matched entities in the given language (or one of the fallback languages). If fallback applies, also report back the actual language of the description and label or alias. The result should be ranked by relevance, based on the entity's weight and the quality of the match.
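Use case 1 could be sketched as follows, again in memory rather than against Elastic. The entity records, the fallback chain, and especially the scoring formula are assumptions; the text only says that ranking depends on entity weight and match quality, not how.

```python
FALLBACKS = {"de-ch": ["de", "en"]}  # hypothetical fallback chain

# Hypothetical entities; "terms" holds labels and aliases per language.
ENTITIES = [
    {"id": "Q1", "type": "item", "weight": 5.0,
     "terms": {"de": ["Haus", "Gebäude"]}, "descriptions": {"de": "Bauwerk"}},
    {"id": "Q2", "type": "item", "weight": 1.0,
     "terms": {"de-ch": ["Hausen"]}, "descriptions": {"en": "village"}},
    {"id": "P1", "type": "property", "weight": 9.0,
     "terms": {"de": ["Hausnummer"]}, "descriptions": {}},
]


def complete(prefix, lang, entity_type):
    """Prefix-match labels/aliases along the fallback chain and rank the hits."""
    chain = [lang] + FALLBACKS.get(lang, [])
    needle = prefix.casefold()
    hits = []
    for entity in ENTITIES:
        if entity["type"] != entity_type:
            continue
        # First matching term, preferring languages earlier in the chain.
        match = next(
            ((rank, code, term)
             for rank, code in enumerate(chain)
             for term in entity["terms"].get(code, [])
             if term.casefold().startswith(needle)),
            None,
        )
        if match is None:
            continue
        rank, term_lang, term = match
        desc_lang = next((code for code in chain if code in entity["descriptions"]), None)
        # Assumed scoring: entity weight, scaled by how much of the term the
        # prefix covers, damped by the distance along the fallback chain.
        score = entity["weight"] * (len(needle) / max(len(term), 1)) / (1 + rank)
        hits.append({
            "id": entity["id"],
            "term": term,
            "term_language": term_lang,
            "description": entity["descriptions"].get(desc_lang),
            "description_language": desc_lang,
            "score": score,
        })
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


print([hit["id"] for hit in complete("hau", "de-ch", "item")])  # ['Q1', 'Q2']
```

Each hit carries the actual language of the matched term and of the description, so the caller can tell when fallback was applied.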
Use case 2: Get the label and description of a given entity in a given language (or one of the fallback languages). If fallback applies, also report back the actual language of the description and label or alias.
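For a single entity, the lookup in use case 2 amounts to walking the fallback chain and reporting which language actually supplied the value. The sketch below assumes a hypothetical FALLBACKS table and a plain dictionary as the entity record.

```python
FALLBACKS = {"de-ch": ["de", "en"], "de": ["en"]}  # hypothetical fallback chains


def resolve_term(values_by_language, lang):
    """Return (actual_language, value) for the first language in the chain
    that has a value, or None if no language in the chain has one."""
    for code in [lang] + FALLBACKS.get(lang, []):
        if code in values_by_language:
            return code, values_by_language[code]
    return None


def label_and_description(entity, lang):
    """Resolve label and description independently; they may come from different languages."""
    return {
        "label": resolve_term(entity["labels"], lang),
        "description": resolve_term(entity["descriptions"], lang),
    }


entity = {
    "labels": {"de": "Haus", "en": "house"},
    "descriptions": {"en": "building for habitation"},
}
print(label_and_description(entity, "de-ch"))
# {'label': ('de', 'Haus'), 'description': ('en', 'building for habitation')}
```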
Use case 3: Get a set of entities (possibly filtered by entity type) that match (full text, anywhere) some user input in a given language. Several fields should be considered, including statement values (with low weight, except for external ids, which should have high weight) and site links (with high weight, and extra boost if they match the language), as well as labels and aliases (with high weight, and extra boost if they match the language), and descriptions (with low weight).
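One way to express these weights is a multi_match query with per-field boosts inside a bool query, built here as a plain Python dictionary. All field names (labels_all, sitelinks_all, external_ids, entity_type and so on) and all boost values are hypothetical; only the relative ordering of the weights follows the description above.

```python
def build_fulltext_query(text, lang, entity_type=None):
    """Build an Elasticsearch query body with assumed field names and boosts."""
    fields = [
        # High weight in any language, extra boost for the requested language.
        "labels_all^3", f"labels.{lang}^5",
        "aliases_all^3", f"aliases.{lang}^5",
        "sitelinks_all^3", f"sitelinks.{lang}^5",
        # External ids are statement values that deserve a high weight.
        "external_ids^3",
        # Low weight for descriptions and for other statement values.
        f"descriptions.{lang}^0.5",
        "statement_values^0.5",
    ]
    bool_query = {"must": [{"multi_match": {"query": text, "fields": fields}}]}
    if entity_type is not None:
        # Filters restrict the result set without affecting the score.
        bool_query["filter"] = [{"term": {"entity_type": entity_type}}]
    return {"query": {"bool": bool_query}}


body = build_fulltext_query("Haus", "de", entity_type="item")
print(body["query"]["bool"]["filter"])  # [{'term': {'entity_type': 'item'}}]
```

The resulting dictionary would be sent as the body of a search request; the boosts would need tuning against real data.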