
Find a good way to represent multi-lingual text fields in Elastic
Closed, Resolved · Public

Description

In order to store Wikibase terms, such as labels and aliases, in Elastic, we need to find a good way to represent multi-lingual values.

That representation has to support language fallback: If de-ch falls back to de, then searching for de-ch:Haus should also find de:Haus (and possibly also vice versa).

Note that the term representation in Elastic is not merely intended as a search index, but also for retrieving all labels/descriptions for a given subject.

Implementation ideas:

Language fallback support can be achieved using index expansion (indexing de:Haus also as de-ch:Haus) or query expansion (a search for de-ch:Haus turns into a search for de-ch:Haus or de:Haus). Index expansion requires more space, and query expansion requires more time.

A compromise could be a multi-value "all languages" field in addition to the per-language fields. This would make it possible to implement language fallback programmatically, without greatly increasing storage size and schema complexity.

For instance: If there is only an English label, all languages fall back to English, and we have 100 languages configured, index expansion would store the English label 100 times. The all-languages approach would store it twice.

However, the all-languages approach needs two queries (one for the exact match, and one against the all-languages field), and the second can potentially have a large result set to process. Simple query expansion also rarely needs more than two queries. On the other hand, the all-languages field provides a cheap way to get all labels in all languages.
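
For illustration, query expansion for de-ch could be a single boolean query over per-language fields, roughly like this (a sketch; the labels.* field names are placeholders, not an agreed schema):

{
  "query": {
    "bool": {
      "should": [
        { "match": { "labels.de_ch": { "query": "Haus", "boost": 2.0 } } },
        { "match": { "labels.de":    { "query": "Haus", "boost": 1.0 } } }
      ],
      "minimum_should_match": 1
    }
  }
}

The boosts encode the fallback order: a direct de-ch match should outrank a match found via fallback.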


Use case 1: Find entities of a specific type that have a label or alias that fits some input as a completion match (prefix match) in a given language or one of the associated fallback languages. With the result, provide the description of the matched entities in the given language (or one of the fallback languages). If fallback applies, also report back the actual language of the description and label or alias. The result should be ranked by relevance, based on the entity's weight and the quality of the match.

Use case 2: Get the label and description of a given entity in a given language (or one of the fallback languages). If fallback applies, also report back the actual language of the description and label or alias.

Use case 3: Get a set of entities (possibly filtered by entity type) that match (full text, anywhere) some user input in a given language. Several fields should be considered, including statement values (with low weight, except for external ids, which should have high weight), site links (with high weight, and an extra boost if they match the language), labels and aliases (with high weight, and an extra boost if they match the language), and descriptions (with low weight).
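
For illustration, the use case 3 weighting might translate into a boolean query along these lines (a sketch; all field names and boost values are placeholders):

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "entity_type": "item" } }
      ],
      "should": [
        { "match": { "labels.de":        { "query": "input", "boost": 5.0 } } },
        { "match": { "labels_all":       { "query": "input", "boost": 3.0 } } },
        { "match": { "sitelinks":        { "query": "input", "boost": 3.0 } } },
        { "match": { "external_ids":     { "query": "input", "boost": 5.0 } } },
        { "match": { "statement_values": { "query": "input", "boost": 0.5 } } },
        { "match": { "descriptions.de":  { "query": "input", "boost": 0.5 } } }
      ],
      "minimum_should_match": 1
    }
  }
}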

Related Objects

Status      Assigned
Resolved    Wikidata-bugs
Open        None
Resolved    aude
Resolved    Smalyshev
Resolved    aude
Resolved    None
Invalid     None
Resolved    Smalyshev
Resolved    Lydia_Pintscher
Duplicate   Smalyshev
Duplicate   None
Declined    None
Declined    None
Resolved    daniel
Resolved    Lydia_Pintscher
Open        None
Declined    None
Resolved    Smalyshev
Resolved    Smalyshev
Declined    None
Resolved    Smalyshev
Resolved    Smalyshev
Resolved    dcausse
Resolved    dcausse
Resolved    Smalyshev
Resolved    Smalyshev
Resolved    Smalyshev
Resolved    Smalyshev
Resolved    dcausse

Event Timeline

There's also a concern that different languages may need different analyzers: things such as case folding and diacritics folding may work very differently for Russian, Turkish, English, and French, and word boundaries work differently in Japanese/Chinese. This makes putting many languages into one field and searching there tricky, if we want to use the facilities that Elasticsearch provides, such as analyzers/tokenizers. Without them, one would have to match the exact spelling of the term, with all capitalization, diacritics, etc., which may not produce the desired results.

Grouping similar languages may be OK, but one has to note that being in the fallback chain does not necessarily mean the same rules apply: French may fall back to English but have different rules for folding.

@Smalyshev What about one field per language, then? Is that feasible? We would match one field OR a second OR a third... I think the max size of a fallback chain is about 10 languages (for all the Chinese variants falling back on each other).

There are tons of possibilities, and the solution highly depends on the use cases you'd like to support. I think more precise examples would definitely help.

Note that the term representation in Elastic is not merely intended as a search index, but also for retrieving all labels/descriptions for a given subject.

@daniel can you elaborate on this point?

As pointed out by Stas, index expansion might not be doable if we plan to leverage language-specific analyzers: it's not possible to mix different index analyzers on the same field.

Going with a one-field-per-language approach is certainly doable. All the use cases you'd like to support are still unclear to me, but the following setup could maybe work:

  • a plain all field with ICU tokenization to support exact matches; all languages could be merged into this single field. We would have to verify that term collisions between languages do not cause too much trouble.
  • a field per language with language-aware analyzers (stemming support where available). Using copy_to, this content can automatically be copied to the plain all field.

The input doc would look like:

{
  labels: {
    en: ["This entity", "Entity"],
    fr: ["Cette entité"],
    ...
  }
}
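
A minimal mapping sketch to go with that document shape, in Elasticsearch 5.x syntax (icu_plain is a made-up custom analyzer name; english and french are the built-in language analyzers):

{
  "mappings": {
    "entity": {
      "properties": {
        "labels_all": { "type": "text", "analyzer": "icu_plain" },
        "labels": {
          "properties": {
            "en": { "type": "text", "analyzer": "english", "copy_to": "labels_all" },
            "fr": { "type": "text", "analyzer": "french", "copy_to": "labels_all" }
          }
        }
      }
    }
  }
}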

A query (for a Swiss German user) would look like this (assuming you always want to fall back to English stems; remove labels.en if you're OK with falling back only on exact matches):

labels_all:query^0.5 OR labels.de_ch:query^2 OR labels.de:query^1 OR labels.en:query^0.5

Concerning perf it's hard to tell, but we recently switched to a per-field builder (we query 14 fields) on the top 10 wikis and it seems to be OK so far.

ICU tokenization is important here, as it's extremely convenient: it tokenizes text by first detecting the script and then applying custom tokenization per script, e.g. if it detects traditional Chinese script it switches to a dictionary-based tokenizer. The drawback is that it can split words written using mixed scripts (e.g. ßeta => ["ß", "eta"]).
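
For reference, an ICU analysis chain like the icu_plain analyzer assumed in the mapping sketch above could be declared as follows (a sketch; requires the analysis-icu plugin):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_plain": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding"]
        }
      }
    }
  }
}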

@aude could you add a link to the experiment you started? I remember that it was going in the right direction.

Overall, given all the possibilities and the language diversity, it's really hard to anticipate; I'd thus suggest investing more time in experimenting with various techniques.

@dcausse I added use cases to the ticket description

@dcausse I added use cases to the ticket description

Thanks!

I think we need to distinguish 2 very different search use cases:

  1. Autocomplete

Looking at the current behavior it seems that you display exact matches first and then prefix matches.
It means that the prefix lookup must include a full-match lookup, e.g. typing li will display

  • all entities with labels or aliases that perfectly match li, e.g. Lithium with alias "Li"
  • all entities with labels or aliases that start with li, e.g. Life

In addition to a prefix field, you need an untokenized field in order to promote exact matches first.
Since prefix and full-match fields do not require fancy language features (no tokenization required), do you think it's still important to break by language?
Breaking by language would only be needed for ranking: when 2 entities are ambiguous, always prefer the match that comes from a language field close to the user language.
It can become rather complex, since we have two competing matches: assuming I'm French, would I prefer an exact match in English or a prefix match in French?
Do we have enough ambiguities to really care about that? Would a simple solution where we merge all languages into the same field be sufficient?
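
To make the above concrete, the untokenized plus prefix combination could look like this in the mapping (a sketch; prefix_ngram stands for an edge-ngram-style analyzer like the one discussed further down, and all names are placeholders):

{
  "labels_all": {
    "type": "text",
    "analyzer": "icu_plain",
    "fields": {
      "exact": { "type": "keyword" },
      "prefix": { "type": "text", "analyzer": "prefix_ngram", "search_analyzer": "keyword" }
    }
  }
}

The search_analyzer keeps the user's input from being ngrammed at query time.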

  2. Fulltext

My previous comment was targeting this particular use case.

Use case 2: Get the label and description of a given entity in a given language (or one of the fallback languages). If fallback applies, also report back the actual language of the description and label or alias.

To clarify: when you say of a given entity, is the input an entity ID or a search string?
If we refer to a search string, then it's a matter of highlighting and displaying the part of the entity that best fits the search query.

To clarify: when you say of a given entity, is the input an entity ID or a search string?

Yes. Ideally, a list of potentially many entity IDs. We will need to know which result corresponds to which ID.

If we refer to a search string, then it's a matter of highlighting and displaying the part of the entity that best fits the search query.

Use case 2 is not about search; highlighting etc. is not required (that would be nice for use case 3). The input is not user-generated. Use case 2 is about using Elastic for batch lookup of the labels and descriptions of potentially many entities at once. Use case 2 is not an absolute requirement - we are looking for an alternative to our current SQL-based solution for batch term lookup. Elastic seems like a good option to explore, but we are not committed to it.
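
A batch lookup like that would not even need scoring; the multi-get API could return just the relevant fields (a sketch; the index/type names, source fields, and example IDs are assumptions):

POST /wikidata/entity/_mget
{
  "docs": [
    { "_id": "Q42", "_source": ["labels.de", "labels.en", "descriptions.de", "descriptions.en"] },
    { "_id": "Q64", "_source": ["labels.de", "labels.en", "descriptions.de", "descriptions.en"] }
  ]
}

_mget returns results in request order, so mapping each result back to its entity ID is straightforward.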

@dcausse I added use cases to the ticket description

  1. Autocomplete

Looking at the current behavior it seems that you display exact matches first and then prefix matches.

We actually do up to four queries at the moment, until we have found enough matches to fill the desired limit:

  1. full length case insensitive match, user language only
  2. full length case insensitive match, fallback languages
  3. prefix match, user language only
  4. prefix match, fallback languages

We currently rank by a crude heuristic score: max( |sitelinks|, |labels| ).
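
If that heuristic were carried over to Elastic, it could be expressed as a function_score script, e.g. (a sketch; sitelink_count and label_count are hypothetical numeric fields, and the inner query is just an example):

{
  "query": {
    "function_score": {
      "query": { "match": { "labels_all.prefix": "li" } },
      "script_score": {
        "script": { "inline": "Math.max(doc['sitelink_count'].value, doc['label_count'].value)" }
      },
      "boost_mode": "replace"
    }
  }
}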

In addition to a prefix field, you need an untokenized field in order to promote exact matches first.

Doesn't prefix also require untokenized?

Since prefix and full-match fields do not require fancy language features (no tokenization required), do you think it's still important to break by language?

Yes, we want to ignore, or at least strongly demote, languages that the user is not known to speak.

Breaking by language would only be needed for ranking: when 2 entities are ambiguous, always prefer the match that comes from a language field close to the user language.

Indeed

It can become rather complex, since we have two competing matches: assuming I'm French, would I prefer an exact match in English or a prefix match in French?

See the algorithm described above.

Do we have enough ambiguities to really care about that? Would a simple solution where we merge all languages into the same field be sufficient?

I do not think it would be sufficient. I think the results would often get swamped with matches that are irrelevant to the user and, worse, impossible to read and interpret, especially for short prefixes like "li".

However, I have no research to support this, and I don't know how we would conduct such research. It boils down to a product-level UX choice, so this is something to ask @Lydia_Pintscher and @Jan_Dittrich about.

Ah, a note about priorities: use case one (completion match) is by far the most pressing need for us. It puts massive load on the DB server, and it's triggered several times whenever a user uses a search field.

In addition to a prefix field, you need an untokenized field in order to promote exact matches first.

Doesn't prefix also require untokenized?

Scarily, no, that's not how it works. Instead, the tokenization that happens and gets pushed into the posting lists is:

curl -XPOST http://search.svc.eqiad.wmnet:9200/enwiki_content/page/_mtermvectors -d '{
        "docs": [
                {
                        "doc": { "title": "example of prefix tokenization" },
                        "fields": ["title.prefix"],
                        "positions": false,
                        "offsets": false,
                        "term_statistics": false,
                        "field_statistics": false
                }
        ]
}' | jq '.docs[0].term_vectors."title.prefix".terms | to_entries | map(.key)' 

[
  "e",
  "ex",
  "exa",
  "exam",
  "examp",
  "exampl",
  "example",
  "example ",
  "example o",
  "example of",
  "example of ",
  "example of p",
  "example of pr",
  "example of pre",
  "example of pref",
  "example of prefi",
  "example of prefix",
  "example of prefix ",
  "example of prefix t",
  "example of prefix to",
  "example of prefix tok",
  "example of prefix toke",
  "example of prefix token",
  "example of prefix tokeni",
  "example of prefix tokeniz",
  "example of prefix tokeniza",
  "example of prefix tokenizat",
  "example of prefix tokenizati",
  "example of prefix tokenizatio",
  "example of prefix tokenization"
]
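
That token list is what an edge n-gram over the whole, untokenized string produces; an analyzer of roughly this shape would generate it (a sketch, not necessarily CirrusSearch's actual configuration):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "title_prefix": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "prefix_edge_ngram"]
        }
      },
      "filter": {
        "prefix_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 255
        }
      }
    }
  }
}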

@Lydia_Pintscher How does this rank on your road map? Getting this done would be really helpful - in particular, it would prevent the terms table from blowing up in our face one fine Friday evening...

Yeah there are a few other things that need the move to Elastic (for example better ranking of suggestions). So if we can move this forward with WMF help that'd be awesome.

quick draft of a working session with @Smalyshev

(only addresses completion search for now)

@aude would it be ok if I continued this from here?

While reading the Elastic 5 breaking-changes notes, I realized that they've added a hard limit on the number of fields in the mapping. The limit is 1000 by default. This limit can be increased by changing the config, but we might still want to think of an alternative here just in case.
The idea would be to move the language bits to a lower level, into the content directly:
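
For example, something along these lines, where the language becomes data rather than part of the field name (a sketch; the exact shape is an assumption, to be worked out):

{
  "labels": [
    { "language": "en", "value": "This entity" },
    { "language": "fr", "value": "Cette entité" }
  ]
}

Indexed as a nested field, language and value could still be matched together in one query.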


It's unclear to me what would be the best approach here.
The advantage of language-specific fields is that we are able to tweak the analysis chain; for completion search, the only thing I can think of is tuning the list of diacritics we want to fold, e.g. do not fold ö to o for Finnish - though because of the fallbacks I'm not sure it makes sense anyway.
The advantage of non-specific fields is that we do not have to change the mapping when we add a new language; everything is data.

Concerning fulltext it's unclear yet, but the idea would be to create language-specific fields only for languages where we know we have a "good analysis" chain. The details of fulltext ranking with regard to language are still unclear to me, though.

Current plan, as agreed in the meeting:

  • Create a mock schema with language fields for current Wikidata
  • Create a script to quickly produce data for this schema from JSON-dumped entities
  • Load it into the DB cluster and see whether it causes any issues and whether we can search it efficiently

If the test works, we go with per-language fields; otherwise we fall back to a single field.

So far per-language fields seem to work fine, so I think we can proceed with this scheme.

Is this affecting users on Wikidata, or is it infrastructure to build towards that? It's hard for me to tell from the task description and comments. I'd like to know whether to include it in the Discovery weekly update.

@Deskana it is not affecting the users immediately. This particular ticket is just about finding the right format. Then we have to implement it (ongoing), deploy it, test it, figure out proper scoring, and turn it on as a replacement for the current search - then we can announce it. Right now it's no more than "now we have a good idea how we want to do this thing, and we're proceeding to do it".

@Deskana it is not affecting the users immediately. This particular ticket is just about finding the right format. Then we have to implement it (ongoing), deploy it, test it, figure out proper scoring, and turn it on as a replacement for the current search - then we can announce it. Right now it's no more than "now we have a good idea how we want to do this thing, and we're proceeding to do it".

Understood. I left it out of the Discovery weekly update. Thanks!