
Optimize the WikibaseCirrusSearch elasticsearch mapping and filter query for non-english users
Closed, Resolved · Public · 3 Estimated Story Points

Description

The way WikibaseCirrusSearch generates its Elasticsearch mapping is sub-optimal for non-English users.
To deal with the multilingual nature of Wikidata content, WikibaseCirrusSearch creates one subfield per language under the labels field.
Having a subfield per language is costly in terms of Elasticsearch resources but allows a great level of customization. Sadly, WikibaseCirrusSearch does not take full advantage of it.

As of today the mapping for the labels.ko field is:

"labels": {
    "properties": {
        "ko": {
            "type": "text",
            "fields": {
                "near_match": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match"
                },
                "near_match_folded": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match_asciifolding"
                },
                "plain": {
                    "type": "text",
                    "analyzer": "ko_plain",
                    "search_analyzer": "ko_plain_search",
                    "similarity": "bm25",
                    "position_increment_gap": 10
                },
                "prefix": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "prefix_asciifolding",
                    "search_analyzer": "near_match_asciifolding"
                }
            },
            "copy_to": [
                "labels_all"
            ]
        }
    }
}

This indexes a field named labels.ko using the Elasticsearch default text analyzer.

The mapping should look like:

"labels": {
    "properties": {
        "ko": {
            "type": "text",
            "fields": {
                "near_match": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match"
                },
                "near_match_folded": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match_asciifolding"
                },
                "plain": {
                    "type": "text",
                    "analyzer": "ko_plain",
                    "search_analyzer": "ko_plain_search",
                    "similarity": "bm25",
                    "position_increment_gap": 10
                },
                "prefix": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "prefix_asciifolding",
                    "search_analyzer": "near_match_asciifolding"
                }
            },
            "copy_to": [
                "labels_all"
            ],
            "analyzer": "ko_text",
            "search_analyzer": "ko_text_search",
            "similarity": "bm25",
            "position_increment_gap": 10
        }
    }
}
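
As a rough illustration of the difference between the two mappings, the sketch below derives the proposed top-level settings from an existing labels.<lang> mapping. The helper name with_language_analysis is hypothetical, and the "<lang>_text" / "<lang>_text_search" naming is only inferred from the ko example above; it is not taken from the WikibaseCirrusSearch code.

```python
import copy


def with_language_analysis(field_mapping, lang):
    """Return a copy of a labels.<lang> mapping with language-specific
    analysis configured on the top-level field (hypothetical helper)."""
    updated = copy.deepcopy(field_mapping)
    # An explicit "index": false would keep the top-level field unsearchable,
    # so drop it if the input mapping carries one.
    updated.pop("index", None)
    updated.update({
        # Analyzer names assume the "<lang>_text" pattern from the ko example.
        "analyzer": f"{lang}_text",
        "search_analyzer": f"{lang}_text_search",
        "similarity": "bm25",
        "position_increment_gap": 10,
    })
    return updated


current = {
    "type": "text",
    "fields": {"plain": {"type": "text", "analyzer": "ko_plain"}},
    "copy_to": ["labels_all"],
}
proposed = with_language_analysis(current, "ko")
print(proposed["analyzer"], proposed["search_analyzer"])
```

The existing subfields and copy_to are left untouched; only the top-level field gains analysis settings.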

Then, when searching in Korean, the filter should also query the labels.ko field and its language fallbacks, as is already done for the descriptions.$lang field.
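
A minimal sketch of what "also query labels.$lang and its fallbacks" could look like, assuming the filter is a single multi_match clause. The function names, the fallback chain, and the set of stemmed languages are illustrative assumptions; the real query builder in WikibaseCirrusSearch is structured differently.

```python
def stemmed_label_fields(lang, fallbacks, stemmed_languages):
    """List labels.<code> fields for the language and its fallbacks,
    keeping only languages that have language-specific analysis."""
    fields = []
    for code in [lang, *fallbacks]:
        field = f"labels.{code}"
        if code in stemmed_languages and field not in fields:
            fields.append(field)
    return fields


def label_filter(term, lang, fallbacks, stemmed_languages):
    """Build a multi_match clause over the stemmed label fields plus
    labels_all.plain (field choice is an assumption for this sketch)."""
    fields = stemmed_label_fields(lang, fallbacks, stemmed_languages)
    fields.append("labels_all.plain")
    return {"multi_match": {"query": term, "fields": fields, "operator": "and"}}


# Hypothetical inputs: a fallback chain of ko -> en and three stemmed languages.
query = label_filter("가마우지", "ko", ["en"], {"ko", "en", "fr"})
print(query["multi_match"]["fields"])
```

Languages without language-specific analysis are skipped, so they would still be reachable only through labels_all.plain in this sketch.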

AC:

  • Searching for a word in a label should use the language-specific analyzers: e.g. searching for a Korean word that is part of a label tagged as Korean should yield search results (TODO: add a specific example here)

Event Timeline

Restricted Application added subscribers: revi, Aklapper.

As of today the mapping for the labels.ko field is:
...
This indexes a field named labels.ko using the Elasticsearch default text analyzer.

After looking into it, we actually don't. I think the HTML formatter for MediaWiki API responses is stripping out the false values. With the JSON formatter we can see that index is set to false; when the same API is accessed from a browser without the JSON format, the "index": false is not included.

We can still add it, but it's worth noting that this will be yet another indexed field added to the Wikibase mappings.

$ curl -s 'https://www.wikidata.org/w/api.php?action=cirrus-mapping-dump&format=json&formatversion=2' | jq .content.properties.labels.properties.ko
{
  "type": "text",
  "index": false,
  "fields": {
    "near_match": {
      "type": "text",
      "index_options": "docs",
      "norms": false,
      "analyzer": "near_match"
    },
    "near_match_folded": {
      "type": "text",
      "index_options": "docs",
      "norms": false,
      "analyzer": "near_match_asciifolding"
    },
    "plain": {
      "type": "text",
      "analyzer": "ko_plain",
      "search_analyzer": "ko_plain_search",
      "similarity": "bm25",
      "position_increment_gap": 10
    },
    "prefix": {
      "type": "text",
      "index_options": "docs",
      "norms": false,
      "analyzer": "prefix_asciifolding",
      "search_analyzer": "near_match_asciifolding"
    }
  },
  "copy_to": [
    "labels_all"
  ]
}
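
To check this across all languages rather than just ko, a small helper along these lines could scan the labels properties from the JSON dump. The function name is hypothetical and the sample data is a trimmed stand-in, not real output; the only Elasticsearch behaviour relied on is that a field without an explicit "index" key is indexed by default.

```python
def unindexed_label_languages(label_properties):
    """Return the language codes whose top-level labels.<lang> field
    is explicitly marked "index": false."""
    return sorted(
        lang
        for lang, mapping in label_properties.items()
        # A missing "index" key means the field is indexed (Elasticsearch default).
        if mapping.get("index", True) is False
    )


# Trimmed, made-up stand-in for .content.properties.labels.properties
sample = {
    "ko": {"type": "text", "index": False, "copy_to": ["labels_all"]},
    "en": {"type": "text", "index": False, "copy_to": ["labels_all"]},
    "xx": {"type": "text", "analyzer": "xx_text"},
}
print(unindexed_label_languages(sample))
```

Fed with the parsed output of the curl command above (for example via json.load), this would list the languages whose top-level label field is stored in the mapping but not searchable.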

Thanks for checking! I'm tempted to decline: the purpose of this ticket was to fix a missed opportunity, and adding yet another 500+ indexed fields is not something we could do without evaluating the impact. There might be other, less costly things we could do to improve recall in languages other than English (e.g. use the ICU tokenizer for the labels_all.plain field?).

Pinging @TJones for advice on this matter.

This wouldn't actually be 500 fields: at least for descriptions, this is only indexed for languages in our stemmed languages list. It would add approximately 40 new indexed fields.

I really don't want to decline this ticket, because the current situation is very bad for spaceless languages and worse than it needs to be for any language with language-specific analysis.

I'd argue that it's more important for labels to have analysis than descriptions.

  • The original Korean example talked about cormorants (which are birds). In English, if I search for cormorant, among the results are many birds—Great Cormorant, Double-crested Cormorant, Little Cormorant, Japanese Cormorant, Pelagic Cormorant—and these all have the same description: "species of bird". What's the use of being able to search species of birds (plural) in the description but not search cormorant in the label? The labels are so much more useful content, so language analysis there is more useful, too.

Can we add the analysis to labels and remove it from descriptions to keep down the number of indexed fields?

Alternatively, we could push this out to next quarter or later and spend more time assessing whether we could afford to index both label and description with language-specific analysis.

(e.g. use the icu tokenizer for the labels_all.plain field?).

The ICU tokenizer would help with some languages for sure, but it specifically doesn't do anything useful for Korean—it splits Hangul on spaces, punctuation, etc, but doesn't break it into words at all.

So... some options (in order of my preference, which may not be everyone else's):

  1. Investigate whether we can afford to have language-specific analysis for descriptions and labels
  2. Swap language-specific analysis from descriptions to labels
  3. Enable ICU tokenization for labels (or descriptions if we go with the previous option)
  4. Maintain the status quo

This wouldn't actually be 500 fields, at least for descriptions this is only indexed for fields in our stemmed languages list. It would add approximately 40 new indexed fields.

Oh, I vote harder for my #1 option, then!

If it's 40 fields then #1 for me too :)

Change 864849 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Index and query stemmed labels in supported languages

https://gerrit.wikimedia.org/r/864849

Change 876024 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Query stemmed labels in supported languages

https://gerrit.wikimedia.org/r/876024

Change 864849 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Index stemmed labels in supported languages

https://gerrit.wikimedia.org/r/864849

This is now waiting for deployment of the indexing patch and subsequent reindexing of wikibase wikis before the final patch for querying the stemmed labels can be merged.

This is now deployed to all prod wikibase instances

Change 876024 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Query stemmed labels in supported languages

https://gerrit.wikimedia.org/r/876024

This doesn't seem to be working. @dcausse, can you check the config for labels.ko to see if it looks correct?

The original search term we were working with was 가마우지 ("cormorant", a type of bird). If I search in English or Korean, I'm only getting three matches where the "label" or "also known as" field has 가마우지 as a separate word.

I would expect matches from these as well (I double checked offline that the Korean analysis should segment them correctly):

Q        label          ko analysis
Q25440   민물가마우지   민물 + 가마우지
Q727214  바다가마우지   바다 + 가마우지

(There are lots of other cormorants without Korean labels.)
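
To make the expectation concrete, here is a toy comparison of whitespace-only tokenization against the segmentation listed above. The segmented tokens are copied from the analysis above rather than computed, and exact token matching is a simplification of what Elasticsearch does.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as an analyzer without Korean
    word segmentation effectively does for Hangul."""
    return text.split()


def term_matches(term, tokens):
    """A term matches only if it equals one of the indexed tokens."""
    return term in tokens


label = "민물가마우지"
# Segmentation copied from the analysis listed above, not computed here.
segmented = ["민물", "가마우지"]

print(term_matches("가마우지", whitespace_tokens(label)))  # False
print(term_matches("가마우지", segmented))                 # True
```

With whitespace-only tokens the compound label is a single token, so a search for 가마우지 cannot match it; with the segmented tokens it can.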

I also checked 세계 유산 and 세계유산 ("world heritage"—as in "UNESCO world heritage site"—with and without spaces). They get the same number of results (with different ranking) on Korean Wikipedia, but totally different numbers of results searching in Korean on Wikidata (29 for two words, 364 for one word).

Either it's not working, or I'm doing something really wrong. I tested searching in French vs English to double check and things went as I expected, so I don't think it's me.

Change 897966 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseCirrusSearch@master] fulltext: include stemmed label field in the filter

https://gerrit.wikimedia.org/r/897966

@TJones indeed, I think the query should explicitly add labels.ko to the filter; it currently seems to add only a scoring clause. I pushed a small patch to change how the filter is constructed.
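
The distinction between a filter clause and a scoring clause can be sketched with a toy model of a bool query: a document is returned only if it passes the filter, and scoring clauses merely rank documents that already passed. The field lists below are assumptions for illustration, and the "score" is a simple count, not BM25.

```python
def evaluate(matching_fields, filter_fields, scoring_fields):
    """Toy bool query: return None if the document is filtered out,
    otherwise a score equal to the number of matching scoring fields."""
    if not matching_fields & set(filter_fields):
        return None  # excluded by the filter; scoring clauses cannot rescue it
    return len(matching_fields & set(scoring_fields))


# A document whose only match for the term is in the stemmed Korean label.
doc = {"labels.ko"}

# Assumed filter before the patch: labels.ko is used for scoring only.
before = evaluate(doc, ["labels_all.plain", "descriptions.ko"], ["labels.ko"])
# Assumed filter after the patch: labels.ko is also part of the filter.
after = evaluate(doc, ["labels_all.plain", "descriptions.ko", "labels.ko"], ["labels.ko"])

print(before, after)  # None 1
```

This mirrors the symptom reported above: labels such as 민물가마우지 could match only through labels.ko, so they stayed invisible while that field was absent from the filter.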

Change 897966 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] fulltext: include stemmed label field in the filter

https://gerrit.wikimedia.org/r/897966

Change 898680 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseCirrusSearch@master] Do not dispatch to query builders that are for non-local entity types

https://gerrit.wikimedia.org/r/898680

Change 898680 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Do not dispatch to query builders that are for non-local entity types

https://gerrit.wikimedia.org/r/898680

@TJones can you take a look again now? I see a number of results for a Korean search for 가마우지, but I'm not familiar with what it was returning before and can't be certain this is correct.

@EBernhardson, it is working! I also verified some of the original examples from the Wikidata discussion page.