Optimize the WikibaseCirrusSearch elasticsearch mapping and filter query for non-english users
Closed, Resolved · Public · 3 Estimated Story Points

Assigned To

Authored By

	dcausse
	Nov 22 2022, 5:11 PM

Description

The way WikibaseCirrusSearch generates its Elasticsearch mapping is sub-optimal for non-English users.
To deal with the multilingual nature of Wikidata content, WikibaseCirrusSearch creates one field per language under the labels field.
Having a subfield per language is costly in terms of Elasticsearch resources but allows a great deal of customization. Sadly, WikibaseCirrusSearch does not take full advantage of it.

As of today the mapping for the labels.ko field is:

"labels": {
    "properties": {
        "ko": {
            "type": "text",
            "fields": {
                "near_match": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match"
                },
                "near_match_folded": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match_asciifolding"
                },
                "plain": {
                    "type": "text",
                    "analyzer": "ko_plain",
                    "search_analyzer": "ko_plain_search",
                    "similarity": "bm25",
                    "position_increment_gap": 10
                },
                "prefix": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "prefix_asciifolding",
                    "search_analyzer": "near_match_asciifolding"
                }
            },
            "copy_to": [
                "labels_all"
            ]
        }
    }
}

This indexes a field named labels.ko using the Elasticsearch default text analyzer.

The mapping should look like:

"labels": {
    "properties": {
        "ko": {
            "type": "text",
            "fields": {
                "near_match": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match"
                },
                "near_match_folded": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "near_match_asciifolding"
                },
                "plain": {
                    "type": "text",
                    "analyzer": "ko_plain",
                    "search_analyzer": "ko_plain_search",
                    "similarity": "bm25",
                    "position_increment_gap": 10
                },
                "prefix": {
                    "type": "text",
                    "index_options": "docs",
                    "analyzer": "prefix_asciifolding",
                    "search_analyzer": "near_match_asciifolding"
                }
            },
            "copy_to": [
                "labels_all"
            ],
            "analyzer": "ko_text",
            "search_analyzer": "ko_text_search",
            "similarity": "bm25",
            "position_increment_gap": 10
        }
    }
}
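
The difference between the two mappings can be read as a per-language rule applied when the mapping is generated. What follows is only a rough Python sketch of that rule, not the extension's code (which is PHP). `ANALYZED_LANGUAGES` and `label_field_mapping` are hypothetical names, and the `{lang}_text` / `{lang}_text_search` naming is assumed from the `ko_text` example above.

```python
# Hypothetical set of languages that have a language-specific analysis chain.
ANALYZED_LANGUAGES = {"ko", "fr", "en"}


def label_field_mapping(lang, subfields):
    """Build the mapping for labels.<lang>, given its existing subfields."""
    field = {
        "type": "text",
        "fields": subfields,
        "copy_to": ["labels_all"],
    }
    if lang in ANALYZED_LANGUAGES:
        # Proposed addition: analyze the top-level field with the
        # language-specific chain instead of leaving it with defaults.
        field.update({
            "analyzer": f"{lang}_text",
            "search_analyzer": f"{lang}_text_search",
            "similarity": "bm25",
            "position_increment_gap": 10,
        })
    return field


ko_field = label_field_mapping("ko", {"plain": {"type": "text", "analyzer": "ko_plain"}})
print(ko_field["analyzer"], ko_field["search_analyzer"])
```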

Then, when searching in Korean, the filter should be adapted to also query the labels.ko field and its language fallbacks, as is already done for the descriptions.$lang field.
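
As an illustration only, a filter over the user language and its fallbacks might be assembled like this. `label_filter` is a hypothetical helper, the fallback chain is passed in by the caller (the real chain comes from MediaWiki's language fallback configuration), and the `bool`/`match` structure is standard Elasticsearch query DSL, not necessarily what the extension emits.

```python
def label_filter(term, lang, fallbacks):
    """Return a bool filter matching the term in labels.<lang> or any fallback."""
    fields = [f"labels.{code}" for code in [lang, *fallbacks]]
    return {
        "bool": {
            # One match clause per language field; every query term must
            # appear in the field for that clause to match.
            "should": [
                {"match": {field: {"query": term, "operator": "and"}}}
                for field in fields
            ],
            # At least one of the language fields has to match.
            "minimum_should_match": 1,
        }
    }


# "en" as the fallback for "ko" is an assumption made for this example.
query = label_filter("가마우지", "ko", ["en"])
print([list(clause["match"])[0] for clause in query["bool"]["should"]])
```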

AC:

Searching for a word in a label should use the language-specific analyzers: e.g. searching for a Korean word that is part of a label tagged as Korean should yield search results (TODO: add a specific example here)

Details

Subject	Repo	Branch	Lines +/-
Do not dispatch to query builders that are for non-local entity types	mediawiki/extensions/WikibaseCirrusSearch	master	+7 -0
fulltext: include stemmed label field in the filter	mediawiki/extensions/WikibaseCirrusSearch	master	+59 -0
Query stemmed labels in supported languages	mediawiki/extensions/WikibaseCirrusSearch	master	+108 -57
Index stemmed labels in supported languages	mediawiki/extensions/WikibaseCirrusSearch	master	+36 -16


Related Objects

Mentioned In: T147505: [tracking] CirrusSearch: what is updated during re-indexing

Event Timeline

dcausse created this task. Nov 22 2022, 5:11 PM

Restricted Application added a project: Discovery-Search. Nov 22 2022, 5:11 PM

Restricted Application added subscribers: revi, Aklapper.

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. Nov 28 2022, 4:35 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

MPhamWMF set the point value for this task to 3. Nov 28 2022, 4:49 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

EBernhardson claimed this task. Dec 2 2022, 6:39 PM

EBernhardson moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

As of today the mapping for the labels.ko field is:
...
This does index a field name labels.ko using the elasticsearch default text analyzer.

After looking into it, we actually don't. I think what's happening is that the HTML formatter for MediaWiki API responses strips out the false values. If I check using the JSON formatter, we can see that index is set to false; when accessing the same API from a browser without the JSON formatting, the "index": false is not included.

We can still add it, but worth noting this will be yet another indexed field added to wikibase mappings.

$ curl -s 'https://www.wikidata.org/w/api.php?action=cirrus-mapping-dump&format=json&formatversion=2' | jq .content.properties.labels.properties.ko

{
  "type": "text",
  "index": false,
  "fields": {
    "near_match": {
      "type": "text",
      "index_options": "docs",
      "norms": false,
      "analyzer": "near_match"
    },
    "near_match_folded": {
      "type": "text",
      "index_options": "docs",
      "norms": false,
      "analyzer": "near_match_asciifolding"
    },
    "plain": {
      "type": "text",
      "analyzer": "ko_plain",
      "search_analyzer": "ko_plain_search",
      "similarity": "bm25",
      "position_increment_gap": 10
    },
    "prefix": {
      "type": "text",
      "index_options": "docs",
      "norms": false,
      "analyzer": "prefix_asciifolding",
      "search_analyzer": "near_match_asciifolding"
    }
  },
  "copy_to": [
    "labels_all"
  ]
}
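
The same check can be run over every language at once. The sketch below assumes the dump has the shape implied by the jq path above (`content.properties.labels.properties`). `label_index_status` is a hypothetical helper name, and the sample dump is made up for the example. A missing `index` key is treated as true, which is the Elasticsearch default for that mapping parameter.

```python
def label_index_status(mapping_dump):
    """Map each label language to whether its top-level field is indexed."""
    labels = mapping_dump["content"]["properties"]["labels"]["properties"]
    # Elasticsearch defaults "index" to true when the key is absent.
    return {lang: field.get("index", True) for lang, field in labels.items()}


# Made-up miniature dump; in practice, load the JSON returned by the
# cirrus-mapping-dump request shown above with json.load().
sample_dump = {
    "content": {
        "properties": {
            "labels": {
                "properties": {
                    "ko": {"type": "text", "index": False},
                    "fr": {"type": "text", "analyzer": "fr_text"},
                }
            }
        }
    }
}

status = label_index_status(sample_dump)
print(sorted(lang for lang, indexed in status.items() if not indexed))
```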

In T323628#8440025, @EBernhardson wrote:

As of today the mapping for the labels.ko field is:
...
This does index a field name labels.ko using the elasticsearch default text analyzer.

After looking into it, we actually don't. I think what's happening is that the HTML formatter for MediaWiki API responses strips out the false values. If I check using the JSON formatter, we can see that index is set to false; when accessing the same API from a browser without the JSON formatting, the "index": false is not included.

Thanks for checking! I'm tempted to decline: the purpose of this ticket was to fix a missed opportunity, and I think that adding yet another 500+ indexed fields is not something we could do without evaluating the impact. There might be other (less costly) things we could do to improve recall in languages other than English (e.g. use the icu tokenizer for the labels_all.plain field?).

Pinging @TJones for advice on this matter.

In T323628#8441844, @dcausse wrote:

Thanks for checking! I'm tempted to decline: the purpose of this ticket was to fix a missed opportunity, and I think that adding yet another 500+ indexed fields is not something we could do without evaluating the impact. There might be other (less costly) things we could do to improve recall in languages other than English (e.g. use the icu tokenizer for the labels_all.plain field?).

Pinging @TJones for advice on this matter.

This wouldn't actually be 500 fields. At least for descriptions, the field is only indexed for languages in our stemmed languages list. It would add approximately 40 new indexed fields.

I really don't want to decline this ticket, because the current situation is very bad for spaceless languages and worse than it needs to be for any language with language-specific analysis.

I'd argue that it's more important for labels to have analysis than descriptions.

The original Korean example talked about cormorants (which are birds). In English, if I search for cormorant, among the results are many birds—Great Cormorant, Double-crested Cormorant, Little Cormorant, Japanese Cormorant, Pelagic Cormorant—and these all have the same description: "species of bird". What's the use of being able to search species of birds (plural) in the description but not search cormorant in the label? The labels are so much more useful content, so language analysis there is more useful, too.

Can we add the analysis to labels and remove it from descriptions to keep down the number of indexed fields?

Alternatively, we could push this out to next quarter or later and spend more time assessing whether we could afford to index both label and description with language-specific analysis.

(e.g. use the icu tokenizer for the labels_all.plain field?).

The ICU tokenizer would help with some languages for sure, but it specifically doesn't do anything useful for Korean—it splits Hangul on spaces, punctuation, etc, but doesn't break it into words at all.

So... some options (in order of my preference, which may not be everyone else's):

1. Investigate whether we can afford to have language-specific analysis for descriptions and labels
2. Swap language-specific analysis from descriptions to labels
3. Enable ICU tokenization for labels (or descriptions if we go with the previous option)
4. Maintain the status quo

This wouldn't actually be 500 fields, at least for descriptions this is only indexed for fields in our stemmed languages list. It would add approximately 40 new indexed fields.

Oh, I vote harder for my #1 option, then!

If it's 40 fields then #1 for me too :)

Change 864849 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Index and query stemmed labels in supported languages

https://gerrit.wikimedia.org/r/864849

gerritbot added a project: Patch-For-Review. Dec 6 2022, 4:45 PM

EBernhardson moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Dec 7 2022, 6:24 PM

Change 876024 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikibaseCirrusSearch@master] Query stemmed labels in supported languages

https://gerrit.wikimedia.org/r/876024

Change 864849 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Index stemmed labels in supported languages

https://gerrit.wikimedia.org/r/864849

ReleaseTaggerBot added a project: MW-1.40-notes (1.40.0-wmf.18; 2023-01-09). Jan 6 2023, 8:00 AM

Gehel moved this task from Needs review to In Progress on the Discovery-Search (Current work) board. Jan 9 2023, 4:05 PM

Gehel moved this task from In Progress to Needs review on the Discovery-Search (Current work) board.

This is now waiting for deployment of the indexing patch and subsequent reindexing of wikibase wikis before the final patch for querying the stemmed labels can be merged.

EBernhardson mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. Jan 9 2023, 6:06 PM

This is now deployed to all prod wikibase instances

dcausse moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Mar 6 2023, 3:49 PM

Change 876024 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Query stemmed labels in supported languages

https://gerrit.wikimedia.org/r/876024

ReleaseTaggerBot edited projects, added MW-1.40-notes (1.40.0-wmf.26; 2023-03-06); removed MW-1.40-notes (1.40.0-wmf.18; 2023-01-09). Mar 6 2023, 7:00 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 6 2023, 7:10 PM

This doesn't seem to be working. @dcausse, can you check the config for labels.ko to see if it looks correct?

The original search term we were working with was 가마우지 ("cormorant", a type of bird). If I search in English or Korean, I'm only getting three matches where the "label" or "also known as" field has 가마우지 as a separate word.

I would expect matches from these as well (I double checked offline that the Korean analysis should segment them correctly):

Q	label	ko analysis
Q25440	민물가마우지	민물 + 가마우지
Q727214	바다가마우지	바다 + 가마우지

(There are lots of other cormorants without Korean labels.)
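
To make the expectation concrete, here is a toy Python comparison of whitespace tokenization against word segmentation for the two labels in the table. The `SEGMENTATION` lookup is hard-coded from the "ko analysis" column and merely stands in for a real Korean analyzer; all function names are hypothetical.

```python
# Segmentations copied from the table above; a real analyzer would compute them.
SEGMENTATION = {
    "민물가마우지": ["민물", "가마우지"],
    "바다가마우지": ["바다", "가마우지"],
}


def whitespace_tokens(label):
    """Tokens produced when the label is split on whitespace only."""
    return label.split()


def segmented_tokens(label):
    """Tokens produced when each whitespace-delimited word is segmented further."""
    tokens = []
    for word in label.split():
        tokens.extend(SEGMENTATION.get(word, [word]))
    return tokens


def matches(term, tokens):
    """A term query matches only if the term equals one of the indexed tokens."""
    return term in tokens


for label in SEGMENTATION:
    print(
        label,
        matches("가마우지", whitespace_tokens(label)),
        matches("가마우지", segmented_tokens(label)),
    )
```

Under whitespace tokenization each compound label is a single token, so a search for 가마우지 cannot match it. With segmentation the term becomes its own token and both labels match.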

I also checked 세계 유산 and 세계유산 ("world heritage"—as in "UNESCO world heritage site"—with and without spaces). They get the same number of results (with different ranking) on Korean Wikipedia, but totally different numbers of results searching in Korean on Wikidata (29 for two words, 364 for one word).

Either it's not working, or I'm doing something really wrong. I tested searching in French vs English to double-check and things went as I expected, so I don't think it's me.

Change 897966 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseCirrusSearch@master] fulltext: include stemmed label field in the filter

https://gerrit.wikimedia.org/r/897966

gerritbot added a project: Patch-For-Review. Mar 13 2023, 7:17 PM

@TJones indeed, I think the query should explicitly add labels.ko to the filter; currently it seems to only add a scoring clause. I pushed a small patch to change how the filter is constructed.
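
A small in-memory model shows why a scoring-only clause cannot add results. In an Elasticsearch bool query that has a filter, `should` clauses only contribute to the score by default, so a document rejected by the filter never comes back. The sketch below is a simplification under stated assumptions: the field names in the filter (`labels_all.plain`), the document ids and the token lists are invented for illustration, and the filter is modelled as "any listed field contains the term".

```python
def term_in(field, term):
    """Build a clause that checks whether a field's token list contains the term."""
    return lambda doc: term in doc.get(field, [])


def search(docs, filter_clauses, should_clauses):
    """Keep docs passing any filter clause; should clauses only affect ranking."""
    hits = []
    for doc in docs:
        if not any(clause(doc) for clause in filter_clauses):
            continue  # rejected by the filter, scoring clauses are never consulted
        score = sum(1 for clause in should_clauses if clause(doc))
        hits.append((score, doc["id"]))
    return [doc_id for _, doc_id in sorted(hits, key=lambda hit: -hit[0])]


term = "가마우지"
docs = [
    # Label is the bare word: both fields contain the term.
    {"id": "A", "labels_all.plain": ["가마우지"], "labels.ko": ["가마우지"]},
    # Compound label: only the Korean-analyzed field contains the term.
    {"id": "B", "labels_all.plain": ["민물가마우지"], "labels.ko": ["민물", "가마우지"]},
]

scoring_only = search(docs, [term_in("labels_all.plain", term)], [term_in("labels.ko", term)])
in_filter = search(
    docs,
    [term_in("labels_all.plain", term), term_in("labels.ko", term)],
    [term_in("labels.ko", term)],
)
print(scoring_only, in_filter)
```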

Change 897966 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] fulltext: include stemmed label field in the filter

https://gerrit.wikimedia.org/r/897966

ReleaseTaggerBot edited projects, added MW-1.40-notes (1.40.0-wmf.27; 2023-03-13); removed MW-1.40-notes (1.40.0-wmf.26; 2023-03-06). Mar 13 2023, 8:00 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 13 2023, 8:10 PM

Change 898680 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseCirrusSearch@master] Do not dispatch to query builders that are for non-local entity types

https://gerrit.wikimedia.org/r/898680

gerritbot added a project: Patch-For-Review. Mar 14 2023, 8:38 AM

Change 898680 merged by jenkins-bot:

[mediawiki/extensions/WikibaseCirrusSearch@master] Do not dispatch to query builders that are for non-local entity types

https://gerrit.wikimedia.org/r/898680

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.1; 2023-03-20). Mar 14 2023, 4:01 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 14 2023, 4:10 PM

@TJones can you take a look again now? I see a number of results now for a Korean search for 가마우지, but I'm not familiar with what it was returning before and can't be certain this is correct.

@EBernhardson, it is working! I also verified some of the original examples from the Wikidata discussion page.

EBernhardson moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Apr 10 2023, 7:32 PM

Gehel closed this task as Resolved. Apr 20 2023, 7:21 PM
