The way WikibaseCirrusSearch generates its Elasticsearch mapping is suboptimal for non-English users.
To deal with the multilingual nature of Wikidata content, WikibaseCirrusSearch creates one field per language under the labels field.
Having a subfield per language is costly in terms of Elasticsearch resources but allows a great deal of customization. Sadly, WikibaseCirrusSearch does not take full advantage of it.
As of today the mapping for the labels.ko field is:
"labels": {
  "properties": {
    "ko": {
      "type": "text",
      "fields": {
        "near_match": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "near_match"
        },
        "near_match_folded": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "near_match_asciifolding"
        },
        "plain": {
          "type": "text",
          "analyzer": "ko_plain",
          "search_analyzer": "ko_plain_search",
          "similarity": "bm25",
          "position_increment_gap": 10
        },
        "prefix": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "prefix_asciifolding",
          "search_analyzer": "near_match_asciifolding"
        }
      },
      "copy_to": [ "labels_all" ]
    }
  }
}
This indexes a field named labels.ko using the Elasticsearch default text analyzer, since no analyzer is set on the field itself.
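To make the gap concrete, here is a minimal Python sketch that lists the labels.<lang> fields relying on the default analyzer. The helper name `fields_using_default_analyzer` and the trimmed `current_mapping` dict are illustrative assumptions, not part of WikibaseCirrusSearch. The check relies on one Elasticsearch behaviour: a text field without an explicit "analyzer" falls back to the index default analyzer (the `standard` analyzer unless the index configures another default).

```python
def fields_using_default_analyzer(labels_mapping: dict) -> list:
    """Return the labels.<lang> text fields that set no explicit "analyzer".

    Hypothetical helper for illustration only. Subfields under "fields"
    are ignored: only the top-level per-language field is inspected.
    """
    properties = labels_mapping["labels"]["properties"]
    return sorted(
        f"labels.{lang}"
        for lang, field in properties.items()
        if field.get("type") == "text" and "analyzer" not in field
    )


# Trimmed copy of the current mapping: the "plain" subfield has a
# language-specific analyzer, but labels.ko itself has none.
current_mapping = {
    "labels": {
        "properties": {
            "ko": {
                "type": "text",
                "fields": {
                    "plain": {
                        "type": "text",
                        "analyzer": "ko_plain",
                        "search_analyzer": "ko_plain_search",
                    }
                },
                "copy_to": ["labels_all"],
            }
        }
    }
}

print(fields_using_default_analyzer(current_mapping))  # ['labels.ko']
```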
The mapping should look like:
"labels": {
  "properties": {
    "ko": {
      "type": "text",
      "fields": {
        "near_match": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "near_match"
        },
        "near_match_folded": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "near_match_asciifolding"
        },
        "plain": {
          "type": "text",
          "analyzer": "ko_plain",
          "search_analyzer": "ko_plain_search",
          "similarity": "bm25",
          "position_increment_gap": 10
        },
        "prefix": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "prefix_asciifolding",
          "search_analyzer": "near_match_asciifolding"
        }
      },
      "copy_to": [ "labels_all" ],
      "analyzer": "ko_text",
      "search_analyzer": "ko_text_search",
      "similarity": "bm25",
      "position_increment_gap": 10
    }
  }
}
In addition, when searching in Korean, the filter should be adapted to also query the labels.ko field and its language fallbacks, as is already done for the descriptions.$lang field.
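As a rough illustration of querying labels.<lang> together with its fallbacks, the sketch below builds an Elasticsearch `bool` query with one `match` clause per language. Everything specific here is an assumption: the function name `labels_query`, the `fallbacks` table (a real fallback chain would come from MediaWiki's language configuration), and the search term. It is not how WikibaseCirrusSearch builds its filter today.

```python
def labels_query(term: str, lang: str, fallbacks: dict) -> dict:
    """Build a bool query over labels.<lang> and its fallback languages.

    Hypothetical helper: `fallbacks` maps a language code to an ordered
    list of fallback codes and is supplied by the caller.
    """
    # Search language first, then its fallbacks, skipping duplicates.
    chain = [lang]
    for code in fallbacks.get(lang, []):
        if code not in chain:
            chain.append(code)

    return {
        "bool": {
            # One match clause per language field; a document matches
            # if at least one of them matches.
            "should": [
                {"match": {f"labels.{code}": {"query": term}}}
                for code in chain
            ],
            "minimum_should_match": 1,
        }
    }


# Illustrative fallback table and search term, not real configuration.
example_fallbacks = {"ko": ["en"]}
query = labels_query("서울", "ko", example_fallbacks)
print([next(iter(clause["match"])) for clause in query["bool"]["should"]])
# ['labels.ko', 'labels.en']
```

Because each `match` clause targets labels.<lang> directly, the term is analyzed with that field's own search analyzer, which is why the mapping change has to land first.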
AC:
- searching for a word in a label should use the language-specific analyzers: e.g. searching for a Korean word that is part of a label tagged as Korean should yield search results (TODO: add a specific example here)