Cyrillic 'Е' and 'Ё' equivalence not found by search
Closed, Resolved · Public

Description

In Russian, the letter 'Е/е' is often used in place of 'Ё/ё' (note the two dots above the letter). Even professional sources, books, and dictionaries may write, for example, "чёрная дыра" (black hole) as "черная дыра". The problem is that the MediaWiki search engine treats these two letters as completely separate: a search for "черный" returns zero results if the article is titled "чёрный" or merely contains that word.

Current workaround: article editors must always create a redirect from the Е spelling of a title to the Ё spelling. This only helps when looking up articles by title. Searching for text inside articles is still broken.
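To illustrate the mismatch described above, here is a minimal sketch of a toy inverted index in Python. It is not how CirrusSearch or Elasticsearch is implemented; the helper names (`build_index`, `search`, `identity`, `fold_yo`) and the two sample documents are made up for illustration. It only shows that a lookup fails when Ё and Е are indexed as distinct letters, and succeeds when the same folding is applied at both index time and query time.

```python
from collections import defaultdict


def build_index(docs, analyze):
    """Map each analyzed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[analyze(token)].add(doc_id)
    return index


def search(index, query, analyze):
    """Look up a single-word query using the same analysis as the index."""
    return index.get(analyze(query.lower()), set())


def identity(token):
    """No folding: 'ё' and 'е' stay distinct."""
    return token


def fold_yo(token):
    """Fold 'ё' to 'е' (input is already lowercased)."""
    return token.replace("ё", "е")


# Hypothetical sample documents.
docs = {1: "Чёрная дыра", 2: "Белый карлик"}

exact_index = build_index(docs, identity)
folded_index = build_index(docs, fold_yo)

print(search(exact_index, "черная", identity))   # no match
print(search(folded_index, "черная", fold_yo))   # matches document 1
print(search(folded_index, "чёрная", fold_yo))   # still matches document 1
```

Because the folding runs on both the indexed text and the query, either spelling finds the article, and no per-title redirect is needed.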

SSneg created this task. · Jan 24 2016, 10:11 AM
SSneg updated the task description.
SSneg raised the priority of this task to Needs Triage.
SSneg added a subscriber: SSneg.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Jan 24 2016, 10:11 AM

Hi @SSneg, thanks for taking the time to report this!

Does this refer to searching on Wikimedia websites? Or does this refer to searching in a local MediaWiki instance? (Different search backends.)
In any case, this does not seem to be about the language bundle for the MediaWiki software.

Aklapper renamed this task from Cyrillic 'Е' and 'Ё' equivalence to Cyrillic 'Е' and 'Ё' equivalence not found by search. · Jan 24 2016, 4:32 PM
Aklapper changed the task status from Open to Stalled.
Aklapper set Security to None.
Kf8 added a subscriber: Kf8. · Jan 24 2016, 9:33 PM
Arbnos added a subscriber: Arbnos. · Jan 25 2016, 12:28 AM
SSneg added a comment. · Jan 26 2016, 7:39 PM

@Aklapper This refers to searching on ru.wikipedia.org, not on a local instance.

Aklapper changed the task status from Stalled to Open. · Jan 26 2016, 7:58 PM
Aklapper added a project: CirrusSearch.
Restricted Application added a project: Discovery. · Jan 26 2016, 7:58 PM
MaxSem triaged this task as High priority. · Jan 26 2016, 11:56 PM
MaxSem added a subscriber: MaxSem.

As a Russian speaker, I consider this extremely severe.

This bug should have been fixed by T69521.
Unfortunately, I think a problem prevented some indices from being updated with the new config.

Analysis config version should be 0.10 but it's still 0.9 for some wikis:

curl -XGET 'elastic1001:9200/mw_cirrus_versions/version/_search?size=1000&q=(analysis_min:9+AND+analysis_maj:0)' | jq .

commonswiki_file
commonswiki_content
commonswiki_general
cawikinews_content
cawikinews_general
enwikinews_content
enwikinews_general
eswikisource_content
eswikisource_general
ukwikisource_content
ukwikisource_general
jvwiki_content
jvwiki_general
ruwiki_content
ruwiki_general
suwiki_content
suwiki_general
thwiki_content
thwiki_general

We will have to do an in-place reindex for these wikis to fix this issue.
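The version check above can be sketched in Python as follows. This is an illustration only: the field names `analysis_maj` and `analysis_min` come from the query string above, but the `is_stale` helper and the `examplewiki_content` entry are hypothetical, and the sketch does not parse a real Elasticsearch response. The point it makes is that 0.10 is newer than 0.9 only when major and minor are compared as integers, not as decimal numbers.

```python
# Required analysis config version, as (major, minor).
REQUIRED = (0, 10)


def is_stale(analysis_maj, analysis_min, required=REQUIRED):
    """Return True if the index's analysis config is older than required."""
    return (analysis_maj, analysis_min) < required


# ruwiki_content at 0.9 is taken from the list above;
# examplewiki_content is a hypothetical up-to-date index.
versions = {
    "ruwiki_content": (0, 9),
    "examplewiki_content": (0, 10),
}

stale = sorted(
    name for name, (maj, minor) in versions.items() if is_stale(maj, minor)
)
print(stale)

# Pitfall: read as decimals, "0.10" sorts before "0.9".
print(float("0.10") < float("0.9"))
```

Comparing `(major, minor)` tuples keeps 0.10 after 0.9, which is why the query above filters on the two integer fields separately.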

Example queries that should give different results (check with @MaxSem for what the results should be):

curl -XGET search.svc.codfw.wmnet:9200/ruwiki_content/page/_search -d '{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"bool":{"minimum_number_should_match":1,"should":[{"query_string":{"query":"\u0435\u043f\u0442\u044b\u0442\u044c","fields":["all.plain^1","all^0.5"],"auto_generate_phrase_queries":true,"phrase_slop":0,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024","max_determinized_states":500}},{"multi_match":{"fields":["all_near_match^2"],"query":"\u0435\u043f\u0442\u044b\u0442\u044c"}}]}},"highlight":{"pre_tags":["<span class=\"searchmatch\">"],"post_tags":["<\/span>"],"fields":{"title":{"type":"experimental","fragmenter":"none","number_of_fragments":1,"matched_fields":["title","title.plain"]},"redirect.title":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["redirect.title","redirect.title.plain"]},"category":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["category","category.plain"]},"heading":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["heading","heading.plain"]},"text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000},"no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000,"skip_if_last_matched":true},"matched_fields":["auxiliary_text","auxiliary_text.plain"]}},"highlight_query":{"query_string":{"query":"\u0435\u043f\u0442\u044b\u0442\u044c","fields":["title.plain^20","redirect.title.plain^15","category.plain^8","heading.plain^5","opening_text.plain^3","text.plain^1","auxiliary_text.plain^0.5","title^10","redirect.title^7.5","category^4","heading^2.5","opening_text^1.5","text^0.5","auxiliary_text^0.25"],"auto_generate_phrase_queries":true,"phrase_slop":1,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024","max_determinized_states":500}}},"suggest":{"text":"\u0435\u043f\u0442\u044b\u0442\u044c","suggest":{"phrase":{"field":"suggest","size":1,"max_errors":2,"confidence":2,"real_word_error_likelihood":0.95,"direct_generator":[{"field":"suggest","suggest_mode":"always","max_term_freq":0.5,"min_doc_freq":0,"prefix_length":2}],"highlight":{"pre_tag":"<em>","post_tag":"<\/em>"},"smoothing":{"stupid_backoff":{"discount":0.4}}}}},"stats":["suggest","full_text"],"size":20,"rescore":[{"window_size":8192,"query":{"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply","rescore_query":{"function_score":{"functions":[{"field_value_factor_with_default":{"field":"incoming_links","modifier":"log2p","missing":0}}]}}}}]}' | jq '.hits.hits|map(._source.title)'

Possibly related to T102298.

EBernhardson moved this task from Needs triage to Search on the Discovery board. · Feb 11 2016, 11:21 PM
TJones claimed this task. · Sep 22 2016, 5:46 PM
Restricted Application added a project: Discovery-Search. · Sep 22 2016, 5:46 PM

I'm working on T102298 and am already in the right place to fix this, so I'll do this, too.

TJones moved this task from Needs triage to Up Next on the Discovery-Search board. · Sep 22 2016, 6:24 PM
TJones moved this task from Backlog to In progress on the Discovery-Search (Current work) board.

Analysis: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Removing_Stress_Accents_and_Folding_%D0%81_to_%D0%95_for_Russian_Wikis

Summary: unpacking the Elasticsearch Russian analysis chain has some unintended Unicode effects (which are mostly positive).

Patch to follow. Once it is merged, it won't take effect immediately: the changes will go live when the indexes are rebuilt as part of the BM25 upgrade. Both full-text search and the completion suggester have been updated.
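As a rough illustration of what "removing stress accents and folding Ё to Е" means at the character level, here is a minimal Python sketch. It is not the Elasticsearch analysis config from the patch; the function name `normalize_russian` is hypothetical, and the sketch assumes that the stress mark is the combining acute accent (U+0301) and that Ё may arrive either precomposed or as Е followed by a combining diaeresis (U+0308).

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # stress mark, as assumed in this sketch

# Fold Ё/ё to Е/е and delete the combining acute accent.
_FOLD_TABLE = str.maketrans({"Ё": "Е", "ё": "е", COMBINING_ACUTE: None})


def normalize_russian(text):
    """Compose to NFC, then fold Ё to Е and strip stress accents."""
    # NFC turns "е" + U+0308 into the single code point "ё",
    # so both encodings of the letter are folded the same way.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_FOLD_TABLE)


print(normalize_russian("чёрный") == normalize_russian("черный"))
print(normalize_russian("Москва\u0301") == "Москва")
print(normalize_russian("\u0435\u0308ж") == normalize_russian("ёж"))
```

Only U+0301 is removed here, so letters such as й, whose breve is part of the letter itself, are left untouched.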

Change 312547 had a related patch set uploaded (by Tjones):
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

Change 313609 had a related patch set uploaded (by Smalyshev):
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609

Change 312547 merged by jenkins-bot:
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

Change 313609 merged by jenkins-bot:
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609

debt closed this task as Resolved. · Oct 21 2016, 7:27 PM