
Cyrillic 'Е' and 'Ё' equivalence not found by search
Closed, Resolved, Public

Description

In Russian, the letter 'Е/е' is often used instead of 'Ё/ё' (note the two dots above the letter). Even professional sources, books, and dictionaries may list, for example, "чёрная дыра" (black hole) as "черная дыра". The problem is that the MediaWiki search engine treats these two letters as completely separate: it returns zero results for "черный" if the article is titled "чёрный" or merely contains the word.

Current workaround: article editors must always create a redirect from the Е spelling to the Ё spelling. This only solves the problem of finding articles by title; searching for text inside articles is still broken.
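To make the requested equivalence concrete, here is a minimal Python sketch of folding Ё to Е before comparing a query with a title. The helper names `fold_yo` and `titles_match` are hypothetical and only illustrate the expected behaviour; this is not how MediaWiki or its search backend implements matching.

```python
# Hypothetical helpers illustrating the requested behaviour; not MediaWiki code.
# Map precomposed Ё (U+0401) and ё (U+0451) to Е (U+0415) and е (U+0435).
YO_TO_YE = str.maketrans({"\u0401": "\u0415", "\u0451": "\u0435"})


def fold_yo(text: str) -> str:
    """Return text with every Ё/ё replaced by Е/е."""
    return text.translate(YO_TO_YE)


def titles_match(query: str, title: str) -> bool:
    """Compare two strings while ignoring case and the Е/Ё distinction."""
    return fold_yo(query).casefold() == fold_yo(title).casefold()
```

With this folding applied to both sides, a query spelled with Е would match a title spelled with Ё, which is the behaviour the report asks for.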

Event Timeline

SSneg raised the priority of this task from to Needs Triage.
SSneg updated the task description.
SSneg subscribed.

Hi @SSneg, thanks for taking the time to report this!

Does this refer to searching on Wikimedia websites? Or does this refer to searching in a local MediaWiki instance? (Different search backends.)
In any case, this does not seem to be about the language bundle for the MediaWiki software.

Aklapper renamed this task from Cyrillic 'Е' and 'Ё' equivalence to Cyrillic 'Е' and 'Ё' equivalence not found by search. Jan 24 2016, 4:32 PM
Aklapper changed the task status from Open to Stalled.
Aklapper set Security to None.

@Aklapper This refers to searching on ru.wikipedia.org, not local instance.

Aklapper changed the task status from Stalled to Open. Jan 26 2016, 7:58 PM
Aklapper added a project: CirrusSearch.
MaxSem triaged this task as High priority. Jan 26 2016, 11:56 PM
MaxSem subscribed.

As a Russian speaker, I find this extremely severe.

This bug should have been fixed by T69521.
Unfortunately, I think a problem prevented some indices from being updated with the new config.

Analysis config version should be 0.10 but it's still 0.9 for some wikis:

curl -XGET 'elastic1001:9200/mw_cirrus_versions/version/_search?size=1000&q=(analysis_min:9+AND+analysis_maj:0)' | jq .

commonswiki_file
commonswiki_content
commonswiki_general
cawikinews_content
cawikinews_general
enwikinews_content
enwikinews_general
eswikisource_content
eswikisource_general
ukwikisource_content
ukwikisource_general
jvwiki_content
jvwiki_general
ruwiki_content
ruwiki_general
suwiki_content
suwiki_general
thwiki_content
thwiki_general

We will have to do an in-place reindex for these wikis to fix this issue.
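The version check above can be sketched in Python. This assumes the version records have already been fetched and reduced to an index name plus the `analysis_maj` and `analysis_min` fields used in the query; the `IndexVersion` type, the `stale_indices` helper, and the idea of passing records in as a list are assumptions made for illustration. The point worth showing is that the versions must be compared as (major, minor) pairs, because 0.10 is newer than 0.9 even though it is smaller as a decimal number.

```python
from typing import Iterable, List, NamedTuple, Tuple


class IndexVersion(NamedTuple):
    """Hypothetical record: one index and its analysis config version."""
    index: str
    analysis_maj: int
    analysis_min: int


def stale_indices(
    versions: Iterable[IndexVersion],
    required: Tuple[int, int] = (0, 10),
) -> List[str]:
    """Return the names of indices whose analysis version is older than required.

    Tuples compare element by element, so (0, 9) < (0, 10) holds,
    whereas the float comparison 0.9 < 0.10 would be False.
    """
    return [
        v.index
        for v in versions
        if (v.analysis_maj, v.analysis_min) < required
    ]
```

Any index returned by such a check would be a candidate for the in-place reindex.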

Example queries that should give different results (check with @MaxSem for what the results should be):

curl -XGET search.svc.codfw.wmnet:9200/ruwiki_content/page/_search -d '{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"bool":{"minimum_number_should_match":1,"should":[{"query_string":{"query":"\u0435\u043f\u0442\u044b\u0442\u044c","fields":["all.plain^1","all^0.5"],"auto_generate_phrase_queries":true,"phrase_slop":0,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024","max_determinized_states":500}},{"multi_match":{"fields":["all_near_match^2"],"query":"\u0435\u043f\u0442\u044b\u0442\u044c"}}]}},"highlight":{"pre_tags":["<span class=\"searchmatch\">"],"post_tags":["<\/span>"],"fields":{"title":{"type":"experimental","fragmenter":"none","number_of_fragments":1,"matched_fields":["title","title.plain"]},"redirect.title":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["redirect.title","redirect.title.plain"]},"category":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["category","category.plain"]},"heading":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["heading","heading.plain"]},"text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000},"no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000,"skip_if_last_matched":true},"matched_fields":["auxiliary_text","auxiliary_text.plain"]}},"highlight_query":{"query_string":{"query":"\u0435\u043f\u0442\u044b\u0442\u044c","fields":["title.plain^20","redirect.title.plain^15","category.plain^8","heading.plain^5","opening_text.plain^3","text.plain^1","auxiliary_text.plain^0.5","title^10","redirect.title^7.5","category^4","heading^2.5","opening_text^1.5","text^0.5","auxiliary_text^0.25"],"auto_generate_phrase_queries":true,"phrase_slop":1,"default_operator":"AND","allow_leading_wildcard":false,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024","max_determinized_states":500}}},"suggest":{"text":"\u0435\u043f\u0442\u044b\u0442\u044c","suggest":{"phrase":{"field":"suggest","size":1,"max_errors":2,"confidence":2,"real_word_error_likelihood":0.95,"direct_generator":[{"field":"suggest","suggest_mode":"always","max_term_freq":0.5,"min_doc_freq":0,"prefix_length":2}],"highlight":{"pre_tag":"<em>","post_tag":"<\/em>"},"smoothing":{"stupid_backoff":{"discount":0.4}}}}},"stats":["suggest","full_text"],"size":20,"rescore":[{"window_size":8192,"query":{"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply","rescore_query":{"function_score":{"functions":[{"field_value_factor_with_default":{"field":"incoming_links","modifier":"log2p","missing":0}}]}}}}]}' | jq '.hits.hits|map(._source.title)'

I'm working on T102298 and I'm in a position to fix this, so I'll do this, too.

Analysis: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Removing_Stress_Accents_and_Folding_%D0%81_to_%D0%95_for_Russian_Wikis

Summary: unpacking the Elasticsearch Russian analysis chain has some unintended Unicode effects (which are mostly positive).

Patch to follow. Once this is merged, it won't become effective immediately. The changes will go live when indexes are rebuilt as part of the BM25 upgrade. Both full text and completion suggester have been updated.
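As a rough illustration of the two normalizations named in the summary (removing stress accents and folding Ё to Е), here is a small Python sketch. It is a stand-in written under stated assumptions, not the Elasticsearch analysis configuration from the patch: the function name `normalize_russian` is hypothetical, and the sketch assumes stress is marked with the combining acute accent (U+0301). It deliberately removes only that one mark rather than all combining marks, so that Й is not reduced to И.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # assumed stress mark, as in "за́мок"
YO_TO_YE = str.maketrans({"\u0401": "\u0415", "\u0451": "\u0435"})


def normalize_russian(text: str) -> str:
    """Strip stress accents, then fold Ё/ё to Е/е (illustrative only)."""
    # Drop the combining acute accent used to mark stress.
    without_stress = text.replace(COMBINING_ACUTE, "")
    # Compose е + U+0308 into the single code point ё so the fold catches it.
    composed = unicodedata.normalize("NFC", without_stress)
    # Fold the precomposed Ё/ё to Е/е.
    return composed.translate(YO_TO_YE)
```

Applying the same normalization at index time and at query time is what lets either spelling find the other.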

Change 312547 had a related patch set uploaded (by Tjones):
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

Change 313609 had a related patch set uploaded (by Smalyshev):
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609

Change 312547 merged by jenkins-bot:
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

Change 313609 merged by jenkins-bot:
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609