Add accent squashing to Russian/Cyrillic analyser
Closed, ResolvedPublic

Assigned To

Authored By

	• Manybubbles
	Jun 12 2015, 9:39 PM

Description

From the mailing list:
From Lars Aronsson lars@aronsson.se via lists.wikimedia.org

This is a suggestion to change search, so it ignores
postfix accents.

Russian dictionaries (including Wiktionary) use accents to
indicate stress on syllables, but these accents are never
seen in plain text.

In Russian Wiktionary, the verb бороться has the
inflected form боритесь (imperative, plural),
which does not have an entry of its own, but
appears in a fact box (table) of inflected forms.
However, since this is a dictionary, the word in
the box is written with an accent: бори́тесь
https://ru.wiktionary.org/wiki/бороться

(I do realize that it would be possible to add
redirect entries for all such inflected forms,
but this has not been done in ru.wiktionary.)

Searching for бори́тесь (which nobody would do)
finds the relevant page,
https://ru.wiktionary.org/w/index.php?search=бори́тесь

but searching for боритесь (the normal thing)
does not find the relevant page,
https://ru.wiktionary.org/w/index.php?search=боритесь

Note that Unicode doesn't contain accented versions
of Cyrillic letters. Instead, the accent is made
by suffixing a separate accent sign.

$ echo "и" | od -c
0000000 320 270  \n

$ echo "и́" | od -c
0000000 320 270 314 201  \n
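The same observation can be reproduced without `od`. This is a minimal Python sketch; the helper names `describe` and `octal_bytes` are made up for illustration. It lists the code points in the accented form and prints the UTF-8 bytes in octal, so they can be compared with the dumps above.

```python
import unicodedata

plain = "\u0438"            # и, CYRILLIC SMALL LETTER I
stressed = "\u0438\u0301"   # и followed by COMBINING ACUTE ACCENT


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def octal_bytes(text):
    """Return the UTF-8 bytes of text as octal strings, as od -c shows them."""
    return [f"{b:03o}" for b in text.encode("utf-8")]


print(describe(stressed))
print(octal_bytes(plain))     # the two bytes of и
print(octal_bytes(stressed))  # the same two bytes plus two more for the accent
```

The accent is a separate code point (U+0301), which is why a search for the unaccented spelling does not match the accented text byte for byte.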

Nik thinks this might be something we can get out of the unicode normalizer. We should have a look here.
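As a quick check of that idea, here is a small Python sketch (not CirrusSearch code; `strip_stress` is a hypothetical helper). It suggests that Unicode normalization on its own keeps the accent, because и plus U+0301 has no precomposed form, so the mark has to be removed explicitly. The sketch removes only U+0301, so й and ё, which decompose to other combining marks, are left alone.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, used as the stress mark


def strip_stress(text):
    """Remove combining acute accents; other combining marks are kept.

    Decomposing first means precomposed letters that carry an acute
    (for example Latin é) lose it too. That is an assumption of this
    sketch, not a statement about what the search analyzer does.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


stressed = "бори\u0301тесь"

# None of the four normalization forms drops the accent by itself.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, ACUTE in unicodedata.normalize(form, stressed))

print(strip_stress(stressed) == "боритесь")
print(strip_stress("й") == "й", strip_stress("ё") == "ё")
```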

Details

	Subject	Repo	Branch	Lines +/-
	Add tests for Russian folding	mediawiki/extensions/CirrusSearch	master	+64 -0
	Squash Stress Accents and Fold Ё to Е for Russian Wikis	mediawiki/extensions/CirrusSearch	master	+77 -10


Related Objects

Mentioned In: T154853: Consider asking communities which languages are analysed the poorest in search
T147505: [tracking] CirrusSearch: what is updated during re-indexing
T146402: Add ICU_folding filter for EN, FR and EL wiki projects
T132637: Lack of diacritic folding in e.g. Ancient Greek
T124592: Cyrillic 'Е' and 'Ё' equivalence not found by search

Event Timeline

• Manybubbles created this task. Jun 12 2015, 9:39 PM

• Manybubbles raised the priority of this task from to Medium.

• Manybubbles updated the task description.

• Manybubbles added projects: Discovery-ARCHIVED, CirrusSearch.

• Manybubbles moved this task to Search on the Discovery-ARCHIVED board.

• Manybubbles subscribed.

Restricted Application added a subscriber: Aklapper. Jun 12 2015, 9:39 PM

We might want to do this on the plain analyzer, but I think we'll have to do it on the Russian language analyzer for this to work properly with stemming.

Stakeholders: At least ruwiktionary users. Not sure who else.
Benefits: You can find lots more pages!
Estimate: At least a week I think. Maybe two. Firstly we'd have to fix testing so we can build a Russian test wiki. Secondly we'll have to rebuild the Russian analyzer to pick up unicode normalization. And we'll have to make sure that unicode normalization actually normalizes the accents.

Ricordisamoa subscribed. Jun 12 2015, 9:47 PM

Nemo_bis added a subscriber: I18n. Jun 14 2015, 12:40 AM

Nemo_bis subscribed.

• ksmith added a project: Essential-Work. Oct 27 2015, 3:50 PM

• ksmith set Security to None.

• Deskana lowered the priority of this task from Medium to Lowest. Nov 24 2015, 6:04 PM

• Deskana subscribed.

This was lowered in priority because it only affects Wiktionary, and the cost/benefit tradeoff is probably not worth it.

it only affects Wiktionary

Uh? How do you know? Why would Wikipedia (in any language) not use Russian diacritics?

In T102298#1855241, @Nemo_bis wrote:

Uh? How do you know? Why would Wikipedia (in any language) not use Russian diacritics?

My bad, I should've said "probably only affects Wiktionary in any significant fashion". I'm not certain about this, as I don't speak Russian, but I did ask a native Russian speaker.

My bad, I should've said "probably only affects Wiktionary in any significant fashion". I'm not certain about this, as I don't speak Russian, but I did ask a native Russian speaker.

Curious. We have at least one native Russian speaker on WMF staff who seems to disagree, given that diacritics are hard. https://lists.wikimedia.org/pipermail/wikidata/2015-April/005896.html

@Nemo_bis I think that mail was about a different issue, namely transliteration. But since I'm here anyway and I do speak Russian, I'll add my 2c in the hope of clarifying the issue.
This specific issue seems to be that Russian words can appear in two forms: unstressed (most of the time) and with stress marks (usually in dictionaries, schoolbooks or, well, encyclopedias when demonstrating correct pronunciation).
Example: the article https://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F starts with the word Википе́дия, which is the stressed form of the word Википедия. However, when searching, nobody looks for stressed forms (in fact, 99.99% of people would have no idea how to type them), so when indexing that page, the first word should be indexed as Википедия. This should happen everywhere (Wikipedia, Wikidata, Wiktionary, whatever it is): every text index should do that. I see no point (please correct me if I'm wrong here, but I'm pretty sure I am not) in ever indexing stressed forms, as nobody will ever search for them.

I do not know whether that is currently happening on Wikipedia, or whether Wiktionary is somehow different, so I am approaching this purely from a "how it ideally should be" point of view. But if the current analyzers do not do that, they should start doing it. As for priority, I think the Russian Wikipedia community would be able to tell us if it's a problem and how big it is. Maybe it isn't, since if it were I would expect them to have made noise much sooner, or maybe they did and I didn't hear it.
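To make the indexing argument concrete, here is a toy Python sketch of an inverted index that folds stress marks at both index time and query time. It is purely illustrative: `fold`, `build_index` and `search` are invented names, and whitespace tokenization stands in for the real analysis chain.

```python
from collections import defaultdict

STRESS = "\u0301"  # COMBINING ACUTE ACCENT


def fold(token):
    """Drop stress marks and lowercase, so both spellings share one key."""
    return token.replace(STRESS, "").lower()


def build_index(pages):
    """Map each folded token to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for token in text.split():
            index[fold(token)].add(title)
    return index


def search(index, query):
    """Fold the query the same way the text was folded, then look it up."""
    return sorted(index.get(fold(query), set()))


pages = {"Википедия": "Википе\u0301дия свободная энциклопедия"}
index = build_index(pages)

print(search(index, "Википедия"))        # unstressed query
print(search(index, "Википе\u0301дия"))  # stressed query
```

Because the same folding runs on both sides, the stressed form never has to be stored, and a query typed either way finds the page.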

• Deskana moved this task from Inbox to Multilingual and cross-project on the CirrusSearch board. Dec 31 2015, 12:38 AM

EBernhardson mentioned this in T124592: Cyrillic 'Е' and 'Ё' equivalence not found by search. Feb 2 2016, 6:03 PM

ObsequiousNewt mentioned this in T132637: Lack of diacritic folding in e.g. Ancient Greek. Apr 13 2016, 10:01 PM

Danny_B added a project: I18n. Jul 3 2016, 11:44 PM

Danny_B removed a subscriber: I18n.

Restricted Application added a project: Discovery-Search. Jul 3 2016, 11:44 PM

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search. Sep 21 2016, 3:48 PM

TJones claimed this task. Sep 22 2016, 3:20 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

debt mentioned this in T146402: Add ICU_folding filter for EN, FR and EL wiki projects. Sep 22 2016, 6:34 PM

We'll be looking at doing this for both the full-text search analysis chain and the completion suggester.

Analysis: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Removing_Stress_Accents_and_Folding_%D0%81_to_%D0%95_for_Russian_Wikis

Summary: unpacking the Elasticsearch Russian analysis chain has some unintended Unicode effects (which are mostly positive). A small number of non-Cyrillic characters have the same accents (mostly etymological info), but that seems like a good trade for generic handling of accents that end up in the wrong place (e.g., on consonants).

Patch to follow. Once this is merged, it won't become effective immediately. The changes will go live when indexes are rebuilt as part of the BM25 upgrade. Both full text and completion suggester have been updated.
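For readers unfamiliar with what "unpacking" an analysis chain looks like, here is a rough Python sketch that builds Elasticsearch index settings as a plain dictionary. It is an assumption-laden illustration, not the contents of the patch. The names `russian_charfilter` and `russian_folded` are chosen here for the example, and the filter list mirrors the documented equivalent of the built-in Russian analyzer (without the optional keyword marker). The mapping char filter removes U+0301 and maps Ё/ё to Е/е before tokenization.

```python
import json


def russian_folding_settings():
    """Return a sketch of index settings for an unpacked Russian analyzer.

    All names are illustrative; see the Gerrit change for the real config.
    """
    return {
        "analysis": {
            "char_filter": {
                "russian_charfilter": {
                    "type": "mapping",
                    "mappings": [
                        "\u0301=>",            # drop the combining acute accent
                        "\u0451=>\u0435",      # ё => е
                        "\u0401=>\u0415",      # Ё => Е
                    ],
                }
            },
            "filter": {
                "russian_stop": {"type": "stop", "stopwords": "_russian_"},
                "russian_stemmer": {"type": "stemmer", "language": "russian"},
            },
            "analyzer": {
                "russian_folded": {
                    "type": "custom",
                    "char_filter": ["russian_charfilter"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "russian_stop", "russian_stemmer"],
                }
            },
        }
    }


settings = russian_folding_settings()
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Running the char filter ahead of the stemmer is the point of doing this in the Russian analyzer rather than only in the plain one: the stemmer then sees the unaccented word.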

Change 312547 had a related patch set uploaded (by Tjones):
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

gerritbot added a project: Patch-For-Review. Sep 23 2016, 6:55 PM

Smalyshev awarded a token. Sep 23 2016, 11:09 PM

Change 313609 had a related patch set uploaded (by Smalyshev):
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609

Change 312547 merged by jenkins-bot:
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

Change 313609 merged by jenkins-bot:
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609

TJones moved this task from not in use - please delete to Needs Reporting on the Discovery-Search (Current work) board. Oct 6 2016, 1:56 PM

TJones mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. Oct 6 2016, 6:04 PM