Add accent squashing to Russian/Cyrillic analyser
Closed, Resolved (Public)

Description

From the mailing list:
From Lars Aronsson lars@aronsson.se via lists.wikimedia.org

This is a suggestion to change search, so it ignores
postfix accents.

Russian dictionaries (including Wiktionary) use accents to
indicate stress on syllables, but these accents are never
seen in plain text.

In Russian Wiktionary, the verb бороться has the
inflected form боритесь (imperative, plural),
which does not have an entry of its own, but
appears in a fact box (table) of inflected forms.
However, since this is a dictionary, the word in
the box is written with an accent: бори́тесь
https://ru.wiktionary.org/wiki/бороться

(I do realize that it would be possible to add
redirect entries for all such inflected forms,
but this has not been done in ru.wiktionary.)

Searching for бори́тесь (which nobody would do)
finds the relevant page,
https://ru.wiktionary.org/w/index.php?search=бори́тесь

but searching for боритесь (the normal thing)
does not find the relevant page,
https://ru.wiktionary.org/w/index.php?search=боритесь

Note that Unicode doesn't contain accented versions
of Cyrillic letters. Instead, the accent is made
by suffixing a separate accent sign.

$ echo "и" | od -c
0000000 320 270  \n

$ echo "и́" | od -c
0000000 320 270 314 201  \n

Nik thinks this might be something we can get out of the unicode normalizer. We should have a look here.
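
A quick way to check the normalizer idea, sketched in Python with the standard unicodedata module. This is an illustration only, not part of the original report, and the helper names strip_stress and strip_all_marks are hypothetical. It suggests that NFC normalization alone leaves the stress mark in place, because there is no precomposed и-with-acute to compose into, and that stripping every combining mark after decomposition would also turn й into и. Removing only U+0301 avoids that side effect.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # the stress mark typed after the vowel


def strip_stress(text: str) -> str:
    """Remove only the combining acute accent; leave all other marks alone."""
    return text.replace(COMBINING_ACUTE, "")


def strip_all_marks(text: str) -> str:
    """Naive alternative: decompose, drop every nonspacing mark, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


stressed = "бори\u0301тесь"  # the form shown in the inflection table
plain = "боритесь"           # the form people actually type

# Same bytes as the od output above: 320 270 314 201 (octal) = d0 b8 cc 81 (hex)
accented_bytes = "и\u0301".encode("utf-8")

# NFC has nothing to compose и + U+0301 into, so the string is unchanged
nfc = unicodedata.normalize("NFC", stressed)

print(accented_bytes.hex(" "))
print(nfc == stressed)
print(strip_stress(stressed) == plain)
print(strip_all_marks("мой"))  # й loses its breve, which is not wanted
```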

Event Timeline

Manybubbles raised the priority of this task to Medium.
Manybubbles updated the task description.
Manybubbles moved this task to Search on the Discovery-ARCHIVED board.
Manybubbles subscribed.

We might want to do this on the plain analyzer, but I think we'll have to do it on the Russian language analyzer for this to work properly with stemming.

Stakeholders: At least ruwiktionary users. Not sure who else.
Benefits: You can find lots more pages!
Estimate: At least a week, I think. Maybe two. First, we'd have to fix testing so we can build a Russian test wiki. Second, we'll have to rebuild the Russian analyzer to pick up Unicode normalization. And we'll have to make sure that Unicode normalization actually normalizes the accents.

Deskana lowered the priority of this task from Medium to Lowest. Nov 24 2015, 6:04 PM
Deskana subscribed.

This was lowered in priority because it only affects Wiktionary, and the cost/benefit tradeoff is probably not worth it.

> it only affects Wiktionary

Uh? How do you know? Why would Wikipedia (in any language) not use Russian diacritics?

My bad, I should've said "probably only affects Wiktionary in any significant fashion". I'm not certain about this, as I don't speak Russian, but I did ask a native Russian speaker.

Curious. We have at least one native Russian speaker on the WMF staff who seems to disagree, given that diacritics are hard: https://lists.wikimedia.org/pipermail/wikidata/2015-April/005896.html

@Nemo_bis I think that mail was about a different issue, namely transliteration. But since I'm here anyway and I do speak Russian, I'll add my 2c in the hope of clarifying the issue.
This specific issue seems to be that Russian words can appear in two forms: unstressed (most of the time) and with stress marks (usually in dictionaries, schoolbooks or, well, encyclopedias when demonstrating correct pronunciation).
Example: the article https://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F starts with the word Википе́дия, which is the stressed form of the word Википедия. However, when searching, nobody looks for stressed forms (in fact, 99.99% of people would have no idea how to type them), so when indexing that page, the first word should be indexed as Википедия. This should happen everywhere (Wikipedia, Wikidata, Wiktionary, whatever it is): every text index should do that. I see no point (please correct me if I'm wrong here, but I'm pretty sure I am not) in ever indexing stressed forms, as nobody will ever search for them.

I do not know whether that is currently happening on Wikipedia or whether Wiktionary is somehow different, so I am approaching it purely from a "how it ideally should be" point of view. But if the current analyzers do not do that, they should start doing it. As for priority, I think the Russian Wikipedia community would be able to tell us if it's a problem and how big it is. Maybe it isn't, since if it were I would expect them to have made noise much sooner, or maybe they did and I didn't hear it.
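
The indexing behaviour described in this comment can be sketched with a toy inverted index in Python. This is only an illustration under simple assumptions (whitespace tokenization, no stemming); the names normalize, build_index and search are hypothetical and have nothing to do with the actual CirrusSearch code. The point is that the same normalization is applied at index time and at query time, so the stressed and unstressed spellings land on the same term.

```python
from collections import defaultdict

STRESS = "\u0301"  # combining acute accent used as a stress mark


def normalize(token: str) -> str:
    """Drop stress marks and lowercase, so both spellings map to one term."""
    return token.replace(STRESS, "").lower()


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each normalized token to the set of page titles containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for title, text in pages.items():
        for token in text.split():
            index[normalize(token)].add(title)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Look up a single-word query after normalizing it the same way."""
    return index.get(normalize(query), set())


# The article text starts with the stressed form, as in the example above.
pages = {"Википедия": "Википе\u0301дия свободная энциклопедия"}
index = build_index(pages)

print(search(index, "Википедия"))         # unstressed query
print(search(index, "Википе\u0301дия"))   # stressed query
```

Both queries return the same page, because neither the index nor the query ever sees the stress mark.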

We'll be looking at doing this for the full-text search analysis chain and the completion suggester.

Analysis: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Removing_Stress_Accents_and_Folding_%D0%81_to_%D0%95_for_Russian_Wikis

Summary: unpacking the Elasticsearch Russian analysis chain has some unintended Unicode effects (which are mostly positive). A small number of non-Cyrillic characters have the same accents (mostly etymological info), but that seems like a good trade for generic handling of accents that end up in the wrong place (e.g., on consonants).

Patch to follow. Once this is merged, it won't become effective immediately. The changes will go live when indexes are rebuilt as part of the BM25 upgrade. Both full text and completion suggester have been updated.
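
For readers unfamiliar with what "unpacking" the analyzer means, here is a hedged sketch of how such a configuration might look, written as a Python dict that could be serialized to JSON index settings. It is not the merged patch: the filter chain follows the rebuilt russian analyzer described in the Elasticsearch documentation, while the names russian_charfilter and apply_mappings, and the exact mapping list, are assumptions made for illustration. The small helper only emulates the single-character mappings locally so the effect can be seen without a cluster.

```python
import json

# Assumed settings: an unpacked Russian analyzer with a mapping char filter
# in front that removes U+0301 and folds Ё/ё to Е/е before tokenization.
settings = {
    "analysis": {
        "char_filter": {
            "russian_charfilter": {  # hypothetical name
                "type": "mapping",
                "mappings": ["\u0301=>", "\u0451=>\u0435", "\u0401=>\u0415"],
            }
        },
        "filter": {
            "russian_stop": {"type": "stop", "stopwords": "_russian_"},
            "russian_stemmer": {"type": "stemmer", "language": "russian"},
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "char_filter": ["russian_charfilter"],
                "tokenizer": "standard",
                "filter": ["lowercase", "russian_stop", "russian_stemmer"],
            }
        },
    }
}


def apply_mappings(text: str, mappings: list[str]) -> str:
    """Local emulation of the char filter for single-character keys only."""
    for rule in mappings:
        source, _, target = rule.partition("=>")
        text = text.replace(source, target)
    return text


rules = settings["analysis"]["char_filter"]["russian_charfilter"]["mappings"]
folded = apply_mappings("Ёлка бори\u0301тесь", rules)

print(json.dumps(settings, indent=2))  # the settings as JSON
print(folded)
```

Doing the folding in a character filter means it happens before the stemmer sees the token, which matches the earlier concern that the change has to live in the Russian analyzer to work properly with stemming.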

Change 312547 had a related patch set uploaded (by Tjones):
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

Change 313609 had a related patch set uploaded (by Smalyshev):
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609

Change 312547 merged by jenkins-bot:
Squash Stress Accents and Fold Ё to Е for Russian Wikis

https://gerrit.wikimedia.org/r/312547

Change 313609 merged by jenkins-bot:
Add tests for Russian folding

https://gerrit.wikimedia.org/r/313609