
Investigate impact of folding diacritics in Slovak
Open, Needs Triage · Public

Description

I discussed searching without diacritics (which can be missing depending on your keyboard) with @Jetam2, so we should investigate the impact of stripping the diacritics on search, and then take the results to the community to discuss whether it should be implemented.

Event Timeline

TJones created this task. May 19 2019, 9:47 AM
Restricted Application added a subscriber: Aklapper. May 19 2019, 9:47 AM

@Jetam2, also check out the Universal Language Selector. It's not a virtual keyboard, rather it remaps your physical keyboard, so if you can touch type, it should give you access to all the characters you need.

(As a side note, I figured out what happened with the Wikidata search. It has a series of fallback languages. For Slovak it will also search the Czech and English labels and descriptions (with much lower weight), so when neither Slovak nor Czech matches Sibenik, English does.)

If we can find each other at the Hackathon again, I can demo the ULS (so you can try typing in Slovak on my computer) and I can show you how to find the Wikidata language fallbacks (short version: add &cirrusDumpQuery to the search URL).

TJones claimed this task.
TJones added a subscriber: Amire80. Thu, May 30, 2:01 PM

I'm still working on gathering data and doing the analysis. I've had some computer problems and I have an end-of-month project I have to focus on this week.

However, there's some good news: @Amire80 recently released a bunch of updates to the Universal Language Selector for African languages. I'm not very familiar with the ULS, so I didn't realize it could do multi-character substitutions. For example, with the "Bambara tilde" keyboard you can type á as a~/, ô as o~^, and č as c~v. ä can be typed as a~: in other keyboards (like Sängö).

It seems like it would be very straightforward to create an appropriate "tilde keyboard" for Slovak, if that would help—especially if the analysis of searching without diacritics doesn't look good.
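A minimal sketch of how such a "tilde keyboard" could behave, written in Python rather than in the actual ULS/jquery.ime rule format. The key-to-mark table is an assumption modeled on the sequences mentioned above (a~/, o~^, c~v, a~:), and `apply_tilde_keyboard` is a hypothetical name, not part of ULS:

```python
import re
import unicodedata

# Assumed mapping from the key typed after "~" to a combining mark.
MARKS = {
    "/": "\u0301",  # acute:      a~/ -> á
    "v": "\u030C",  # caron:      c~v -> č
    "^": "\u0302",  # circumflex: o~^ -> ô
    ":": "\u0308",  # diaeresis:  a~: -> ä
}

# A base letter, a tilde, then one of the mark keys.
PATTERN = re.compile(r"([A-Za-z])~([/v^:])")


def apply_tilde_keyboard(typed: str) -> str:
    """Replace letter~key sequences with the composed accented letter."""
    def compose(match):
        base, key = match.group(1), match.group(2)
        # NFC joins the base letter and combining mark into one character
        # whenever a precomposed form exists.
        return unicodedata.normalize("NFC", base + MARKS[key])

    return PATTERN.sub(compose, typed)


print(apply_tilde_keyboard("ma~:so"))
```

With this table, typing `ma~:so` yields mäso, so a searcher with a plain ASCII keyboard could still enter the exact Slovak spelling.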

Thank you.

I created a note in our Teahouse to ask for comments.

As for Latin chars with diacritics, maybe even a simple iconv/recode-based ASCII translit term normalization (on both the indexing and the lookup side) would be sufficient for Slovak and other Latin-based alphabets:

> echo áâăąäćçčďđëéěęíîĺľłńňöóőôŕřśşšťţüúůűýżźž | iconv -f UTF8 -t ASCII//TRANSLIT
aaaaacccddeeeeiilllnnoooorrsssttuuuuyzzz

> echo áâăąäćçčďđëéěęíîĺľłńňöóőôŕřśşšťţüúůűýżźž | recode -f UTF8..flat
aaaaacccddeeeeiilllnnoo"oorrsssttuuu"uyzzz

Non-Latin scripts (Greek, Cyrillic, ...) and other fancy Unicode blocks should stay unmapped (or transliterated properly).
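The same idea can be sketched in Python with the standard `unicodedata` module. This is an illustration only, not what iconv or recode do internally; `fold_latin` and the small `EXTRA` table are hypothetical names introduced here. Letters whose base is not plain ASCII (Greek, Cyrillic, and so on) are passed through untouched:

```python
import unicodedata

# Letters with no canonical decomposition need a hand-written mapping.
# Illustrative, not exhaustive.
EXTRA = {"đ": "d", "Đ": "D", "ł": "l", "Ł": "L"}


def fold_latin(text: str) -> str:
    """Strip diacritics from Latin letters; leave other scripts unchanged."""
    out = []
    for ch in text:
        if ch in EXTRA:
            out.append(EXTRA[ch])
            continue
        # NFD splits "á" into "a" plus a combining acute accent.
        base = unicodedata.normalize("NFD", ch)[0]
        if base.isascii() and base.isalpha():
            out.append(base)  # Latin letter: keep only the base
        else:
            out.append(ch)    # anything else stays as typed
    return "".join(out)


print(fold_latin("áâăąäćçčďđëéěęíîĺľłńňöóőôŕřśşšťţüúůűýżźž"))
print(fold_latin("mäso Šibenik й ά"))
```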

TJones added a comment. Edited · Fri, May 31, 8:45 PM

[Ugh. Accidentally saved. Back in a minute with a proper comment.]

@Jetam2, thanks for starting the Teahouse discussion!

@Teslaton, thanks for that! However, it's not the actual conversion from characters with diacritics to plain ASCII versions that's the problem. It's the question of whether doing the folding is a good idea. Two years ago I worked on T155822, which led to a similar discussion about Swedish å, ä, and ö. The relevant conversation on Phab starts about here: T155822#3098703

The generalization I drew from that was that it makes sense to keep characters in the alphabet of a language distinct when searching in that language, and to fold other characters (which is why English speakers often jump to the idea of folding everything—no diacritics really matter in English).
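A rough sketch of that generalization in Python: fold every diacritic except the letters a given language treats as part of its own alphabet. The `PRESERVE` table and the function name `fold_except` are assumptions made for illustration; the Swedish entry comes from the å, ä, ö discussion, and this is not the production analysis chain:

```python
import unicodedata

# Assumed per-language lists of letters that should stay distinct.
PRESERVE = {
    "sv": set("åäöÅÄÖ"),
}


def fold_except(text: str, lang: str) -> str:
    """Strip diacritics, except on letters native to the given language."""
    keep = PRESERVE.get(lang, set())
    out = []
    # NFC first, so each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop the combining marks, keep the base character(s).
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold_except("Zoë Åkesson", "sv"))
print(fold_except("Zoë Åkesson", "en"))
```

On a Swedish wiki the ë is folded but Å survives; with an empty preserve list (as for English) everything is folded.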

So, my goal for this task is to see how big an impact this will have by seeing how often words would "collide" because of diacritic stripping. For example, searching for mäso would also find all the results for maso, and vice versa. The question is whether this is a small problem or a big problem.
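One way to measure such collisions, sketched in Python under the assumption that the corpus is available as a simple list of words. The helper names are hypothetical, and the real analysis also involves the stemmer, which this ignores:

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def find_collisions(words):
    """Group distinct lowercased words that become identical once folded."""
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        groups[strip_diacritics(lowered)].add(lowered)
    # Only groups with more than one distinct spelling are collisions.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


print(find_collisions(["mäso", "maso", "Maso", "pes"]))
```

The share of words that land in a collision group gives a first estimate of whether this is a small problem or a big one.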

Also, I will look at queries to see if it's possible to tell how often people are searching for words without diacritics. Maybe it's a common problem, or maybe it's a rare problem, or maybe we can't tell.

Then we can discuss what to do about it. The obvious potential solutions are stripping the diacritics or enabling a Universal Language Selector keyboard that allows people to type diacritics without having them on their keyboard.

I should be able to work on all this next week.

Teslaton added a comment. Edited · Fri, May 31, 9:09 PM

As far as I know, the most common (and, from the user's point of view, "expected") approach when implementing FTI solutions in Slovak-centered projects is to normalize all Latin variants of each char into its plain ASCII equivalent. The resulting side effect, that words like mäso/maso become equivalent in search, is a minor one in Slovak (and users can work around it by scanning the search results, where term matches are shown in their original, non-normalized form).

A "perfect" solution would be to index folded variants but boost the rank of exact (unfolded) matches a bit when sorting search results. I don't know whether this is possible in the FTI implementation on the skwiki MW backend.

Is this issue unique to Slovak? Isn't it very similar in French, German, Italian, Spanish, and Czech? I imagine that whatever works for these languages should work for Slovak, too.

It's possible to make a ULS/jquery.ime keyboard for Slovak, but I suspect that it's not quite the solution for whatever is the issue here. As far as I can guess, it's not very difficult to find a keyboard that can type Slovak with all the necessary diacritics on desktop computers and mobile devices, but please correct me if I'm wrong. We usually build ULS/jquery.ime keyboards for languages in which finding such a keyboard is difficult.

TJones added a comment. Mon, Jun 3, 4:13 PM

@Teslaton:

A "perfect" solution would be to index folded variants but boost the rank of exact (unfolded) matches a bit when sorting search results. I don't know whether this is possible in the FTI implementation on the skwiki MW backend.

It's generally not possible to support language-specific or project-specific indexes, but what you suggest is already what we do! You can see the difference on English Wikipedia in the differences in ranking when searching for zoe and zoë.
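A toy illustration of that ranking behaviour in Python. It is a sketch of the general idea only (match on folded forms, add a bonus for exact matches), not the actual search backend; the scoring values, the `exact_boost` parameter, and the sample titles are invented for the example:

```python
import unicodedata


def norm(s: str) -> str:
    """Lowercase and normalize to NFC so equal spellings compare equal."""
    return unicodedata.normalize("NFC", s).lower()


def fold(s: str) -> str:
    """Lowercase and strip combining marks."""
    decomposed = unicodedata.normalize("NFD", s.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def search(query, titles, exact_boost=2.0):
    """Match on folded tokens; rank exact (unfolded) matches higher."""
    q_exact, q_folded = norm(query), fold(query)
    scored = []
    for title in titles:
        score = 0.0
        for token in title.split():
            if fold(token) == q_folded:
                score += 1.0              # folded match
                if norm(token) == q_exact:
                    score += exact_boost  # bonus for the exact spelling
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


titles = ["Zoe", "Zoë", "Zola"]
print(search("zoë", titles))
print(search("zoe", titles))
```

Both queries find both titles, but each query ranks its own spelling first.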

@Amire80: I doubt any issue is truly unique anywhere, but what @Jetam2 told me at the Hackathon was that some Slovak users don't have access to all the Slovak diacritics at least some of the time.

If you don't touch type, having a keyboard mapping on a desktop/laptop doesn't solve the problem.

Isn't it very similar in French, German, Italian, Spanish, and Czech?

I thought so, but maybe not. Swedish speakers seemed to very much want to keep the diacritical characters distinct (T155822).

The generalization I drew from [Swedish] was that it makes sense to keep characters in the alphabet of a language distinct when searching in that language, and to fold other characters

But @Teslaton says above that that's not the expectation in Slovak:

in Slovak-centered projects is to normalize all Latin variants of each char into its plain ASCII equivalent

There's a similar request for Serbian: T138858.

It's possible to make a ULS/jquery.ime keyboard for Slovak, but I suspect that it's not quite the solution for whatever is the issue here.

I can't help but worry about unwanted word collisions, so my plan was to review a sample of Wikipedia and Wiktionary articles and see how bad the collisions seem. If they are unacceptable, then enabling a ULS keyboard would be a fallback plan.

This took a little longer than I expected because the changes were much bigger than I expected, and it's been a while since I looked at this kind of relative change (rather than implementing a whole new stemmer, say), so the tools needed a little updating.

My write-up so far is on MediaWiki. A summary, links to specific parts, and a request for help are below.

  • It does indeed look like Slovak searchers search for words with and without diacritics (examples from the names of recent presidential candidates show lots of variation).
  • Enabling folding for Slovak letters (Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž) interacts with the stemmer, and a lot of related forms of words are no longer stemmed correctly, because stemming occurs after folding. The suffixes -ách, -aný, -ého, -ému, -í, -ú, -ých, -ým, and -ými in particular do not fare well.
  • Obvious next steps are to move folding after stemming (easier) or to try to make the stemmer able to process words without diacritics (harder).
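The ordering problem can be shown with a toy example in Python. The "stemmer" here only strips the suffixes listed above and is in no way the real Slovak stemmer; all function names are hypothetical:

```python
import unicodedata

# Suffixes named in the summary above, longest first so "ými" wins over "ým".
SUFFIXES = sorted(
    ["ách", "aný", "ého", "ému", "í", "ú", "ých", "ým", "ými"],
    key=len,
    reverse=True,
)


def fold(word: str) -> str:
    """Strip combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(word: str) -> str:
    """Strip one known suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def fold_then_stem(word: str) -> str:
    return toy_stem(fold(word))  # the suffixes no longer match after folding


def stem_then_fold(word: str) -> str:
    return fold(toy_stem(word))  # the stemmer still sees the diacritics


forms = ["dobrého", "dobrých", "dobrým"]
print({fold_then_stem(w) for w in forms})
print({stem_then_fold(w) for w in forms})
```

With folding first, the three forms end up as three separate index terms; with stemming first, they share the stem dobr. In this toy version a query typed without diacritics (dobrych) would still not reach that stem, which is the case the harder option would address.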

However, I'm going to be out of the office for the next two weeks (I'll be back the first week of July).

So, it would be great if Slovak speakers (including @Jetam2 and @Teslaton, thanks!) could review the examples in the write-up, identify any additional problems, and verify that the other changes are desired. You can leave comments here or on the Discussion page for the write-up.

There are about 100 groups of words showing before-and-after changes to the groups of words that will be indexed together.

They are divided into 3 main sections, to help organize the data and put similar cases together:

  • Groups that lost words—these are mostly because of the suffixes -ách, -aný, -ého, -ému, -í, -ú, -ých, -ým, and -ými not being seen by the stemmer.
  • Groups that gained words—these are mostly the expected changes, like Amalia and Amália being grouped together, though there are some potential odd cases.
  • Groups that gained and lost words—these are a combination of the two above; I moved them to the end because they are more complicated to read through.

Each section is divided into 3 sub-groups:

  • A random sample—this is the most representative sample.
  • "High-Impact" groups—groups that lost or gained a lot of words (10 or more); these are more likely to be problem cases, since lots of ambiguity is being introduced.
  • Groups with high-frequency words—groups with words that occur 1000 times or more in the sample. These are also likely to be problem cases, because grouping really common words with rarer words tends to swamp the rarer words.
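The sections and sub-groups above could be assigned mechanically along these lines. This is a sketch with assumed data shapes (sets of words before and after the change, plus a word-frequency dictionary), not the tooling actually used. It reads "lost or gained 10 or more" as applying to either count separately, which is an assumption, and the frequencies in the example are invented:

```python
def classify_group(before, after, freq, impact_min=10, freq_min=1000):
    """Label one stemming group by how it changed once folding is enabled."""
    before, after = set(before), set(after)
    lost = before - after    # words no longer indexed with this group
    gained = after - before  # words newly indexed with this group

    if lost and gained:
        section = "gained and lost"
    elif lost:
        section = "lost"
    elif gained:
        section = "gained"
    else:
        section = "unchanged"

    return {
        "section": section,
        # "High-impact": 10 or more words lost, or 10 or more gained.
        "high_impact": len(lost) >= impact_min or len(gained) >= impact_min,
        # "High-frequency": any member occurs 1000 times or more.
        "high_frequency": any(freq.get(w, 0) >= freq_min for w in before | after),
    }


# Invented frequencies, for illustration only.
print(classify_group({"amália"}, {"amália", "amalia"}, {"amália": 1200, "amalia": 3}))
```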

My hope is that the groups that gained words mostly look good, and we can fix most of the lost words by moving folding to after stemming. We'll see in the first week of July.