
Investigate impact of folding diacritics in Slovak
Closed, Resolved (Public)

Description

I discussed searching without diacritics (which can be missing depending on your keyboard) with @Jetam2, so we should investigate the impact of stripping the diacritics on search, and then take the results to the community to discuss whether it should be implemented.

Event Timeline

@Jetam2, also check out the Universal Language Selector. It's not a virtual keyboard, rather it remaps your physical keyboard, so if you can touch type, it should give you access to all the characters you need.

(As a side note, I figured out what happened with the Wikidata search. It has a series of fallback languages. For Slovak it will also search the Czech and English labels and descriptions (with much lower weight), so when neither Slovak nor Czech matches Sibenik, English does.)

If we can find each other at the Hackathon again, I can demo the ULS (so you can try typing in Slovak on my computer) and I can show you how to find the Wikidata language fallbacks (short version: add &cirrusDumpQuery to the search URL).

I'm still working on gathering data and doing the analysis. I've had some computer problems and I have an end-of-month project I have to focus on this week.

However, there's some good news: @Amire80 recently released a bunch of updates to the Universal Language Selector for African languages. I'm not very familiar with the ULS, so I didn't realize it could do multi-character substitutions. For example, with the "Bambara tilde" keyboard you can type as a~/, as o~^, and as c~v. ä can be typed as a~: in other keyboards (like Sängö).

It seems like it would be very straightforward to create an appropriate "tilde keyboard" for Slovak, if that would help—especially if the analysis of searching without diacritics doesn't look good.

Thank you.

I created a note in our Teahouse to ask for comments.

As for Latin chars with diacritics, maybe even a simple iconv/recode-based ASCII translit term normalization (on both the indexing and lookup sides) would be sufficient for Slovak and other Latin-based alphabets:

> echo áâăąäćçčďđëéěęíîĺľłńňöóőôŕřśşšťţüúůűýżźž | iconv -f UTF8 -t ASCII//TRANSLIT
aaaaacccddeeeeiilllnnoooorrsssttuuuuyzzz

> echo áâăąäćçčďđëéěęíîĺľłńňöóőôŕřśşšťţüúůűýżźž | recode -f UTF8..flat
aaaaacccddeeeeiilllnnoo"oorrsssttuuu"uyzzz

Non-Latin scripts (Greek, Cyrillic, ...) and other fancy Unicode blocks should stay unmapped (or be transliterated properly).
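For comparison, here is a minimal Python sketch of the same idea using only the standard library. It is an illustration, not what the search backend does: `fold_latin` is a hypothetical helper that strips combining marks from Latin letters and leaves every other script untouched. Letters with no Unicode decomposition (such as ł or đ) pass through unchanged, unlike in the iconv output above.

```python
import unicodedata


def fold_latin(text: str) -> str:
    """Strip diacritics from Latin letters only; leave other scripts as-is."""
    # Work on precomposed (NFC) characters so each letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        # Only touch characters whose Unicode name marks them as Latin.
        if "LATIN" not in unicodedata.name(ch, ""):
            out.append(ch)
            continue
        # Decompose into base letter + combining marks, then drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    for sample in ["mäso", "Šibenik", "ôsmy", "й"]:
        print(sample, "->", fold_latin(sample))
```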

[Ugh. Accidentally saved. Back in a minute with a proper comment.]

@Jetam2, thanks for starting the Teahouse discussion!

@Teslaton, thanks for that! However, it's not the actual conversion from characters with diacritics to plain ASCII versions that's the problem. It's the question of whether doing the folding is a good idea. Two years ago I worked on T155822, which led to a similar discussion about Swedish å, ä, and ö. The relevant conversation on Phab starts about here: T155822#3098703

The generalization I drew from that was that it makes sense to keep characters in the alphabet of a language distinct when searching in that language, and to fold other characters (which is why English speakers often jump to the idea of folding everything—no diacritics really matter in English).

So, my goal for this task is to see how big an impact this will have by seeing how often words would "collide" because of diacritic stripping. For example, searching for mäso would also find all the results for maso, and vice versa. The question is whether this is a small problem or a big problem.
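A rough sketch of how such collisions could be counted, assuming we already have a list of distinct words from a sample. The helper names (`fold`, `collisions`) are made up for illustration, and the real analysis also has to account for the stemmer, which this ignores.

```python
import unicodedata
from collections import defaultdict


def fold(word: str) -> str:
    """Remove all combining marks (a blunt, language-agnostic fold)."""
    nfd = unicodedata.normalize("NFD", word)
    return "".join(c for c in nfd if not unicodedata.combining(c))


def collisions(words):
    """Return folded form -> sorted distinct originals, for forms with 2+ originals."""
    groups = defaultdict(set)
    for w in words:
        groups[fold(w)].add(w)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    sample = ["mäso", "maso", "voda", "Amalia", "Amália"]
    for folded, originals in collisions(sample).items():
        print(folded, "<-", originals)
```

The share of words that land in a collision group gives a first, crude estimate of whether this is a small problem or a big one.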

Also, I will look at queries to see if it's possible to tell how often people are searching for words without diacritics. Maybe it's a common problem, or maybe it's a rare problem, or maybe we can't tell.

Then we can discuss what to do about it. The obvious potential solutions are stripping the diacritics or enabling a Universal Language Selector keyboard that allows people to type diacritics without having them on their keyboard.

I should be able to work on all this next week.

As far as I know, the most common (and "expected", from the user's point of view) approach when implementing FTI solutions in Slovak-centered projects is to normalize all Latin variants of each char into its plain ASCII equivalent. The resulting side effect of (search) equivalence of words like mäso/maso is a minor one in Slovak (and users can solve this by iterating through search results, where term matches are presented in their original, non-normalized form).

A "perfect" solution would be to index folded variants, but to boost rank of exact (unfolded) matches a bit when sorting search results, but I don't know if this is possible in FTI implementation on skwiki MW backend.

Is this issue unique to Slovak? Isn't it very similar in French, German, Italian, Spanish, and Czech? I imagine that whatever works for these languages, should work for Slovak, too.

It's possible to make a ULS/jquery.ime keyboard for Slovak, but I suspect that it's not quite the solution for whatever is the issue here. As far as I can guess, it's not very difficult to find a keyboard that can type Slovak with all the necessary diacritics on desktop computer and mobile devices, but please correct me if I'm wrong. We usually build ULS/jquery.ime keyboards for languages in which finding such a keyboard is difficult.

@Teslaton:

A "perfect" solution would be to index folded variants, but to boost rank of exact (unfolded) matches a bit when sorting search results, but I don't know if this is possible in FTI implementation on skwiki MW backend.

It's generally not possible to support language-specific or project-specific indexes, but what you suggest is already what we do! You can see the difference on English Wikipedia in the differences in ranking when searching for zoe and zoë.
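To make the idea concrete, here is a toy Python sketch of "match on the folded form, but rank exact matches higher". It is only an illustration of the principle; the `rank` function and the `exact_boost` value are assumptions for the example and say nothing about how the actual search backend scores results.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip combining marks."""
    nfd = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in nfd if not unicodedata.combining(c))


def rank(query: str, titles, exact_boost: float = 0.5):
    """Return titles matching the folded query, exact (unfolded) matches first."""
    q_exact = unicodedata.normalize("NFC", query.lower())
    q_folded = fold(query)
    scored = []
    for title in titles:
        if fold(title) != q_folded:
            continue  # no match even after folding
        score = 1.0
        if unicodedata.normalize("NFC", title.lower()) == q_exact:
            score += exact_boost  # reward the match that keeps the diacritics
        scored.append((score, title))
    # sort() is stable, so equal scores keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return [title for _, title in scored]


if __name__ == "__main__":
    print(rank("zoë", ["Zoe", "Zoë"]))
    print(rank("zoe", ["Zoë", "Zoe"]))
```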

@Amire80: I doubt any issue is truly unique anywhere, but what @Jetam2 told me at the Hackathon was that some Slovak users don't have access to all the Slovak diacritics at least some of the time.

If you don't touch type, having a keyboard mapping on a desktop/laptop doesn't solve the problem.

Isn't it very similar in French, German, Italian, Spanish, and Czech?

I thought so, but maybe not. Swedish speakers seemed to very much want to keep the diacritical characters distinct (T155822).

The generalization I drew from [Swedish] was that it makes sense to keep characters in the alphabet of a language distinct when searching in that language, and to fold other characters

But @Teslaton says above that that's not the expectation in Slovak:

in Slovak-centered projects is to normalize all Latin variants of each char into its plain ASCII equivalent

There's a similar request for Serbian: T138858.

It's possible to make a ULS/jquery.ime keyboard for Slovak, but I suspect that it's not quite the solution for whatever is the issue here.

I can't help but worry about unwanted word collisions, so my plan was to review a sample of Wikipedia and Wiktionary articles and see how bad the collisions seem. If they are unacceptable, then enabling a ULS keyboard would be a fallback plan.

This took a little longer than I expected because the changes were much bigger than I expected, and it's been a while since I looked at this kind of relative change (rather than implementing a whole new stemmer, say), so the tools needed a little updating.

My write-up so far is on MediaWiki. A summary, links to specific parts, and a request for help are below.

  • It does indeed look like Slovak searchers search for words with and without diacritics (examples from the names of recent presidential candidates show lots of variation).
  • Enabling folding for Slovak letters (Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž) interacts with the stemmer, and a lot of related forms of words are no longer stemmed correctly, because stemming occurs after folding. The suffixes -ách, -aný, -ého, -ému, -í, -ú, -ých, -ým, and -ými in particular do not fare well.
  • Obvious next steps are to move folding after stemming (easier) or to try to make the stemmer able to process words without diacritics (harder).
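The ordering problem can be shown with a deliberately tiny example. The "stemmer" below is a toy that only strips the suffixes listed above; it is not the real Slovak stemmer, and the function names are made up for this sketch.

```python
import unicodedata

# Suffixes named in the summary above; the real stemmer handles far more.
SUFFIXES = sorted(
    ["ách", "aný", "ého", "ému", "í", "ú", "ých", "ým", "ými"],
    key=len,
    reverse=True,  # try longer suffixes first, so "ými" wins over "ým"
)


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def fold(word: str) -> str:
    """Strip combining marks."""
    nfd = unicodedata.normalize("NFD", word)
    return "".join(c for c in nfd if not unicodedata.combining(c))


def fold_then_stem(word: str) -> str:
    return toy_stem(fold(word))


def stem_then_fold(word: str) -> str:
    return fold(toy_stem(word))


if __name__ == "__main__":
    for w in ["dobrých", "dobrého", "dobrými"]:
        print(w, fold_then_stem(w), stem_then_fold(w))
```

With folding first, the suffixes lose their diacritics before the toy stemmer sees them, so nothing is stripped and the three forms stay apart. With stemming first, all three reduce to the same stem and are indexed together.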

However, I'm going to be out of the office for the next two weeks (I'll be back the first week of July).

So, it would be great if Slovak speakers (including @Jetam2 and @Teslaton—thanks!) could review the examples in the write up and identify any additional problems and verify that other changes are desired. You can leave comments here, or on the Discussion page for the write up.

There are about 100 groups of words showing before-and-after changes to the groups of words that will be indexed together.

They are divided into 3 main sections, to help organize the data and put similar cases together:

  • Groups that lost words—these are mostly because of the suffixes -ách, -aný, -ého, -ému, -í, -ú, -ých, -ým, and -ými not being seen by the stemmer.
  • Groups that gained words—these are mostly the expected changes, like Amalia and Amália being grouped together, though there are some potential odd cases.
  • Groups that gained and lost words—these are a combination of the two above; I moved them to the end because they are more complicated to read through.

Each section is divided into 3 sub-groups:

  • A random sample—this is the most representative sample.
  • "High-Impact" groups—groups that lost or gained a lot of words (10 or more); these are more likely to be problem cases, since lots of ambiguity is being introduced.
  • Groups with high-frequency words—groups with words that occur 1000 times or more in the sample. These are also likely to be problem cases, because grouping really common words with rarer words tends to swamp the rarer words.
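For readers who want to see how the "lost", "gained", and "mixed" labels can be derived, here is a small Python sketch. It assumes two analyzer functions (old and new) that map a word to its index key; `compare_groupings` and the toy analyzers in the demo are hypothetical and much simpler than the tooling used for the write-up.

```python
import unicodedata
from collections import defaultdict


def group_words(words, analyze):
    """Map each analysis key to the set of words that share it."""
    groups = defaultdict(set)
    for w in words:
        groups[analyze(w)].add(w)
    return groups


def compare_groupings(words, old_analyze, new_analyze):
    """Label each word by how its group changed between the two analyzers."""
    old = group_words(words, old_analyze)
    new = group_words(words, new_analyze)
    report = {}
    for w in words:
        before = old[old_analyze(w)]
        after = new[new_analyze(w)]
        lost, gained = before - after, after - before
        if lost and gained:
            report[w] = "mixed"
        elif lost:
            report[w] = "lost"
        elif gained:
            report[w] = "gained"
        else:
            report[w] = "unchanged"
    return report


def fold(word: str) -> str:
    """Lowercase and strip combining marks (toy 'new' analyzer for the demo)."""
    nfd = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in nfd if not unicodedata.combining(c))


if __name__ == "__main__":
    words = ["Amalia", "Amália", "voda"]
    print(compare_groupings(words, str.lower, fold))
```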

My hope is that the groups that gained words mostly look good, and we can fix most of the lost words by moving folding to after stemming. We'll see in the first week of July.

Thank you for looking into this. The way I see it, the key sentence is this one: "So, clearly Slovak searchers are expecting diacriticless searches to get results, contrary to the expectations of the Swedish searchers." Let's wait for more info in July.

Just to add more anecdotal searches. In sk orthography, there is often the question of whether to use y or i/ý or í. I searched for "anýz" and found a lot of Andy but not "aníz" that I was looking for. This i or y should also be addressed. Thank you.

Sorry for the delay getting back to this. In addition to my planned two weeks away from the office I had another unexpected week away. I've been catching up on everything this week, and I'm back to working on this now.

Thank you for looking into this. The way I see it, the key sentence is this one: "So, clearly Slovak searchers are expecting diacriticless searches to get results, contrary to the expectations of the Swedish searchers." Let's wait for more info in July.

I agree that we want to implement the diacriticless search as long as the results are good.

I think the first attempt, documented above, does more harm than good by blocking the proper stemming of -ách, -aný, -ého, -ému, -í, -ú, -ých, -ým, and -ými suffixes.

Can you review the groups in the sections labeled "Speaker Review" and verify the following?

  • The "Folding Groups that Lost Members" changes are generally bad.
  • The "Folding Groups that Gained Members" changes are generally good—though I'm most concerned about the stal, pol, co, ked, and su groups.
  • The "Folding Groups that Lost and Gained (Mixed) Members" are okay, other than problems with the previously mentioned suffix list. I'm particularly concerned about the "High-Frequency" groups.

I'll work on generating similar data with stemming moved before folding, but having a speaker assessment of the currently available groups would be helpful in focusing the next round of analysis.

Just to add more anecdotal searches. In sk orthography, there is often the question of whether to use y or i/ý or í. I searched for "anýz" and found a lot of Andy but not "aníz" that I was looking for. This i or y should also be addressed. Thank you.

I'd like to settle the diacriticless searching issue before addressing additional complications. I'd also like some more info, such as whether this happens everywhere or just in specific contexts. Do people have a problem with these letters in suffixes like -ých or -ý? Is it only in short words? Does it happen with the first letter of the word, too? Is the confusion more likely to go in one direction (like, people frequently use i when they should use y, but only rarely do it in the other direction)? More examples in general would be useful.

Also, note that the andy results you got are based on a suggestion, because anýz gets 0 results. You can see the anýz-only results, which have only the andy suggestion.

I've completed my analysis for stemming before folding, and it definitely looks much better. The new, mostly desirable merges are roughly the same, without preventing the stemmer from doing its job. The stemmer still needs an update, though. (T227924: Improve Slovak Stemmer)

It still needs speaker review before we can deploy it and re-index. @Jetam2, @Teslaton, can you help with that? Let me know if the task needs more explanation. Comments here or on the discussion page would be great.

Having some trouble with the speaker review being unclear, so I'm working on better generic documentation I can use for this task and in the future.

Okay, I've made the first pass at writing speaker review documentation that can be transcluded into the notes for a particular language.

The part of the Slovak notes that needs speaker review starts here.

The full speaker review notes are here.

Moving this to "Waiting" for a while, to see if we get any feedback from Slovak speakers. If not, we will consider options for next steps, which include pushing forward with this with non-native speaker review, maintaining the status quo, and rolling back the Slovak stemmer.

I got some great feedback from @Jetam2, and everything is looking good. I just need a few clarifications. If everything is still good, then we'll be ready to deploy the new analysis chain and then re-index.

I've updated my notes on MediaWiki with the final round of review. We're now ready to create a patch to implement the folding: T235561: Implement folding of Slovak diacritics for Slovak-language wikis

After that we'll need to re-index Slovak-language wikis.

Changes have been merged and re-indexing is in progress, so I'm closing this ticket.