
Search term entered without diacritics on Czech Wikipedia does not list expected match
Open, High, Public

Description

Steps to reproduce:

  1. Go to https://cs.wikipedia.org/ (logged in; Firefox 93)
  2. In the search bar, enter burcak (the term Burčák exists, but I might not have a keyboard that makes it easy to type diacritics)
  3. Look at the proposed autocomplete results
  4. Click the Hledat (Search) button

Actual outcomes after step 3 and step 4:

Screenshot from 2021-10-14 19-36-53.png (975×971 px, 257 KB)

Screenshot from 2021-10-14 19-37-09.png (975×1 px, 159 KB)

Expected outcome:
The existing page https://cs.wikipedia.org/wiki/Burčák is listed so that I can select it.

Event Timeline

MPhamWMF triaged this task as Medium priority. Oct 18 2021, 3:29 PM
MPhamWMF moved this task from needs triage to Language Stuff on the Discovery-Search board.

The usual heuristic is to not fold letters that are part of the alphabet for a given language. For Czech, the list of letters not to fold is: Áá Čč Ďď Éé Ěě Íí Ňň Óó Řř Šš Ťť Úú Ůů Ýý Žž
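To make the heuristic concrete, here is a minimal Python sketch of "fold everything except the language's own letters". It is not the actual search configuration; the function name `fold` and the `keep` parameter are made up for illustration, and only the standard `unicodedata` module is used.

```python
import unicodedata

# Letters listed above as part of the Czech alphabet (left unfolded).
CZECH_KEEP = frozenset(
    unicodedata.normalize("NFC", "ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž")
)


def fold(text, keep=frozenset()):
    """Strip combining marks from every character that is not in `keep`."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # protected letter: keep its diacritic
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold("Burčák", CZECH_KEEP))  # Burčák (č and á are protected)
print(fold("Burčák"))              # Burcak (nothing protected)
print(fold("naïve", CZECH_KEEP))   # naive  (ï is not a Czech letter)
```

With the Czech letters protected, the indexed form of Burčák keeps its diacritics, so a query for burcak has nothing to match against. That is consistent with the behaviour reported in this task.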

However, for Slovak, we did end up folding everything (see T223787), though it initially caused problems for the stemmer. The problems were fixable by changing the order of the stemming and folding, but that's why we needed testing.
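As a rough illustration of why the order of stemming and folding can matter, here is a toy Python sketch. The one-rule stemmer below is entirely hypothetical (it is not the Slovak or Czech stemmer, and the rule is made up for the example), and the sketch does not say which order was chosen in T223787.

```python
import unicodedata


def fold(text):
    """Strip all combining marks (full diacritic folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(word):
    """Hypothetical stemmer with a single rule: strip a trailing 'ů'."""
    suffix = unicodedata.normalize("NFC", "ů")
    word = unicodedata.normalize("NFC", word)
    if len(word) > 3 and word.endswith(suffix):
        return word[: -len(suffix)]
    return word


def stem_then_fold(word):
    return fold(toy_stem(word))


def fold_then_stem(word):
    return toy_stem(fold(word))


print(stem_then_fold("hradů"))  # hrad  (the suffix rule still sees 'ů')
print(fold_then_stem("hradů"))  # hradu (folding hid the suffix from the rule)
```

A stemmer whose rules are written against letters with diacritics stops recognizing them once folding has run, so the two pipelines can produce different index terms for the same word. That is the kind of interaction that needs testing.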

So, @Aklapper, you are saying that Czech searchers also expect diacritics not to matter when searching, right? If so, would you be able to help out with reviewing the changes when we get around to working on this?

I don't dare to speak on behalf of Czech searchers. :) When I have a keyboard layout available that has input support for Czech diacritics I am going to use that, but that's not always the case (e.g. using keyboards that are not mine when travelling, etc). Nothing important, I was just...surprised when I saw the search results.

> I don't dare to speak on behalf of Czech searchers. :)

That's fair! But do you commonly observe search without diacritics matching words with diacritics? I assume so, since you expected it this time. Does Google do that? Do online retailers do that? If it's very common in other search products, especially the most popular ones, then it's a reasonable assumption that people expect it and we should try to support it. I can also go mining for examples in the query logs, but having a reasonable idea of what people likely expect is useful, too.

(The other option is that you travel too much without your own keyboard and it's just you—though that seems unlikely. ;)

> do you commonly observe search without diacritics matching words with diacritics? I assume so, since you expected it this time. Does Google do that?

If I go for https://www.google.com/search?hl=en&q=burcak the first link is https://cs.wikipedia.org/wiki/Burčák

> Do online retailers do that?

Randomly using e.g. alza.cz, entering "stredni" shows results for "Střední". Again, I don't dare to generalize... Shrug. :)

Thanks, Andre! It seems like a reasonable thing to look for evidence in the query logs and assess the impact on stemming and search results.

Yes, this behaviour is a bit annoying.
https://cs.wikipedia.org/w/index.php?title=Wikipedie:Pod_l%C3%ADpou_(technika)&oldid=21598450#Vyhled%C3%A1v%C3%A1n%C3%AD_%E2%80%9Ebez_hacku_a_carek%E2%80%9C

  • If I query for "jetrichov", the search gives me Jetřichovice first and Jetřichov second (I would expect the reverse order).
  • If I query for "kosire", the search gives me Kosice (r>c), but no mention of Košíře.
  • If I add a namespace prefix (e.g. Šablona = Template) and query for "sablona:infobox - sidlo" or "sablona:pahyl", I would expect e.g. Šablona:Infobox - sídlo světa or Šablona:Pahýl respectively, but the search engine gives me nothing.
  • If I query for "ceske budejovice", it gives me nothing. I would expect České Budějovice.

It doesn't work at Special:Search either.

Search should be able to treat these letters as equivalent: Á > A, Č > C, Ď > D, É, Ě > E, Í > I, Ň > N, Ó > O, Ř > R, Š > S, Ť > T, Ú, Ů > U, Ý > Y, Ž > Z (with preference for the letters with diacritics: if I write "kosice", the first result should be Kosice and then Košice).
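The requested behaviour can be sketched in a few lines of Python: match on a diacritic-folded key, then rank titles that equal the query as typed ahead of the folded-only matches, following the "kosice" example. The function names and the tiny title list are assumptions for illustration, not how the search backend is implemented.

```python
import unicodedata


def fold(text):
    """Strip combining marks and ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def search(query, titles):
    """Return titles matching the folded query, exact spellings first."""
    key = fold(query)
    hits = [t for t in titles if fold(t) == key]
    exact = unicodedata.normalize("NFC", query).casefold()
    # False sorts before True, so titles equal to the query as typed come first.
    return sorted(
        hits, key=lambda t: unicodedata.normalize("NFC", t).casefold() != exact
    )


TITLES = ["Košice", "Kosice", "Košíře", "České Budějovice"]

print(search("kosice", TITLES))            # ['Kosice', 'Košice']
print(search("kosire", TITLES))            # ['Košíře']
print(search("ceske budejovice", TITLES))  # ['České Budějovice']
```

This sketch only does whole-title matching; prefix completion and namespace prefixes such as "sablona:" would need the same folding applied to those parts as well.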

All Czech-language search engines (Google, Seznam, common e-shop/database searches) do it this way.

> So, @Aklapper, you are saying that Czech searchers also expect diacritics not to matter when searching, right? If so, would you be able to help out with reviewing the changes when we get around to working on this?

I agree that making diacritics not matter will make the results better (as @Draceane notes). However, I'd like to note that there are words that differ from each other only by diacritics. It's important to be able to reach articles about all of those terms, if such articles exist. My favorite example of this is the following three words:

  • Vláda (meaning: government)
  • Vláďa (meaning: a pet form of the male given names Vladimír and Vladislav)
  • Vlada (meaning: a female given name)
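To illustrate the concern, here is a small Python sketch that groups titles by their diacritic-folded key. The helper names are hypothetical; the point is only that all three words above collapse to the same key, so whatever is built on folding has to return the whole group rather than a single title.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Strip combining marks and ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def group_by_folded(titles):
    """Map each folded key to every title that collapses onto it."""
    groups = defaultdict(list)
    for title in titles:
        groups[fold(title)].append(title)
    return dict(groups)


groups = group_by_folded(["Vláda", "Vláďa", "Vlada", "Burčák"])
print(groups["vlada"])   # ['Vláda', 'Vláďa', 'Vlada']
print(groups["burcak"])  # ['Burčák']
```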

Is there any progress with this task?

All and any progress can be found in the corresponding task, thus no.

TJones raised the priority of this task from Medium to High. Oct 6 2023, 6:05 PM