de.wikipedia: search for "Bedusz" does not find "Będusz"
Closed, Resolved, Public

Description

Hello,

I'm forwarding this issue from German Wikipedia:

  1. I search for "Bedusz" on German Wikipedia.
  2. I expect Gromada Będusz as a result, but it's not in the search results.

Apparently, the search on dewiki doesn't translate "ę" to "e". Having looked into the issue a little bit with @awight, it seems ICU folding could help here.

Is ICU folding in fact the right approach? What's the process to activate it? And what are the risks in activating it?

Thanks in advance,
Johanna

Event Timeline

Reedy added a project: Advanced-Search.
Reedy edited subscribers, added: TJones; removed: Advanced-Search, Discovery-ARCHIVED.

Just wanted to mention here that this also does not work when AdvancedSearch is "disabled" by deactivating JS. So this is probably not a problem specific to the AdvancedSearch UI; it probably lies further down the stack.

That's my understanding, also. It seems to be part of the CirrusSearch configuration, and the basic idea is that we can turn on "folding", meaning that any non-ASCII characters with funny lines dangling around them will be searchable as the nátǐve spelling, or also as a plain spelling without the diacritic dongles. However, this might be annoying to some native speakers, for example being swamped with results for "hoho" when you really meant "höhö", so there is additional configuration in this file where we can make exceptions for letters that should not be folded.
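To make the idea of folding with exceptions concrete, here is a minimal Python sketch. It is only a rough stand-in for ICU folding (which handles far more than combining marks, and is not what CirrusSearch runs); the function name `fold` and the `keep` parameter are made up for illustration.

```python
import unicodedata


def fold(text, keep=""):
    """Strip diacritics from text, leaving any character listed in `keep` untouched.

    Rough approximation only: it removes combining marks after NFD
    decomposition. Letters without a decomposition (such as "ß") pass
    through unchanged.
    """
    # Work on precomposed (NFC) characters so membership in `keep` is reliable.
    text = unicodedata.normalize("NFC", text)
    keep = unicodedata.normalize("NFC", keep)
    out = []
    for ch in text:
        if ch in keep:
            # Exempted letter: index it exactly as written.
            out.append(ch)
            continue
        # NFD splits a letter like "ę" into "e" plus a combining ogonek.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold("Będusz"))                # Bedusz
print(fold("höhö"))                  # hoho
print(fold("höhö", keep="äöüÄÖÜ"))   # höhö
```

With the German umlauts exempted, a search for "Bedusz" would match "Będusz", while "hoho" and "höhö" would stay distinct.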

Technically this will be easy to enable, I think the only obstacle is having a community discussion about the benefits and drawbacks.

MediaWiki-Search is not deployed on WMF wikis, hence removing that tag.

@JStrodt_WMDE wrote:

Is ICU folding in fact the right approach?

Very likely so!

@awight wrote:

Technically this will be easy to enable, I think the only obstacle is having a community discussion about the benefits and drawbacks.

It is, or can be, a little more complicated than that. Right now German-language wikis are configured with the [[ https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#german-analyzer | german ]] "monolithic" analyzer, which does everything all in one piece, but is not customizable.

Since it is an Elasticsearch analyzer it can be unpacked into its components (not all third-party analyzers can be), and those components can be customized. I've done this many times, and it usually goes well. However, there are some automatic "upgrades" that get applied to unpacked analyzers, like converting the lowercase filter to ICU normalization (which is a less aggressive version of ICU folding). Sometimes the upgrades, or the changes you want to make, can interact with other parts of the analysis chain in unexpected and undesirable ways.
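For illustration, an unpacked German analyzer might look roughly like the settings below, written as a Python dict. This is a sketch based on the rebuilt `german` analyzer in the Elasticsearch documentation, not the actual CirrusSearch configuration. The names `german_unpacked` and `german_folding`, the exempted letters, and the position of the folding filter in the chain are all assumptions, and the optional `keyword_marker` filter is left out. `icu_normalizer` and `icu_folding` require the analysis-icu plugin.

```python
import json

# Assumption: letters of the German alphabet that should not be folded.
GERMAN_EXEMPT = "äöüÄÖÜß"


def unpacked_german_settings(exempt=GERMAN_EXEMPT):
    """Build index settings for a customizable stand-in for the `german` analyzer."""
    return {
        "analysis": {
            "filter": {
                "german_stop": {"type": "stop", "stopwords": "_german_"},
                "german_stemmer": {"type": "stemmer", "language": "light_german"},
                # icu_folding skips every character excluded by unicode_set_filter.
                "german_folding": {
                    "type": "icu_folding",
                    "unicode_set_filter": f"[^{exempt}]",
                },
            },
            "analyzer": {
                "german_unpacked": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "icu_normalizer",  # the "upgrade" that replaces lowercase
                        "german_stop",
                        "german_normalization",
                        "german_stemmer",
                        "german_folding",  # placed after stemming; an assumption
                    ],
                }
            },
        }
    }


print(json.dumps(unpacked_german_settings(), ensure_ascii=False, indent=2))
```

Where the folding filter sits relative to the stemmer is exactly the kind of ordering decision that needs testing against real data.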

A common problem is that replacing lowercasing with ICU normalization breaks stemming because the diacritics were important. Sometimes stemming interacts weirdly with ICU folding, and a perfectly obvious inflection doesn't get stemmed because of the presence of "foreign" diacritics. So, enabling and ordering ICU normalization and ICU folding can be a bit of an art, depending on the stemmer. I usually exempt any characters in the alphabet of the language from ICU folding in the file Adam mentioned—though in the case of Slovak, the community doesn't seem to want that (we're still exploring the repercussions of disabling it).

My usual process is to run a moderately sized sample from Wikipedia and Wiktionary (the body of 10K items from each) and look for potential problems—words that are unexpectedly indexed together that weren't before, or vice versa. When I have the obvious issues under control, I solicit feedback from speakers of the language on some random samples of changes, plus certain specific changes that are the most likely to be problematic. If it all looks good, then we can enable the changes and re-index to make them live.
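The comparison step can be sketched in Python as follows. This is a toy version under stated assumptions, not the tooling actually used: `old_analyze` and `new_analyze` are hypothetical stand-ins for the current and proposed analysis chains, each reduced to a function from a word to its indexed form.

```python
import unicodedata
from collections import defaultdict


def old_analyze(word):
    # Stand-in for the current chain: lowercasing only.
    return unicodedata.normalize("NFC", word).lower()


def new_analyze(word):
    # Stand-in for the proposed chain: lowercasing plus diacritic stripping.
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def new_collisions(words, old, new):
    """Return pairs of words indexed together by `new` but kept apart by `old`."""
    old_key = {w: old(w) for w in words}
    groups = defaultdict(set)
    for w in words:
        groups[new(w)].add(w)
    found = set()
    for group in groups.values():
        members = sorted(group)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if old_key[a] != old_key[b]:
                    found.add((a, b))
    return sorted(found)


sample = ["Bedusz", "Będusz", "bedusz", "hoho", "höhö"]
for pair in new_collisions(sample, old_analyze, new_analyze):
    print(pair)
```

On this sample, the report includes the wanted merge ("Bedusz" with "Będusz") and the unwanted one ("hoho" with "höhö"), which is the sort of case that then goes to speakers for review. The reverse check, words that used to be indexed together and no longer are, follows by swapping the two analyzers.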

I've moved this to the Language Stuff column on our work board, and I'll prioritize it in the list. I probably won't get to it terribly soon, but I may group it with some other German-language issues and move them up the list a bit.

Also of note, this is related to T104814: Appropriately ignore diacritics for German-language wikis, though the exact details do not overlap—but it would make sense to work on them together, and maybe bring T87136: ~"daß" should not match "dass" into the mix if it's now fixable.

After T281379 is deployed and T284185 is complete, recheck this ticket. I believe it should be fixed.

This was fixed as part of unpacking the German analyzer and enabling ICU normalization/folding in T281379 and made live by T284185.