de.wikipedia: search for "Bedusz" does not find "Będusz"
Closed, Resolved · Public

Authored By

	JStrodt_WMDE
	Jun 28 2019, 9:57 AM

Description

Hello,

I'm forwarding this issue from German Wikipedia:

I search for "Bedusz" on German Wikipedia.
I expect Gromada Będusz as a result, but it's not in the search results.

Apparently, the search on dewiki doesn't translate "ę" to "e". Having looked into the issue a little bit with @awight, it seems ICU folding could help here.

Is ICU folding in fact the right approach? What's the process to activate it? And what are the risks in activating it?

Thanks in advance,
Johanna

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T281379 Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
Resolved	TJones	T284185 Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions
Resolved	TJones	T226812 de.wikipedia: search for "Bedusz" does not find "Będusz"

Event Timeline

JStrodt_WMDE created this task. Jun 28 2019, 9:57 AM

Restricted Application added a subscriber: Aklapper. Jun 28 2019, 9:57 AM

potentially related task: T226241

Reedy removed TJones as the assignee of this task. Jun 28 2019, 10:05 AM

Reedy added projects: Discovery-Search, MediaWiki-Search.

Reedy added a project: Advanced-Search.

Reedy edited subscribers, added: TJones; removed: Advanced-Search, Discovery-ARCHIVED.

Restricted Application added a project: TCB-Team (now WMDE-TechWish). Jun 28 2019, 10:05 AM

Just wanted to mention here that this also does not work when AdvancedSearch is "disabled" by deactivating JS. So this is probably not a problem specific to the AdvancedSearch UI; it probably lies further down.

In T226812#5291814, @WMDE-Fisch wrote:

Just wanted to mention here that this also does not work when AdvancedSearch is "disabled" by deactivating JS. So this is probably not a problem specific to the AdvancedSearch UI; it probably lies further down.

That's my understanding, also. It seems to be part of the CirrusSearch configuration, and the basic idea is that we can turn on "folding", meaning that any non-ASCII characters with funny lines dangling around them will be searchable as the nátǐve spelling, or also as a plain spelling without the diacritic dongles. However, this might be annoying to some native speakers, for example being swamped with results for "hoho" when you really meant "höhö", so there is additional configuration in this file where we can make exceptions for letters that should not be folded.
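To make the idea concrete, here is a minimal Python sketch of folding with exceptions. It is not ICU folding and not what CirrusSearch runs: it only strips combining marks after canonical decomposition, so letters without a decomposition (such as "ł") pass through unchanged. The function name `fold` and the `keep` parameter are made up for this illustration.

```python
import unicodedata


def fold(text, keep=""):
    """Lowercase text and strip diacritics, except for characters listed in keep."""
    out = []
    # Lowercase first so that keep only needs to list lowercase letters.
    for ch in unicodedata.normalize("NFC", text.lower()):
        if ch in keep:
            out.append(ch)  # exempted letter: leave it untouched
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop combining marks (ogonek, diaeresis, acute, ...), keep base letters.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold("Będusz"))            # folded so it matches a search for "bedusz"
print(fold("höhö"))              # without exceptions, collides with "hoho"
print(fold("höhö", keep="äöü"))  # with exceptions, stays distinct from "hoho"
```

Indexing and querying would both go through the same function, which is why "Bedusz" and "Będusz" would end up matching each other.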

Technically this will be easy to enable, I think the only obstacle is having a community discussion about the benefits and drawbacks.

MediaWiki-Search is not deployed on WMF wikis, hence removing that tag.

@JStrodt_WMDE wrote:

Is ICU folding in fact the right approach?

Very likely so!

@awight wrote:

Technically this will be easy to enable, I think the only obstacle is having a community discussion about the benefits and drawbacks.

It is, or can be, a little more complicated than that. Right now German-language wikis are configured with the [[ https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#german-analyzer | german ]] "monolithic" analyzer, which does everything all in one piece, but is not customizable.

Since it is an Elasticsearch analyzer it can be unpacked into its components (not all third-party analyzers can be), and those components can be customized. I've done this many times, and it usually goes well. However, there are some automatic "upgrades" that get applied to unpacked analyzers, like converting the lowercase filter to ICU normalization (which is a less aggressive version of ICU folding). Sometimes the upgrades, or the changes you want to make, can interact with other parts of the analysis chain in unexpected and undesirable ways.

A common problem is that replacing lowercasing with ICU normalization breaks stemming because the diacritics were important. Sometimes stemming interacts weirdly with ICU folding, and a perfectly obvious inflection doesn't get stemmed because of the presence of "foreign" diacritics. So, enabling and ordering ICU normalization and ICU folding can be a bit of an art, depending on the stemmer. I usually exempt any characters in the alphabet of the language from ICU folding in the file Adam mentioned—though in the case of Slovak, the community doesn't seem to want that (we're still exploring the repercussions of disabling it).
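As a rough illustration of what an unpacked chain with folding exemptions might look like, here is a sketch of Elasticsearch index settings written as a Python dict. It follows the general shape of the rebuilt `german` analyzer in the Elasticsearch documentation, with `icu_normalizer` in place of `lowercase` and an `icu_folding` filter restricted by `unicode_set_filter`. The names `unpacked_german` and `german_icu_folding`, the exemption list, and the position of folding in the chain are assumptions for this example, not the configuration CirrusSearch actually generates. Note also that the built-in `german_normalization` step has its own handling of umlauts and "ß", so the exemption list here mainly illustrates the mechanism.

```python
import json

# Letters to exempt from folding (an assumption for this sketch).
GERMAN_EXEMPT = "ÄäÖöÜüß"

settings = {
    "analysis": {
        "filter": {
            "german_stop": {"type": "stop", "stopwords": "_german_"},
            "german_stemmer": {"type": "stemmer", "language": "light_german"},
            # Hypothetical filter name; icu_folding needs the analysis-icu plugin.
            "german_icu_folding": {
                "type": "icu_folding",
                "unicode_set_filter": f"[^{GERMAN_EXEMPT}]",
            },
        },
        "analyzer": {
            # Hypothetical analyzer name for the unpacked chain.
            "unpacked_german": {
                "tokenizer": "standard",
                "filter": [
                    "icu_normalizer",        # stands in for "lowercase"
                    "german_stop",
                    "german_normalization",
                    "german_stemmer",
                    "german_icu_folding",    # placement is a judgment call
                ],
            }
        },
    }
}

print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Putting folding after the stemmer is one way to keep diacritics available to the stemmer; whether that is the right order depends on the stemmer, as described above.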

My usual process is to run a moderately sized sample from Wikipedia and Wiktionary (the body of 10K items from each) and look for potential problems—words that are unexpectedly indexed together that weren't before, or vice versa. When I have the obvious issues under control, I solicit feedback from speakers of the language on some random samples of changes, plus certain specific changes that are the most likely to be problematic. If it all looks good, then we can enable the changes and re-index to make them live.
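The comparison step in that process can be sketched in a few lines of Python: group a sample of words by their analyzed form under the old and the new analysis, then list the groups that only exist under the new one. The two "analyzers" below are toy stand-ins (plain lowercasing versus lowercasing plus stripping combining marks), and the sample words are invented for the example. This is not the actual tooling used for the review.

```python
import unicodedata
from collections import defaultdict


def strip_marks(word):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def groups(words, analyze):
    """Return the sets of distinct words that share one analyzed form."""
    buckets = defaultdict(set)
    for w in words:
        buckets[analyze(w)].add(w)
    return {frozenset(g) for g in buckets.values() if len(g) > 1}


def new_collisions(words, old, new):
    """Groups of words indexed together under new but not under old."""
    return groups(words, new) - groups(words, old)


def old_analyze(word):
    return word.lower()


def new_analyze(word):
    return strip_marks(word).lower()


sample = ["Będusz", "Bedusz", "bedusz", "schon", "schön"]
for group in new_collisions(sample, old_analyze, new_analyze):
    print(sorted(group))
```

On this sample the report contains the wanted merge ("Będusz" with "Bedusz") and a questionable one ("schon" with "schön"), which is the kind of case that would go to speakers of the language for feedback.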

I've moved this to the Language Stuff column on our work board, and I'll prioritize it in the list. I probably won't get to it terribly soon, but I may group it with some other German-language issues and move them up the list a bit.

Also of note, this is related to T104814: Appropriately ignore diacritics for German-language wikis, though the exact details do not overlap—but it would make sense to work on them together, and maybe bring T87136: ~"daß" should not match "dass" into the mix if it's now fixable.

EBernhardson triaged this task as Medium priority. Jul 11 2019, 5:09 PM

thiemowmde removed a project: TCB-Team (now WMDE-TechWish). Nov 2 2020, 9:28 AM

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers. Mar 17 2021, 9:09 PM

After T281379 is deployed and T284185 is complete, recheck this ticket. I believe it should be fixed.

TJones mentioned this in T219550: [EPIC] Harmonize language analysis across languages. Jun 3 2021, 10:15 PM

Gehel added a parent task: T284185: Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions. Jun 9 2021, 3:09 PM

TJones claimed this task. Jun 22 2021, 4:04 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

This was fixed as part of unpacking the German analyzer and enabling ICU normalization/folding in T281379, and made live by T284185.

TJones mentioned this in T104814: Appropriately ignore diacritics for German-language wikis. Jun 22 2021, 4:28 PM

Gehel closed this task as Resolved. Jul 26 2021, 12:18 PM
