
Full text search does not find article with accented word in dewiki
Closed, Resolved, Public

Description

A full-text search for Eugenie Grandet (https://de.wikipedia.org/w/index.php?search=Eugenie+Grandet&title=Spezial:Suche&profile=default&fulltext=1) does not find the article Eugénie Grandet in dewiki.

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. (Oct 26 2017, 4:20 PM)
Restricted Application added a subscriber: Aklapper.
debt triaged this task as Normal priority. (Oct 26 2017, 5:02 PM)
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt added subscribers: TJones, debt.

@TJones can take a look at this.

@FriedhelmW, can you post a screenshot or more detailed description of what you are seeing that is wrong? Or maybe another example? When I follow the link you provided, the article for Eugénie Grandet is the first result:

Eugénie is not bolded in the title of the result, but the article was found.

Searching in the upper-right search box (the "Go" search) for Eugenie Grandet follows a redirect to the correct page, which isn't helpful here. However, a "Go" search for álbért éínstéín takes me to Albert Einstein, so Go search is treating accented characters correctly for title matches.
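For readers unfamiliar with diacritic folding, here is a minimal sketch of the general idea in Python. It is only an illustration using Unicode decomposition, not the code that the Go search or CirrusSearch actually runs; the function name `fold_diacritics` is made up for this example.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Lowercase text and strip combining marks (hypothetical helper)."""
    # NFD splits an accented letter into base letter + combining mark,
    # e.g. "é" becomes "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop every combining mark, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("álbért éínstéín"))  # albert einstein
# With full folding, the accented and unaccented queries compare equal.
print(fold_diacritics("Eugénie Grandet") == fold_diacritics("Eugenie Grandet"))  # True
```

Under this kind of folding, a query with or without accents maps to the same key, which is consistent with the title-match behavior described above.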

The German language analyzer (which performs stemming and character normalization) for full-text search does treat e's with diacritics (specifically é è ë ê) differently than other vowels with the same diacritics. More details on that are in a comment on T104814.
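To make the effect concrete, here is a hypothetical Python sketch of folding that skips a fixed set of characters. It does not reproduce the German analyzer's implementation (see T104814 for the real details); the function `fold_except` and its `keep` parameter are assumptions made purely for illustration.

```python
import unicodedata


def fold_except(text: str, keep: str = "éèëê") -> str:
    """Fold diacritics except on characters listed in keep (illustrative only)."""
    # Work on composed (NFC) characters so "é" is a single code point
    # that can be compared against the keep set.
    composed = unicodedata.normalize("NFC", text.lower())
    out = []
    for ch in composed:
        if ch in keep:
            # Exempt character: left untouched, so it will not match a plain "e".
            out.append(ch)
        else:
            # Everything else: decompose and drop the combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold_except("álbum"))    # album
print(fold_except("Eugénie"))  # eugénie
print(fold_except("Eugenie") == fold_except("Eugénie"))  # False
```

If the index and the query are normalized this way, Eugenie and Eugénie end up as different terms, which would explain a full-text miss of the kind reported here.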

Hopefully this is either not a problem, as Eugénie Grandet is currently found, or at least it is a variant of T104814.

On 26 Oct 2017 at 19:06, someone created a redirect without the accent. Maybe this explains the different search result. Should we change the documentation, which says that accents are ignored?

D'oh—thanks @FriedhelmW, I didn't check for that. The busy, busy WikiGnomes are always fixing things. So, I'd say that this is a specific example of what's happening with e's in T104814. That ticket is on my list for this year. Is it okay to close this ticket and/or fold it into T104814?

FriedhelmW closed this task as Resolved. (Oct 27 2017, 4:39 PM)
FriedhelmW claimed this task.

Yes, and I will change the documentation.

@FriedhelmW can you point me at the documentation you want to change? If you are referring to "Folds character families. Diacritical folding automatically matches foreign terms" then I agree it should be updated, but please be careful not to make it incorrect in a different way. Diacritical folding is turned on for most languages, though the set of characters that are folded differs from language to language.

Do you want to comment on the other e's that are also not folded correctly? I'll add a note to the other ticket to fix this documentation.