
ÿ in Spécial:IndexPages search
Closed, Resolved · Public

Event Timeline

Yann created this task. Jul 24 2016, 9:34 AM
Restricted Application added a subscriber: Aklapper. Jul 24 2016, 9:34 AM
Restricted Application added projects: Discovery, Discovery-Search. Jul 24 2016, 10:54 AM

I don't understand what you are trying to do, @Yann. It might be worth going back and explaining step by step what you are looking to achieve.

Seudo added a subscriber: Seudo. Edited Jul 24 2016, 4:06 PM

Here is more information. Special:IndexPages on the French Wikisource treats accented and unaccented characters as equivalent when searching. For example, searches for "intitle:Molière" and "intitle:Moliere" seem to produce the same results, which is good.

But this does not work for "ÿ", which should be treated as "y" when searching. The search should correctly handle all accented letters in (at least) the Latin-1 Supplement Unicode block.
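
To make the expected behavior concrete, here is a minimal Python sketch of accent folding using only the standard library. It is an illustration of the idea, not how the search backend implements it: the helper name `fold_accents` is made up for this example, and the approach (Unicode NFD decomposition, then dropping combining marks) does not cover letters without a canonical decomposition, such as "œ" or "æ".

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks.

    Hypothetical helper for illustration only; a search engine's folding
    filter may use a different, table-driven mapping.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for title in ["Molière", "Louÿs"]:
        print(title, "->", fold_accents(title))
```

With folding like this applied to both the indexed title and the query, "Louÿs" and "Louys" would compare equal, just as "Molière" and "Moliere" already do.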

(I don't know if this is a Wikimedia issue or if it should be handled by someone in the French Wikisource.)

debt triaged this task as Low priority. Jul 28 2016, 10:21 PM
debt added a subscriber: debt.

Triaging as low as we're waiting for more information at this point.

jayvdb raised the priority of this task from Low to Needs Triage. Jul 29 2016, 1:58 AM
jayvdb added a subscriber: jayvdb.

"Triaging as low as we're waiting for more information at this point."

What more info, and from whom?

@debt, I believe that this has been clarified.

  • association to be made between y and ÿ

and I think there is an indication that each Roman letter [A-Za-z] should be associated with every variant of the same letter carrying a grave, acute, macron, ...

I think intitle is a bit restrictive and appears to disable accent folding.
I'd suggest investigating further while working on T137830.

FTR: I don't think this is related to icu/asciifolding.
asciifolding is properly enabled and used with intitle.

The problem is subtle: we apply asciifolding_preserve after kstem, and kstem appears to ignore terms with diacritics:
At index time: Louÿs => (kstem) => Louÿs => (asciifolding_preserve) => Louÿs|Louys
At query time: Louys => (kstem) => Louy => (asciifolding_preserve) => Louy

Louy will never match any of the terms generated at index time.

  1. Solution would be to move asciifolding before kstem (reindex needed).
  2. Or to include title.plain in the filter (hackish but no reindex needed)
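
The ordering problem and the first proposed solution can be sketched in Python. This is a toy model under loud assumptions: `toy_stem` is a hypothetical stand-in that only mimics the behavior reported in this comment (it strips a trailing "s" and leaves non-ASCII terms untouched), not the real kstem algorithm, and `folding_preserve` only imitates the "keep the original plus the folded form" behavior of asciifolding_preserve.

```python
import unicodedata


def fold(term: str) -> str:
    # Drop combining marks after NFD decomposition (e.g. "ÿ" -> "y").
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(term: str) -> str:
    # Hypothetical stand-in for kstem: strips a plural-looking "s",
    # but leaves any term containing non-ASCII characters untouched.
    if term.isascii() and len(term) > 3 and term.endswith("s"):
        return term[:-1]
    return term


def folding_preserve(terms):
    # Emit each term plus its folded form, without duplicates.
    out = []
    for term in terms:
        for candidate in (term, fold(term)):
            if candidate not in out:
                out.append(candidate)
    return out


def stem_then_fold(term: str):
    # Ordering described in the comment: kstem first, folding second.
    return folding_preserve([toy_stem(term)])


def fold_then_stem(term: str):
    # Proposed ordering (solution 1): folding first, stemming second.
    out = []
    for candidate in folding_preserve([term]):
        stemmed = toy_stem(candidate)
        if stemmed not in out:
            out.append(stemmed)
    return out


if __name__ == "__main__":
    for chain in (stem_then_fold, fold_then_stem):
        indexed, queried = chain("Louÿs"), chain("Louys")
        shared = set(indexed) & set(queried)
        print(chain.__name__, indexed, queried, "match" if shared else "no match")
```

In this toy model, stemming first leaves the index with Louÿs and Louys while the query becomes Louy, so nothing matches; folding first puts Louy on both sides.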

debt triaged this task as Medium priority. Aug 4 2016, 5:17 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.

We'll look at this along with T137830

I overlooked the wiki mentioned in this task; the previous comment (kstem and asciifolding filter ordering) applies to English wikis.
In this case, the analysis config for French does not include any asciifolding for the stem field. We should investigate adding asciifolding/icufolding to the French analysis chain.
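
As a rough illustration of what adding folding to a French chain could look like, the Python sketch below builds Elasticsearch index settings for a custom analyzer modeled on the documented "rebuilt" French analyzer, with the built-in `asciifolding` token filter added. Everything specific here is an assumption for illustration: the analyzer name `french_folded`, the shortened elision article list, and the position of `asciifolding` in the chain. It is not the configuration CirrusSearch uses.

```python
import json


def french_folding_settings() -> dict:
    """Return illustrative index settings; names and filter order are assumptions."""
    return {
        "analysis": {
            "filter": {
                "french_elision": {
                    "type": "elision",
                    "articles_case": True,
                    # Shortened list for illustration only.
                    "articles": ["l", "d", "j", "qu"],
                },
                "french_stop": {"type": "stop", "stopwords": "_french_"},
                "french_stemmer": {"type": "stemmer", "language": "light_french"},
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "french_folded": {
                    "tokenizer": "standard",
                    "filter": [
                        "french_elision",
                        "lowercase",
                        "french_stop",
                        # Assumed placement: fold before stemming.
                        "asciifolding",
                        "french_stemmer",
                    ],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(french_folding_settings(), indent=2, ensure_ascii=False))
```

Where the folding filter should sit relative to the stop and stemmer filters is a design choice that would need testing against real French text; the order shown is only one option.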

TJones added a subscriber: TJones. Aug 12 2016, 7:27 PM

I'm looking into this as part of T142620, and I've discovered some interesting things about Elasticsearch's default French analysis chain. There is ascii-folding for some characters (á â à é ê è î ô û ù ç), but not others (ä ë í ï ì ó ö ò ú ü ÿ œ æ). The tréma/umlaut/diaeresis doesn't ever get folded, and that's causing trouble for Louÿs.

I believe this is fixed as a result of T144429. My local vagrant/mediawiki language is set to French. I created a new page "Pierre Louÿs". It comes back when I search for any of the following:

  • louÿs
  • louys
  • intitle:louÿs
  • intitle:louys

Please note that this will not immediately fix the problem in Wikisource because the wiki needs to be re-indexed for the changes to take effect. Fortunately, we're planning a re-index soon for BM25, so this should go live by the end of the quarter (Sept 30, 2016) if there are no delays.

debt closed this task as Resolved. Sep 16 2016, 6:41 PM
Billinghurst moved this task from Backlog to Done on the ProofreadPage board. Mar 5 2017, 2:05 AM