Searching for index in https://fr.wikisource.org/wiki/Auteur:Pierre_Lou%C3%BFs doesn't work. The link is wrong. It should be https://fr.wikisource.org/w/index.php?title=Sp%C3%A9cial%3AIndexPages&limit=100&key=Pierre+intitle%3ALou%C3%BFs&order=quality
Mentioned In
- T147505: [EPIC][Recurring task] CirrusSearch: what is updated during re-indexing
- T142037: Test effect of re-ordering kstem and asciifolding on English Wikipedia
Mentioned Here
- T144429: Commit changes to implement ascii-folding for French
- T142620: Test effect of adding ascii-folding on French Wikipedia
- T137830: Use the icu_folding filter if available instead of asciifolding
Here is more information. Special:IndexPages on the French Wikisource treats accented and unaccented characters as equivalent when searching. For example, searches for "intitle:Molière" and "intitle:Moliere" seem to produce the same results, which is good.
But this behavior does not work for "ÿ", which should be treated as equivalent to "y" when searching. The search should correctly handle all accented letters in (at least) the Latin-1 Supplement Unicode block.
(I don't know if this is a Wikimedia issue or if it should be handled by someone in the French Wikisource.)
FTR: I don't think this is related to icu/asciifolding.
asciifolding is properly enabled and used with intitle.
The problem is subtle: we apply asciifolding_preserve after kstem, and kstem appears to ignore terms with diacritics:
At index time: Louÿs => (kstem) => Louÿs => (asciifolding_preserve) => Louÿs|Louys
At query time: Louys => (kstem) => Louy => (asciifolding_preserve) => Louy
Louy will never match any of the terms generated at index time.
- Solution would be to move asciifolding before kstem (reindex needed).
- Or to include title.plain in the filter (hackish but no reindex needed)
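To make the ordering problem concrete, here is a minimal Python sketch of the two filter orders. It is not the real analysis chain: `toy_stem` is a hypothetical stand-in for kstem that only strips a trailing "s" from pure-ASCII terms (mimicking the observation that kstem leaves terms with diacritics alone), and `fold_preserve` approximates asciifolding_preserve with Unicode decomposition.

```python
import unicodedata


def toy_stem(term: str) -> str:
    # Hypothetical stand-in for kstem (NOT the real algorithm): strip a
    # trailing "s", but only from pure-ASCII terms, so that terms with
    # diacritics pass through unchanged.
    if term.isascii() and len(term) > 3 and term.endswith("s"):
        return term[:-1]
    return term


def fold(term: str) -> str:
    # Approximate ASCII folding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_preserve(term: str) -> set:
    # Like asciifolding with the original preserved: emit both forms.
    return {term, fold(term)}


def stem_then_fold(text: str) -> set:
    # Current order: stemmer first, folding second.
    return fold_preserve(toy_stem(text.lower()))


def fold_then_stem(text: str) -> set:
    # Proposed order: folding first, stemmer second.
    return {toy_stem(t) for t in fold_preserve(text.lower())}


# Current order: indexed terms and query terms never overlap.
print(stem_then_fold("Louÿs"), stem_then_fold("Louys"))
# Proposed order: both sides share the stemmed, folded term.
print(fold_then_stem("Louÿs"), fold_then_stem("Louys"))
```

With the current order, the indexed terms for "Louÿs" and the query terms for "Louys" have nothing in common; with folding first, both contain the same stemmed term, which is why the first solution requires a reindex.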
I overlooked the wiki mentioned in this task; the previous comment (kstem and asciifolding filter ordering) applies to English wikis.
In this case, the analysis config for French does not include any asciifolding for the stem field. We should investigate adding asciifolding/icufolding to the French analysis chain.
I'm looking into this as part of T142620, and I've discovered some interesting things about Elasticsearch's default French analysis chain. There is ascii-folding for some characters (á â à é ê è î ô û ù ç), but not others (ä ë í ï ì ó ö ò ú ü ÿ œ æ). The tréma/umlaut/diaeresis doesn't ever get folded, and that's causing trouble for Louÿs.
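For reference, here is a small Python sketch of what folding every character in both lists would produce. It is an illustration only, not Elasticsearch's implementation: it relies on Unicode decomposition, and the `LIGATURES` map for œ and æ is my own assumption, added because those characters have no Unicode decomposition.

```python
import unicodedata

# Assumed expansions for ligatures, which NFD does not decompose.
LIGATURES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"}


def fold_to_ascii(text: str) -> str:
    # Expand ligatures, decompose accented letters, drop combining marks.
    expanded = "".join(LIGATURES.get(c, c) for c in text)
    decomposed = unicodedata.normalize("NFD", expanded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# The two groups of characters observed in the default French chain.
FOLDED = "á â à é ê è î ô û ù ç".split()
NOT_FOLDED = "ä ë í ï ì ó ö ò ú ü ÿ œ æ".split()

for ch in FOLDED + NOT_FOLDED:
    print(ch, "->", fold_to_ascii(ch))
```

A folding step that behaves like this for the second group would let "Louys" and "Louÿs" reduce to the same term.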
I believe this is fixed as a result of T144429. My local vagrant/mediawiki language is set to French. I created a new page "Pierre Louÿs". It comes back when I search for any of the following:
Please note that this will not immediately fix the problem in Wikisource because the wiki needs to be re-indexed for the changes to take effect. Fortunately, we're planning a re-index soon for BM25, so this should go live by the end of the quarter (Sept 30, 2016) if there are no delays.