Searching for index in https://fr.wikisource.org/wiki/Auteur:Pierre_Lou%C3%BFs doesn't work. The link is wrong. It should be https://fr.wikisource.org/w/index.php?title=Sp%C3%A9cial%3AIndexPages&limit=100&key=Pierre+intitle%3ALou%C3%BFs&order=quality
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Deskana | T139575 EPIC: Plan to enable BM25 on fulltext search
Resolved | | TJones | T141216 ÿ in Spécial:IndexPages search
Resolved | | TJones | T142620 Test effect of adding ascii-folding on French Wikipedia
Event Timeline
I don't understand what you are trying to do @Yann. Might be worth going back and stepwise explain what you are looking to achieve.
Here is more information. Special:IndexPages on the French Wikisource treats accented and unaccented characters as equivalent when searching. For example, searches for "intitle:Molière" and "intitle:Moliere" seem to produce the same results, which is good.
But this behavior does not work for "ÿ", which should be treated as "y" when searching. The search should correctly handle all accented letters in (at least) the Latin-1 Supplement Unicode block.
(I don't know if this is a Wikimedia issue or if it should be handled by someone in the French Wikisource.)
@debt, I believe that this has been clarified.
- association to be made between y and ÿ
and I think there is an indication that each Roman letter [A-Za-z] should be associated with every variation of the same letter (grave, acute, macron, ...).
I think intitle is a bit restrictive and appears to disable accent folding.
I'd suggest to investigate further while working on T137830.
FTR: I don't think this is related to icu/asciifolding.
asciifolding is properly enabled and used with intitle.
The problem is subtle: we apply asciifolding_preserve after kstem, and kstem appears to ignore terms with diacritics:
At index time: Louÿs => (kstem) => Louÿs => (asciifolding_preserve) => Louÿs|Louys
At query time: Louys => (kstem) => Louy => (asciifolding_preserve) => Louy
Louy will never match any of the terms generated at index time.
- Solution would be to move asciifolding before kstem (reindex needed).
- Or to include title.plain in the filter (hackish but no reindex needed)
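To make the ordering problem concrete, here is a rough Python sketch of the two filter orders. It is only an illustration: `toy_stem` is a hypothetical stand-in that strips a trailing "s" from pure-ASCII terms (it is not kstem), and `folding_preserve` only imitates the "keep original plus folded form" behavior described above.

```python
import unicodedata


def fold(term):
    """Strip combining marks after NFD decomposition (stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def folding_preserve(terms):
    """Emit each term, plus its folded form when the two differ."""
    out = []
    for term in terms:
        out.append(term)
        folded = fold(term)
        if folded != term:
            out.append(folded)
    return out


def toy_stem(terms):
    """Hypothetical stemmer (NOT kstem): drops a trailing 's' from ASCII-only terms."""
    return [
        t[:-1] if t.isascii() and t.endswith("s") and len(t) > 3 else t
        for t in terms
    ]


def stem_then_fold(term):
    """Current order: stemmer first, folding afterwards."""
    return set(folding_preserve(toy_stem([term])))


def fold_then_stem(term):
    """Proposed order: folding first, stemmer afterwards."""
    return set(toy_stem(folding_preserve([term])))


# Current order: indexed terms and query terms share nothing.
print(stem_then_fold("Louÿs"), stem_then_fold("Louys"))
# Proposed order: both sides produce the stemmed, folded term.
print(fold_then_stem("Louÿs"), fold_then_stem("Louys"))
```

With the stemmer first, the index side yields Louÿs and Louys while the query side yields only Louy, so nothing overlaps. With folding first, both sides contain Louy.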
I overlooked the wiki mentioned in this task; the previous comment (kstem and asciifolding filter ordering) applies to English wikis.
In this case, the analysis config for French does not include any asciifolding for the stem field. We should investigate adding asciifolding/icufolding to the French analysis chain.
I'm looking into this as part of T142620, and I've discovered some interesting things about Elasticsearch's default French analysis chain. There is ascii-folding for some characters (á â à é ê è î ô û ù ç), but not others (ä ë í ï ì ó ö ò ú ü ÿ œ æ). The tréma/umlaut/diaeresis doesn't ever get folded, and that's causing trouble for Louÿs.
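For reference, this is roughly what folding the unfolded characters would look like, sketched in Python rather than in the actual analysis chain. The function name `fold_fr` and the ligature table are my own assumptions for illustration; the real filters may map some characters differently. Note that œ and æ have no Unicode decomposition, so they need an explicit mapping.

```python
import unicodedata

# Assumed mapping for ligatures, which NFD does not decompose.
LIGATURES = str.maketrans({"œ": "oe", "æ": "ae", "Œ": "OE", "Æ": "AE"})


def fold_fr(text):
    """Fold diacritics and ligatures to plain ASCII letters (illustrative only)."""
    text = text.translate(LIGATURES)
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks such as the diaeresis (U+0308).
    return "".join(c for c in decomposed if not unicodedata.combining(c))


for word in ["Louÿs", "Molière", "œuvre", "ä ë í ï ì ó ö ò ú ü ÿ œ æ"]:
    print(word, "=>", fold_fr(word))
```

Applied to both the indexed title and the query, a fold like this would make louÿs and louys compare equal.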
I believe this is fixed as a result of T144429. My local vagrant/mediawiki language is set to French. I created a new page "Pierre Louÿs". It comes back when I search for any of the following:
- louÿs
- louys
- intitle:louÿs
- intitle:louys
Please note that this will not immediately fix the problem in Wikisource because the wiki needs to be re-indexed for the changes to take effect. Fortunately, we're planning a re-index soon for BM25, so this should go live by the end of the quarter (Sept 30, 2016) if there are no delays.