ÿ in Spécial:IndexPages search
Closed, ResolvedPublic

Assigned To

Authored By

	Yann
	Jul 24 2016, 9:34 AM

Description

Searching for index in https://fr.wikisource.org/wiki/Auteur:Pierre_Lou%C3%BFs doesn't work. The link is wrong. It should be https://fr.wikisource.org/w/index.php?title=Sp%C3%A9cial%3AIndexPages&limit=100&key=Pierre+intitle%3ALou%C3%BFs&order=quality
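As a minimal sketch of how the expected link in the description could be built, the following Python snippet URL-encodes the same query parameters with the standard library. The parameter names and values are copied from the URL above; the helper name `index_pages_url` is hypothetical and is not part of any Wikisource or ProofreadPage code.

```python
from urllib.parse import urlencode

BASE = "https://fr.wikisource.org/w/index.php"

def index_pages_url(first_name, surname, limit=100, order="quality"):
    """Build a Spécial:IndexPages search URL (hypothetical helper)."""
    params = {
        "title": "Spécial:IndexPages",
        "limit": limit,
        "key": f"{first_name} intitle:{surname}",
        "order": order,
    }
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into "+".
    return f"{BASE}?{urlencode(params)}"

url = index_pages_url("Pierre", "Louÿs")
print(url)
```

The key point is that "ÿ" has to be sent as the UTF-8 sequence `%C3%BF` inside the `key` parameter, next to the `intitle:` keyword.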

Related Objects

Status	Assigned	Task
Resolved	Deskana	T139575 EPIC: Plan to enable BM25 on fulltext search
Resolved	TJones	T141216 ÿ in Spécial:IndexPages search
Resolved	TJones	T142620 Test effect of adding ascii-folding on French Wikipedia

Event Timeline

Yann created this task. Jul 24 2016, 9:34 AM

Restricted Application added a subscriber: Aklapper. Jul 24 2016, 9:34 AM

Yann added a project: All-and-every-Wikisource. Jul 24 2016, 9:34 AM

Peachey88 added projects: ProofreadPage, CirrusSearch. Jul 24 2016, 10:54 AM

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Jul 24 2016, 10:54 AM

I don't understand what you are trying to do, @Yann. It might be worth going back and explaining, step by step, what you are trying to achieve.

Here is more information. Special:IndexPages on the French Wikisource treats accented and unaccented characters as equivalent when searching. For example, searches for "intitle:Molière" and "intitle:Moliere" seem to produce the same results, which is good.

But this behavior does not work for "ÿ", which should be treated as "y" when searching. The search should correctly handle all accented letters in (at least) the Latin-1 Supplement Unicode block.

(I don't know if this is a Wikimedia issue or if it should be handled by someone in the French Wikisource.)
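To illustrate the behavior being requested, here is a small Python sketch that folds accented letters by decomposing them and dropping the combining marks. It is only an illustration of the expected equivalence; it is not how CirrusSearch or Elasticsearch implements folding, and the names `fold_accents` and `titles_match` are assumptions made for this example.

```python
import unicodedata

def fold_accents(text):
    """Strip diacritics: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def titles_match(query, title):
    """Compare two strings while ignoring accents and case."""
    return fold_accents(query).casefold() == fold_accents(title).casefold()

print(fold_accents("Molière"))        # accent on "è" is dropped
print(fold_accents("Louÿs"))          # diaeresis on "ÿ" is dropped
print(titles_match("louys", "Louÿs"))
```

Under this kind of folding, "ÿ" is handled the same way as "è": both decompose into a base letter plus a combining mark.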

Triaging as low as we're waiting for more information at this point.

In T141216#2503712, @debt wrote:

Triaging as low as we're waiting for more information at this point.

What more information, and from whom?

@debt, I believe that this has been clarified.

An association needs to be made between y and ÿ, and I think there is an indication that each Roman letter [A-Za-z] should be associated with every variation of the same letter (grave, acute, macron, ...).

Billinghurst unsubscribed. Jul 29 2016, 10:06 AM

I think intitle is a bit restrictive and appears to disable accent folding.
I'd suggest investigating further while working on T137830.

FTR: I don't think this is related to icu/asciifolding.
asciifolding is properly enabled and used with intitle.

The problem is subtle: we apply asciifolding_preserve after kstem, and kstem appears to ignore terms with diacritics:
At index time: Louÿs => (kstem) => Louÿs => (asciifolding_preserve) => Louÿs|Louys
At query time: Louys => (kstem) => Louy => (asciifolding_preserve) => Louy

Louy will never match any of the terms generated at index time.

One solution would be to move asciifolding before kstem (reindex needed).
Another would be to include title.plain in the filter (hackish, but no reindex needed).
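The ordering problem described above can be sketched in Python. This is a toy model under stated assumptions, not the real analysis chain: `toy_stem` is a hypothetical stand-in for kstem that only drops a trailing "s" from ASCII-only terms (mimicking the observation that terms with diacritics are left alone), and `fold_preserve` imitates a folding filter that keeps the original term alongside the folded one.

```python
import unicodedata

def fold(term):
    """Strip diacritics via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_stem(term):
    # Hypothetical stand-in for kstem: stems ASCII-only terms, skips the rest.
    if term.isascii() and term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term

def fold_preserve(terms):
    """Emit each term plus its folded form (original preserved)."""
    return {t for term in terms for t in (term, fold(term))}

def stem_then_fold(term):
    """Order described in the comment: stem first, fold afterwards."""
    return fold_preserve({toy_stem(term)})

def fold_then_stem(term):
    """Proposed order: fold first, then stem every emitted term."""
    return {toy_stem(t) for t in fold_preserve({term})}

for chain in (stem_then_fold, fold_then_stem):
    indexed, queried = chain("Louÿs"), chain("Louys")
    print(chain.__name__, sorted(indexed), sorted(queried), bool(indexed & queried))
```

With the stemmer first, the index holds `Louÿs` and `Louys` while the query becomes `Louy`, so nothing overlaps. With folding first, both sides produce `Louy`, which is why the reordering requires a reindex: the indexed terms themselves change.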

TJones mentioned this in T142037: Test effect of re-ordering kstem and asciifolding on English Wikipedia. Aug 3 2016, 8:10 PM

We'll look at this along with T137830

I overlooked the wiki mentioned in this task; the previous comment (kstem and asciifolding filter ordering) applies to English wikis.
In this case, the analysis config for French does not include any asciifolding for the stem field. We should investigate adding asciifolding/icufolding to the French analysis chain.

debt added a parent task: T142620: Test effect of adding ascii-folding on French Wikipedia. Aug 11 2016, 4:36 PM

debt removed a parent task: T142620: Test effect of adding ascii-folding on French Wikipedia.

debt added a subtask: T142620: Test effect of adding ascii-folding on French Wikipedia.

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search. Aug 11 2016, 4:38 PM

debt added a parent task: T139575: EPIC: Plan to enable BM25 on fulltext search.

I'm looking into this as part of T142620, and I've discovered some interesting things about Elasticsearch's default French analysis chain. There is ascii-folding for some characters (á â à é ê è î ô û ù ç), but not others (ä ë í ï ì ó ö ò ú ü ÿ œ æ). The tréma/umlaut/diaeresis doesn't ever get folded, and that's causing trouble for Louÿs.
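As a rough illustration of why a single folding rule may not cover every character in that second list, the Python sketch below checks which of them change under plain Unicode decomposition. It says nothing about how Elasticsearch's French analyzer is implemented; the ligature table `LIGATURES` and the function names are assumptions made for this example.

```python
import unicodedata

# Characters reported above as not folded by the default French chain.
UNFOLDED = "äëíïìóöòúüÿœæ"

# Hypothetical explicit mappings for letters that have no Unicode decomposition.
LIGATURES = {"œ": "oe", "æ": "ae", "Œ": "OE", "Æ": "AE"}

def strip_marks(text):
    """Fold by NFD decomposition and removal of combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_all(text):
    """Expand ligatures first, then strip the remaining diacritics."""
    expanded = "".join(LIGATURES.get(c, c) for c in text)
    return strip_marks(expanded)

# Letters that decomposition alone leaves untouched.
needs_mapping = [c for c in UNFOLDED if strip_marks(c) == c]
print(needs_mapping)
print(fold_all("cœur"), fold_all("Louÿs"))
```

The letters with a tréma decompose into a base letter plus a combining mark, so mark stripping handles them; "œ" and "æ" do not decompose and need an explicit mapping in any folding scheme of this kind.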

EBernhardson subscribed. Aug 15 2016, 10:10 PM

debt closed subtask T142620: Test effect of adding ascii-folding on French Wikipedia as Resolved. Sep 1 2016, 8:52 PM

TJones claimed this task. Sep 12 2016, 3:54 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

I believe this is fixed as a result of T144429. My local vagrant/mediawiki language is set to French. I created a new page "Pierre Louÿs". It comes back when I search for any of the following:

louÿs
louys
intitle:louÿs
intitle:louys

Please note that this will not immediately fix the problem in Wikisource because the wiki needs to be re-indexed for the changes to take effect. Fortunately, we're planning a re-index soon for BM25, so this should go live by the end of the quarter (Sept 30, 2016) if there are no delays.

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Sep 12 2016, 9:09 PM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Sep 15 2016, 11:03 PM

debt closed this task as Resolved. Sep 16 2016, 6:41 PM

TJones mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. Oct 6 2016, 6:04 PM

Billinghurst moved this task from Backlog to Done: to deploy/check on the ProofreadPage board. Mar 5 2017, 2:05 AM
