
CirrusSearch: If user searches with a dash in the word then filter to only words with the dash
Closed, Resolved (Public)

Description

If a user searches an accent-squashing wiki with an accented string, only accented results should be returned. Example:
Search for <<clientèle>> should only find pages with <<clientèle>>
Search for <<clientele>> should find pages with <<clientele>> and <<clientèle>>

Option: only enable this behaviour when a string is quoted. Quoting is standard parlance for "please give me an exact match". We still want quoted unaccented strings to find the accented characters.
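The quoted-string option can be sketched as a small matching rule. This is only an illustration of the requested behaviour, not CirrusSearch or Lucene code: `fold_accents` and `term_matches` are hypothetical names, and stripping Unicode combining marks is assumed here as a stand-in for whatever accent squashing the analyzer really does.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks after NFD decomposition (assumed stand-in for accent squashing)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def term_matches(query: str, term: str) -> bool:
    """Return True if an indexed term should match the query under the quoted-string option."""
    quoted = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    # Normalize to NFC so precomposed and decomposed accents compare equal.
    word = unicodedata.normalize("NFC", query.strip('"')).lower()
    term = unicodedata.normalize("NFC", term).lower()
    if quoted and fold_accents(word) != word:
        # Quoted query that contains accents: only the accented form matches.
        return term == word
    # Unquoted query, or quoted query without accents: compare folded forms.
    return fold_accents(term) == fold_accents(word)
```

With this rule, `"clientèle"` (quoted) matches only clientèle, while `"clientele"` and any unquoted form match both spellings.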


Version: unspecified
Severity: normal
See Also:
https://github.com/elasticsearch/elasticsearch/issues/4931
https://issues.apache.org/jira/browse/LUCENE-5437
https://bugzilla.wikimedia.org/show_bug.cgi?id=63633

Details

Reference
bz60299

Event Timeline

bzimport raised the priority of this task to Normal. (Nov 22 2014, 2:53 AM)
bzimport added a project: CirrusSearch.
bzimport set Reference to bz60299.

Also, LuceneSearch has special handling for hyphenated words that pretty much does the same thing as I'm proposing for accents. It looks like it only does it for "exact" tokens. In CirrusSearch we call those "plain" tokens. See FastWikiTokenizerEngine.java:332 for more.

I've opened a bug in Elasticsearch for this but it needs to be fixed in their upstream, Lucene, so I've opened a bug there and begun work.

Both upstream bugs were closed. Is anything stopping this?

(In reply to Nik Everett from comment #0)

Search for <<clientèle>> should only find pages with <<clientèle>>
Search for <<clientele>> should find pages with <<clientele>> and <<clientèle>>

Can the two only be fixed together? The first may not be that important as long as exact matches come first.

On the other hand, the second has been requested repeatedly by several it.wiktionary users.

  • Searching "macor" should find "mačor" but it doesn't, one has to search ma*or.
  • Searching "tamen" should also find "tāmen", ideally in autocompletion suggestions too.

https://it.wiktionary.org/wiki/Wikizionario:Bar/Archivio/2013-dic#Nuovo_motore_di_ricerca_interno
https://it.wiktionary.org/w/index.php?title=Wikizionario:Bar&diff=1673990&oldid=1673091

(In reply to Nemo from comment #3)

Both upstream bugs were closed. Is anything stopping this?

Just timing. I'm upgrading the cluster tomorrow. I don't like to merge code that won't work on the cluster so I haven't even picked this one up again since working on the upstream bug. I'll do it, though.

Can the two only be fixed together? The first may not be that important as long as exact matches come first.

On the other hand, the second has been requested repeatedly by several it.wiktionary users.

  • Searching "macor" should find "mačor" but it doesn't, one has to search ma*or.
  • Searching "tamen" should also find "tāmen", ideally in autocompletion suggestions too.

https://it.wiktionary.org/wiki/Wikizionario:Bar/Archivio/2013-dic#Nuovo_motore_di_ricerca_interno
https://it.wiktionary.org/w/index.php?title=Wikizionario:Bar&diff=1673990&oldid=1673091

Only English has any sort of accent squashing at this point. I can build it for you soon.

The bug is actually about the first. Right now, in English, if you search for "mačor" you'll get "macor" which frustrates some folks.

Change 126995 had a related patch set uploaded by Manybubbles:
Quoted searches with accents only find accented

https://gerrit.wikimedia.org/r/126995

That only covers the accent squashing, not the dashes. Dashes are harder, I think. I'll have to think about them some more....

Change 126995 merged by jenkins-bot:
Quoted searches with accents only find accented

https://gerrit.wikimedia.org/r/126995

Dash searching is now possible using regexes, which are being deployed now.
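As a rough illustration of how a regex can restrict results to the dashed form of a word, here is a sketch that filters candidate texts. It does not show the CirrusSearch regex syntax or implementation; `dashed_word_pattern` and `filter_exact_dash` are hypothetical helpers, and treating letters, digits, underscores and dashes as word characters is an assumption.

```python
import re


def dashed_word_pattern(word: str) -> re.Pattern:
    """Build a pattern matching `word` literally, not embedded in a longer dashed word."""
    # The lookarounds reject matches that are preceded or followed by
    # a word character or another dash.
    return re.compile(r"(?<![\w-])" + re.escape(word) + r"(?![\w-])")


def filter_exact_dash(word: str, texts: list[str]) -> list[str]:
    """Keep only the texts that contain the dashed word exactly as written."""
    pattern = dashed_word_pattern(word)
    return [text for text in texts if pattern.search(text)]
```

For example, filtering for `e-mail` keeps "Send an e-mail today" but drops "Send an email today" and "e mail".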