
Full text search needs to split words on dashes
Closed, Resolved | Public | 2 Estimated Story Points

Description

Searching for Sapulpa-Beggs should find "U.S. Route 75 Alternate (Beggs–Sapulpa, Oklahoma)". Without splitting words on dashes it has no chance.
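To illustrate the problem, here is a minimal sketch in Python. It is not CirrusSearch or Elasticsearch code; `tokenize`, `strip_punct`, and `matches` are hypothetical helpers, and "dash" is assumed to mean any character in Unicode category `Pd`. It compares a whitespace-only tokenizer with one that also splits on dashes.

```python
import unicodedata

# Title from the task description; \u2013 is the en dash between the town names.
TITLE = "U.S. Route 75 Alternate (Beggs\u2013Sapulpa, Oklahoma)"


def strip_punct(token: str) -> str:
    """Trim leading and trailing punctuation (Unicode categories P*)."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def tokenize(text: str, split_dashes: bool) -> list[str]:
    """Lowercase and split on whitespace; optionally also split on dashes."""
    if split_dashes:
        # Assumption: treat every dash-punctuation character (Pd) as a separator.
        text = "".join(
            " " if unicodedata.category(ch) == "Pd" else ch for ch in text
        )
    tokens = (strip_punct(raw) for raw in text.lower().split())
    return [tok for tok in tokens if tok]


def matches(query: str, title: str, split_dashes: bool) -> bool:
    """True if every query token also appears among the title tokens."""
    title_tokens = set(tokenize(title, split_dashes))
    return all(tok in title_tokens for tok in tokenize(query, split_dashes))


print(matches("Sapulpa-Beggs", TITLE, split_dashes=False))  # False
print(matches("Sapulpa-Beggs", TITLE, split_dashes=True))   # True
```

Without dash splitting, the query stays a single token, `sapulpa-beggs`, which cannot equal the title token `beggs–sapulpa` (different dash, different word order). With dash splitting, both sides reduce to `beggs` and `sapulpa`.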

Event Timeline

Manybubbles raised the priority of this task from to Needs Triage.
Manybubbles updated the task description.
Manybubbles subscribed.

@TJones do the various ICU analysis steps take care of this already?

TJones set the point value for this task to 2.

@Gehel, we use the icu_tokenizer (or our version of it) almost everywhere, and it does split on lots of hyphens and dashes correctly. I tested a bunch of dash-like symbols (- ‐ ‑ ﹣ - ‒ – — ゠ ⹀), and tested all of the tokenizers we use: ICU, standard, smartCN, Hebrew, sudachi, Kuromoji, Nori, Thai.

The Thai tokenizer (which is not used in prod; we use the ICU tokenizer when available) doesn't split on some rarer hyphen-like characters (‑ ﹣ ‒ ⹀ ゠), though at least 4 of the 5 occur in thwiki.
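As a rough sketch of one way to close that kind of gap (not necessarily what the eventual patch does), the rarer characters can be mapped to a plain hyphen-minus before the text reaches the tokenizer. The code points below are my reading of the five characters listed above and should be treated as an assumption; `harmonize_hyphens` and `describe` are hypothetical helper names.

```python
import unicodedata

# Assumed code points for the five characters above:
# U+2011, U+FE63, U+2012, U+2E40, U+30A0.
RARE_HYPHENS = "\u2011\ufe63\u2012\u2e40\u30a0"

# Translation table mapping each rarer hyphen to ASCII hyphen-minus (U+002D).
HYPHEN_MAP = str.maketrans({ch: "-" for ch in RARE_HYPHENS})


def harmonize_hyphens(text: str) -> str:
    """Replace the rarer hyphen-like characters with a plain hyphen-minus."""
    return text.translate(HYPHEN_MAP)


def describe(chars: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) pairs, handy when auditing a list."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars]


for code_point, name in describe(RARE_HYPHENS):
    print(code_point, name)

print(harmonize_hyphens("Beggs\u2011Sapulpa"))  # Beggs-Sapulpa
```

A character-level mapping like this runs before tokenization, so a tokenizer that already splits on the plain hyphen-minus then handles the rarer variants the same way.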

I also tested the soft hyphen, which opened another can of worms on invisibles (see T405020).

Since figuring out the situation was ⅔ of fixing it, I'm going to claim this task and put up a patch.

Change #1189542 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize Hyphens and ZWSPs in Base Thai

https://gerrit.wikimedia.org/r/1189542

Change #1189542 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize Hyphens and ZWSPs in Base Thai

https://gerrit.wikimedia.org/r/1189542