Searching for Sapulpa-Beggs should find "U.S. Route 75 Alternate (Beggs–Sapulpa, Oklahoma)". Without splitting words on dashes it has no chance.
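To illustrate why the split matters, here is a minimal Python sketch, not CirrusSearch's actual analysis chain. The `DASH_LIKE` set, the `tokenize` and `matches` helpers, and the punctuation handling are all hypothetical stand-ins for a real tokenizer.

```python
# Hypothetical set of dash-like code points; not CirrusSearch's real configuration.
DASH_LIKE = "\u002D\u2010\u2011\u2012\u2013\u2014\u2E40\u30A0\uFE63"
_DASH_TO_SPACE = str.maketrans({ch: " " for ch in DASH_LIKE})
_PUNCT = ".,()"


def tokenize(text, split_dashes=True):
    """Lowercasing whitespace tokenizer that optionally breaks words on dashes."""
    if split_dashes:
        text = text.translate(_DASH_TO_SPACE)
    tokens = (tok.strip(_PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def matches(query, title, split_dashes=True):
    """True if every query token also occurs among the title tokens."""
    return set(tokenize(query, split_dashes)) <= set(tokenize(title, split_dashes))


# The title uses an en dash (U+2013); the query uses a plain hyphen-minus.
TITLE = "U.S. Route 75 Alternate (Beggs\u2013Sapulpa, Oklahoma)"
QUERY = "Sapulpa-Beggs"

print(matches(QUERY, TITLE, split_dashes=True))
print(matches(QUERY, TITLE, split_dashes=False))
```

With splitting, the query becomes the two tokens `sapulpa` and `beggs`, both of which appear in the title. Without it, the query stays a single token that no title token equals.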
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Harmonize Hyphens and ZWSPs in Base Thai | mediawiki/extensions/CirrusSearch | master | +19 -4 |
Related Objects
- Mentioned In
- T405020: Harmonize Invisibles in Cirrus Language Analysis
- Mentioned Here
- T405020: Harmonize Invisibles in Cirrus Language Analysis
Event Timeline
@Gehel, we use the icu_tokenizer (or our version of it) almost everywhere, and it does split on lots of hyphens and dashes correctly. I tested a bunch of dash-like symbols (- ‐ ‑ ﹣ - ‒ – — ゠ ⹀), and tested all of the tokenizers we use: ICU, standard, smartCN, Hebrew, sudachi, Kuromoji, Nori, Thai.
The Thai tokenizer (which is not used in prod; we use the ICU tokenizer when available) doesn't split on some rarer hyphen-like characters (‑ ﹣ ‒ ⹀ ゠), though at least 4 of the 5 occur in thwiki.
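One way to harmonize these is to map the rarer characters to a plain hyphen-minus before tokenizing. The sketch below is an illustration under assumptions, not the contents of the patch: the code points are my reading of the five characters listed above, and `harmonize` is a hypothetical helper.

```python
import unicodedata

# Assumed code points for the five rarer hyphen-like characters.
RARE_HYPHENS = ["\u2011", "\uFE63", "\u2012", "\u2E40", "\u30A0"]

# Map each rare character to the ordinary hyphen-minus (U+002D).
_HARMONIZE = str.maketrans({ch: "-" for ch in RARE_HYPHENS})


def harmonize(text):
    """Replace rare hyphen-like characters with a plain hyphen-minus."""
    return text.translate(_HARMONIZE)


# Print each character's code point and Unicode name for reference.
for ch in RARE_HYPHENS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

After this step, a tokenizer that already splits on the hyphen-minus treats all five characters the same way.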
I also tested the soft hyphen, which opened another can of worms on invisibles (see T405020).
Since figuring out the situation was ⅔ of fixing it, I'm going to claim this task and put up a patch.
Change #1189542 had a related patch set uploaded (by Tjones; author: Tjones):
[mediawiki/extensions/CirrusSearch@master] Harmonize Hyphens and ZWSPs in Base Thai
Change #1189542 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Harmonize Hyphens and ZWSPs in Base Thai