Unicode digits are unsearchable
Open, Low · Public

Description

On Wikipedia, km² is impossible to target in a search, yet Google reports "km²" on well over 120,000 pages.

But Unicode digits

  • have not been normalized. A basic search for "mm3" or "km2" finds no normalized ³ or ² character in the index.
  • are treated like punctuation. A basic search for "mm³" finds mm.
  • fail in regex strings longer than two characters. /mm³/ and /km²/ fail to match.
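
For comparison, the kind of normalization the first bullet refers to can be sketched with Python's standard unicodedata module. This is only an illustration of Unicode compatibility normalization (NFKC), not a description of how the search index is actually built; the helper name fold_superscripts is hypothetical.

```python
import unicodedata


def fold_superscripts(text: str) -> str:
    """Apply NFKC compatibility normalization (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


for unit in ("km²", "mm³"):
    last = unit[-1]
    # Superscript digits have general category "No" (Number, other),
    # so Unicode itself does not classify them as punctuation.
    print(unit, "->", fold_superscripts(unit), unicodedata.category(last))
```

Under NFKC, ² and ³ fold to the ASCII digits 2 and 3, so a normalized index would let "km2" and "km²" find each other.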

Major templates such as Convert and Val support Unicode digits in either form, km² or km2. In mainspace, 5% of pages that use <sup>2 also use ².

Confusingly, km² is recognized by the highlighter, but when you remove the actual matches (the single-character Unicode strings ²|³) from the pattern, nothing is found.
For example, see insource:/²|³|km²/ prefix:Chem. Also, the typeahead analyzer works fine for mm³ or km².
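
As a point of reference for what these regex searches would be expected to return, here is a minimal sketch using Python's re module on made-up sample text. It says nothing about the search backend's regex engine; it only shows that ², ³, and km² are ordinary characters to a Unicode-aware matcher.

```python
import re

# Made-up sample lines, not taken from any wiki page.
samples = ["area of 12 km²", "volume of 5 mm³", "x² + y³", "12 km2"]

# Same alternation as the insource: example above.
pattern = re.compile(r"²|³|km²")

for line in samples:
    hits = pattern.findall(line)
    print(f"{line!r}: {hits}")
```

Here the three-character alternative km² matches just as the single-character ones do, and the plain-ASCII "km2" line produces no hits.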

To see how two characters work but three fail, without running a bare regex on millions of pages, here's a small domain with some /²|³/ hits.

T41501 says unicode quotes are not normalized, and this one says ² and ³ are not normalized. But digits are indexed and quotes are not.

T95849 considers analyzers, filtering, and fields, and shows enwiki page mapping properties while troubleshooting the unicode ★ character.
But the black star, although not found in indexed searches, can be found using regex,
and other Unicode characters are also found in regex strings.

Event Timeline

Cpiral created this task. · Nov 28 2015, 8:54 PM
Cpiral raised the priority of this task from to Needs Triage.
Cpiral updated the task description.
Cpiral added a subscriber: Cpiral.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Nov 28 2015, 8:54 PM
Cpiral renamed this task from Unicode superscript numbers are ignored by both regex and indexed searches normalized and is ignored by regex to Unicode superscript numbers are ignored by both regex and indexed searches. · Nov 28 2015, 9:17 PM
Cpiral set Security to None.
Restricted Application added a project: Discovery. · Nov 28 2015, 9:19 PM
Cpiral updated the task description. · Edited · Nov 28 2015, 9:50 PM

T95849 shows a direct analysis of a single Unicode character from six months ago, but has been put "up for grabs" twice, saying

  • insource:/★/ shows 35, but misses Emoji and Miscellaneous Symbols.
  • insource:★ finds nothing
  • type-ahead searching for ★ finds three titles that begin with that char, but intitle: ★ finds nothing.

I would add that prefix:★ finds nothing, and probably no indexed search handles unicode.

Cpiral renamed this task from Unicode superscript numbers are ignored by both regex and indexed searches to Unicode superscript numbers are usually off. · Nov 29 2015, 9:13 AM
Cpiral updated the task description.
Cpiral updated the task description. · Nov 30 2015, 5:22 AM
Cpiral updated the task description. · Nov 30 2015, 7:10 AM
Cpiral updated the task description. · Nov 30 2015, 8:14 AM
Cpiral renamed this task from Unicode superscript numbers are usually off to Unicode digits are unsearchable. · Dec 1 2015, 10:37 AM
Cpiral updated the task description.
Cpiral updated the task description. · Dec 2 2015, 12:05 AM
Deskana triaged this task as Low priority. · Dec 4 2015, 5:29 AM
Deskana moved this task from Needs triage to Search on the Discovery board.
Deskana added a subscriber: Deskana.