
feature request: replace forbidden characters with lookalike UTF8 signs in the wikipedia search input control
Closed, Resolved · Public

Description

Author: michael.manner

Description:
Replace forbidden characters with lookalike UTF-8 characters in the Wikipedia search input field [Alt-F].

Here are some alternatives:
major:

  • # → &#10723; (⧣) EQUALS SIGN AND SLANTED PARALLEL (U+29E3)

With this replacement it would be possible to find article titles like "C#".

minor:

  • < → &#8249; (‹) SINGLE LEFT-POINTING ANGLE QUOTATION MARK (U+2039)
  • > → &#8250; (›) SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (U+203A)
  • | → &#8739; (∣) DIVIDES (U+2223)
  • { → &#10100; (❴) MEDIUM LEFT CURLY BRACKET ORNAMENT (U+2774)
  • } → &#10101; (❵) MEDIUM RIGHT CURLY BRACKET ORNAMENT (U+2775)

no alternatives found:

  • [
  • ]

Only the CJK characters would be available, but they aren't supported by a large number of fonts.
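
The replacements proposed above could be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the names LOOKALIKES and replace_forbidden are hypothetical, and the square brackets are left untouched because no lookalike was found for them.

```python
# Hypothetical mapping taken from the lists in this report; not MediaWiki code.
LOOKALIKES = {
    "#": "\u29E3",  # EQUALS SIGN AND SLANTED PARALLEL
    "<": "\u2039",  # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    ">": "\u203A",  # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    "|": "\u2223",  # DIVIDES
    "{": "\u2774",  # MEDIUM LEFT CURLY BRACKET ORNAMENT
    "}": "\u2775",  # MEDIUM RIGHT CURLY BRACKET ORNAMENT
}

# Build the translation table once; str.translate does the per-character swap.
_TABLE = str.maketrans(LOOKALIKES)


def replace_forbidden(query):
    """Return the query with each forbidden character swapped for its lookalike."""
    return query.translate(_TABLE)


if __name__ == "__main__":
    print(replace_forbidden("C#"))
```

A search for "C#" would then be sent as "C" followed by U+29E3, which only helps if a page or redirect with that exact title exists.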


Version: unspecified
Severity: enhancement

Details

Reference
bz36954

Event Timeline

bzimport raised the priority of this task from to Low.
bzimport set Reference to bz36954.
bzimport added a subscriber: Unknown Object (MLST).

This sounds like something that would get in the way of AntiSpoof.

mr.heat wrote:

*** This bug has been confirmed by popular vote. ***
Restricted Application added a project: Discovery-Search. · Jul 31 2017, 9:31 PM
TJones closed this task as Resolved. · Jan 30 2019, 10:18 PM
TJones claimed this task.
TJones added a subscriber: TJones.

I'm going to close this because it was written before we moved to Elasticsearch. The current behavior of Elasticsearch is the same for both these characters and their proposed normalization: all of them are ignored during tokenization. In general, we have implemented ICU Normalization for English-language projects, so most non-punctuation characters are normalized well.
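
To illustrate the point about tokenization, here is a deliberately simplified sketch in Python. It is not the Elasticsearch tokenizer or the production analysis chain; the function simple_tokens is a hypothetical stand-in that keeps only runs of word characters, which is enough to show why "#" and its proposed lookalike end up equivalent.

```python
import re


def simple_tokens(text):
    """Toy tokenizer (an assumption, not Elasticsearch): lowercase the text
    and keep only runs of word characters, dropping punctuation and symbols."""
    return re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    # "#" and U+29E3 are both non-word characters, so both are dropped.
    print(simple_tokens("C#"))
    print(simple_tokens("C\u29E3"))
```

Under this simplification both queries reduce to the single token "c", so swapping one character for the other changes nothing in the index lookup.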

If the goal is to be able to find these specific characters, see T211824: Investigate a “rare-character” index.