
Function to replace invisible characters with blank
Closed, ResolvedPublic

Description

AntiSpoof's current design only allows replacing one character with another. What I am proposing here is to replace a character with nothing (blank), so we need new functionality for it.

Example: The following texts look the same but are different: ABCD , A‌B‌C‌D

The first one is only four characters, A to D; the second has a &zwnj; (Unicode character 200C) between each pair of adjacent letters. Replacing ZWNJ with another character is NOT the solution. Instead, we need a function that can replace ZWNJ with nothing.

This applies to at least the following characters, all of which are "invisible" (i.e. they don't have any width):

  • Zero-width space (200B)
  • Zero-width non-joiner (200C)
  • Zero-width joiner (200D)
  • Left-to-right mark (200E)
  • Right-to-left mark (200F)
  • Line separator (2028)
  • Paragraph separator (2029)
  • Left-to-right embedding (202A)
  • Right-to-left embedding (202B)
  • Left-to-right override (202D)
  • Right-to-left override (202E)

And perhaps:

  • Left-to-right isolate (2066)
  • Right-to-left isolate (2067)
  • First strong isolate (2068)
  • Pop directional isolate (2069)

The top table on https://en.wikibooks.org/wiki/Unicode/Character_reference/2000-2FFF can be a good reference.
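To illustrate the proposal, here is a minimal Python sketch of "replace with nothing". It is not AntiSpoof's or Equivset's actual code; the names `INVISIBLE_CODE_POINTS` and `strip_invisible` are hypothetical, and the code points are simply the ones listed above (including the "perhaps" group).

```python
# Hypothetical illustration only; not the AntiSpoof/Equivset implementation.
# Code points copied from the lists in the task description.
INVISIBLE_CODE_POINTS = [
    0x200B, 0x200C, 0x200D, 0x200E, 0x200F,  # zero-width space/joiners, LRM, RLM
    0x2028, 0x2029,                          # line and paragraph separators
    0x202A, 0x202B, 0x202D, 0x202E,          # directional embeddings and overrides
    0x2066, 0x2067, 0x2068, 0x2069,          # directional isolates ("perhaps" group)
]

# str.translate() deletes every character whose code point maps to None.
_DELETE_TABLE = {cp: None for cp in INVISIBLE_CODE_POINTS}


def strip_invisible(text: str) -> str:
    """Return text with the listed invisible characters removed."""
    return text.translate(_DELETE_TABLE)


plain = "ABCD"
spoofed = "A\u200cB\u200cC\u200cD"  # ZWNJ between each pair of letters

assert plain != spoofed                   # different strings that look identical
assert strip_invisible(spoofed) == plain  # equal once ZWNJ is replaced with nothing
```

A deletion table is used here rather than a one-to-one mapping because the whole point of the request is that these characters should map to the empty string, not to another character.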

Event Timeline

Huji added subscribers: dmaza, Legoktm.

It might make sense to postpone fixing this until T174197 is resolved. Notifying @dmaza and @Legoktm so they can opine as well.

Huji updated the task description.

Probably the character 2062 (INVISIBLE TIMES) should be added to the list. For example, https://community.wikia.com/wiki/Special:Contributions/Low_Spark_of_Lyman%E2%81%A2%E2%81%A2 has this character appended to this user name, but it's visually indistinguishable from https://community.wikia.com/wiki/Special:Contributions/Low_Spark_of_Lyman .
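One way to confirm what is hiding at the end of such a user name is to percent-decode the last URL segment and print the trailing code points. The sketch below uses only the Python standard library; the helper name `describe_tail` is hypothetical and introduced purely for illustration.

```python
import unicodedata
from urllib.parse import unquote


def describe_tail(encoded_name: str, count: int = 2) -> list:
    """Percent-decode a URL segment and describe its last `count` characters."""
    decoded = unquote(encoded_name)  # decodes %XX escapes as UTF-8 by default
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in decoded[-count:]
    ]


# Last path segment of the first URL in the comment above.
segment = "Low_Spark_of_Lyman%E2%81%A2%E2%81%A2"

for code_point, name in describe_tail(segment):
    print(code_point, name)

# The decoded name is not equal to the visually identical plain one.
assert unquote(segment) != "Low_Spark_of_Lyman"
```

The byte sequence `%E2%81%A2` is the UTF-8 encoding of 2062, so the segment ends with that character twice.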

Change 537778 had a related patch set uploaded (by TK-999; owner: TK-999):
[mediawiki/libs/Equivset@master] Extend the blacklist of invisible characters

https://gerrit.wikimedia.org/r/537778

Once invisible characters are no longer allowed in page titles, they will also be invalid in user names, which fixes this for AntiSpoof as well.

See T44807: Invisible Unicode characters allowed on pagetitle (\u200E | \uFEFF | \u200B) and sub tasks

But Equivset already has some replacements, so it should be okay to add more (from T171846).

de.wp has an article about such characters - https://de.wikipedia.org/wiki/Bidirektionales_Steuerzeichen

Or
https://en.wikipedia.org/wiki/General_Punctuation / https://de.wikipedia.org/wiki/Unicodeblock_Allgemeine_Interpunktion

Change 537778 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Extend the blacklist of invisible characters

https://gerrit.wikimedia.org/r/537778

Umherirrender assigned this task to TK-999.