
UTF8 homoglyph in titles
Closed, Resolved (Public)

Description

Recently, vandals have been creating many articles that seem to have exactly the
same title. They do this by inserting invisible UTF-8 characters into the title,
so we end up with many different articles whose titles all look like, say,
"Karin Stoiber".

The software should refuse to create an article whose title includes an
invisible character; the ordinary blank must of course be an exception.
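
A minimal sketch of such a check, in Python, for illustration only. It is not MediaWiki code: the function names and the set of Unicode categories treated as "invisible" are assumptions, and some scripts legitimately need format characters such as the zero-width joiner, so a real rule would need per-wiki exceptions.

```python
import unicodedata

# Categories treated as "invisible" in this sketch (an assumption):
# Cf = format characters (zero-width space, joiners, soft hyphen),
# Cc = control characters, Zl/Zp = line and paragraph separators,
# Zs = space separators other than the ordinary blank.
INVISIBLE_CATEGORIES = {"Cf", "Cc", "Zl", "Zp", "Zs"}


def find_invisible(title):
    """Return (index, character name) for each invisible character in title."""
    hits = []
    for i, ch in enumerate(title):
        if ch == " ":
            continue  # the ordinary blank is the stated exception
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            hits.append((i, unicodedata.name(ch, "U+%04X" % ord(ch))))
    return hits


def title_is_acceptable(title):
    """Hypothetical gate: accept only titles without invisible characters."""
    return not find_invisible(title)
```

For example, `find_invisible("Karin\u200b Stoiber")` reports a ZERO WIDTH SPACE at index 5, while the plain title "Karin Stoiber" passes.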

The vandalism problem occurred on de-WP. My nickname is tsor; I am an administrator there.

tsor


Version: unspecified
Severity: trivial

Details

Reference
bz2042

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 8:29 PM
bzimport set Reference to bz2042.
bzimport added a subscriber: Unknown Object (MLST).

foenyx wrote:

*** This bug has been marked as a duplicate of 1414 ***

foenyx wrote:

When you said "invisible UTF-8 characters" I thought you were talking about some
UTF-8 whitespace characters, but (as you explained in bug 1414) you are talking
about characters which look like Latin characters (e.g. Greek kappa, which looks
like 'K', or Cyrillic small dze, which looks like 's'; see [[w:en:Homoglyph]]).
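
To see which characters a suspicious title really contains, one can list the code point and Unicode name of each character. A small Python sketch follows; the helper name `describe` is made up for this illustration, and capital kappa (U+039A) is assumed to be the kappa meant above.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The two lookalikes mentioned above, next to their Latin counterparts.
for ch, code, name in describe("K\u039as\u0455"):
    print(code, name)
```

The loop prints LATIN CAPITAL LETTER K, GREEK CAPITAL LETTER KAPPA, LATIN SMALL LETTER S and CYRILLIC SMALL LETTER DZE, although the four characters render as two near-identical pairs in most fonts.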

Well, using non-Latin UTF-8 characters in titles is not a bug, it's a feature.

Some wikis, like fr:, use a lot of non-Latin characters in titles (usually these
redirect to a romanized, normalized title). Moreover, the homoglyph problem
already exists within Latin, with l (small L) and I (capital i): look at
[[w:de:Ill (Elsass)]]. A vandal can create a page "Johannes Paul ll" (imitating
"Johannes Paul II") and most users won't notice.

Since this is somewhat related to the punycode/IDN problem in Firefox 1.0.1,
look at the Mozilla discussions:

We could try what was suggested there:

  • "Measurements of lexical proximity" against older article titles (helped by a
    list of UTF-8 homograph pairs).

  • "Domain letter colouring": highlighting, or tooltips above characters showing
    which Unicode block they belong to. Or we could highlight/warn only on unusual
    UTF-8 characters, but this could require defining a list of frequently used
    characters per wiki.
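
Both ideas can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `HOMOGLYPHS` table is a tiny hand-made sample rather than a real homograph list, all function names are hypothetical, and the "script" label is just the first word of the character's Unicode name, a rough stand-in for its Unicode block.

```python
import unicodedata

# Tiny illustrative homoglyph table (an assumption, far from complete).
HOMOGLYPHS = {
    "\u039a": "K",  # GREEK CAPITAL LETTER KAPPA
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "I": "l",       # capital i and small L look alike in many fonts
}


def skeleton(title):
    """Replace each character by its representative lookalike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in title)


def looks_like(new_title, existing_titles):
    """Return existing titles that differ from new_title but look the same."""
    target = skeleton(new_title)
    return [t for t in existing_titles
            if t != new_title and skeleton(t) == target]


def script_labels(title):
    """Rough per-letter script label: first word of the Unicode name."""
    return [(ch, unicodedata.name(ch, "UNKNOWN").split()[0])
            for ch in title if ch.isalpha()]
```

With this table, `looks_like("\u039aarin Stoiber", ["Karin Stoiber"])` returns `["Karin Stoiber"]`, and `script_labels` on the same title labels its first letter GREEK and the rest LATIN, which is the kind of information a tooltip or colouring scheme could surface.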

I changed the summary of the bug to "utf8 Homoglyph in titles".

avarab wrote:

Moved to the general/Unknown component and changed the severity from major to
trivial; there is an easy workaround available.

(In reply to comment #2)

> Well, using non-Latin UTF-8 characters in titles is not a bug, it's a feature.

Yes, and on those grounds I would originally have suggested a WONTFIX.

(In reply to comment #3)

> Moved to the general/Unknown component and changed the severity from major to
> trivial; there is an easy workaround available.

Yes, you can use AbuseFilter to prevent these sorts of things if vandalism is indeed an issue for your wiki (and I believe en.wikipedia already does some things to this effect). For that reason, I'm going to resolve this FIXED.