Cannot create a username containing a zero width joiner (ZWJ) in languages where the ZWJ makes a visible difference and is required
Open, Public

Description

Author: rathne

Description:
Hi,

We are having some issues with creating users in Sinhala Wikipedia. We are not allowed to create user names like "සසීන්ද්‍ර" and "නන්දිමිත්‍ර".

This looks like it has something to do with modifiers on Sinhala letters. Maybe the zero width joiner (ZWJ) needs to be allowed?

rakaranshaya (්‍ර) is written as: hal kereema + zero width joiner(ZWJ) + ra

Thanks in advance,
/Lee
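
The composition described above (hal kereema + ZWJ + ra) can be inspected code point by code point. The following is a minimal sketch using only the Python standard library; the helper name `describe` and the constant `RAKARANSHAYA` are made up for illustration and are not part of MediaWiki.

```python
import unicodedata

# Hypothetical helper (not a MediaWiki function): list the code points of a
# string together with their Unicode character names.
def describe(text):
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# rakaranshaya spelled out explicitly: hal kereema + ZWJ + ra
RAKARANSHAYA = "\u0DCA\u200D\u0DBB"

for code_point, name in describe(RAKARANSHAYA):
    print(code_point, name)
```

The middle code point is U+200D, the zero width joiner, which is why a username containing this conjunct is rejected by a blanket ZWJ ban.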


Version: unspecified
Severity: enhancement

bzimport set Reference to bz24999.
bzimport created this task. Via Legacy, Aug 31 2010, 2:59 PM
Platonides added a comment. Via Conduit, Aug 31 2010, 7:24 PM

The zero width joiner has been forbidden in usernames since r13007.

Perhaps we could allow it if surrounded by Sinhala characters? :s
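
A rough sketch of what "allow it if surrounded by Sinhala characters" could look like. It assumes that "Sinhala character" means any code point in the Sinhala block (U+0D80 to U+0DFF), which is a simplification; the function name `zwj_ok` is hypothetical and this is not MediaWiki code.

```python
import re

ZWJ = "\u200D"
# Assumption: treat the whole Sinhala block as "Sinhala characters".
SINHALA = "\u0D80-\u0DFF"

# Matches a ZWJ that is NOT preceded by a Sinhala character,
# or a ZWJ that is NOT followed by one.
ZWJ_OUTSIDE_SINHALA = re.compile(
    f"(?<![{SINHALA}]){ZWJ}|{ZWJ}(?![{SINHALA}])"
)

def zwj_ok(name):
    """Return True if every ZWJ in name has a Sinhala character on both sides."""
    return ZWJ_OUTSIDE_SINHALA.search(name) is None

print(zwj_ok("\u0DAD\u0DCA\u200D\u0DBB"))  # ZWJ between Sinhala characters
print(zwj_ok("Cat\u200Drope"))             # ZWJ between Latin letters
```

A rule like this would still let a ZWJ through between two Sinhala letters where it may make no visible difference, so it is only a first approximation.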

bzimport added a comment. Via Conduit, Sep 1 2010, 3:37 AM

rathne wrote:

Is there any reason why these characters are blacklisted?

Platonides added a comment. Via Conduit, Sep 1 2010, 10:24 PM

If you have a user called "Some admin", having another account called "Some admin" but using a non-default space is confusing. Moreover, when trying to block the vandal, you are likely to block the legitimate user instead (or be unable to block anyone, if the account with the normal space doesn't exist).

I think that's what Brion referred to as 'troublemaker characters'.

On the other hand, the request to use නන්දිමිත්‍ර is perfectly reasonable.

Catrope added a comment. Via Conduit, Sep 12 2010, 3:26 PM

(In reply to comment #3)

If you have a user called "Some admin", having another account called "Some
admin" but using a non-default space is confusing. Moreover, when trying to
block the vandal, you are likely to block the legitimate user instead (or be
unable to block anyone, if the account with the normal space doesn't exist).

Don't we have Extension:AntiSpoof for this?

I think that's what Brion referred to as 'troublemaker characters'.

On the other hand, the request to use නන්දිමිත්‍ර is perfectly reasonable.

Most of the banned characters (whitespace, nbsp, control chars) do look like troublemakers, but ZWJ seems perfectly reasonable to me.

Platonides added a comment. Via Conduit, Sep 12 2010, 9:55 PM

Don't we have Extension:AntiSpoof for this?

AntiSpoof is more powerful: it checks for similar-looking characters, blocks mixed scripts...

Most of the banned characters (whitespace, nbsp, control chars) do look like
troublemakers, but ZWJ seems perfectly reasonable to me.

Are you sure? Please compare in your browser [[User:Catrope]] vs [[User:Cat‍rope]]. There's no visual difference in mine.

Catrope added a comment. Via Conduit, Sep 13 2010, 4:54 PM

(In reply to comment #5)

Are you sure? Please compare in your browser [[User:Catrope]] vs
[[User:Cat‍rope]]. There's no visual difference in mine.

That's what we have AntiSpoof for, right? I'm sure there are plenty of characters that look very much like an ASCII 'C'.

Platonides added a comment. Via Conduit, Sep 13 2010, 10:39 PM

You failed. That C is the normal one.
What I did was insert a ZWJ between Cat and rope.
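
The effect of that insertion is easy to see programmatically even when it is invisible on screen. A small illustrative sketch (the variable names are arbitrary):

```python
ZWJ = "\u200D"

plain = "Catrope"
spoofed = "Cat" + ZWJ + "rope"   # ZWJ inserted between "Cat" and "rope"

# The two names usually render identically, but they are different strings.
print(plain == spoofed)          # False
print(len(plain), len(spoofed))  # 7 8

# Removing the ZWJ recovers the original name.
stripped = spoofed.replace(ZWJ, "")
print(stripped == plain)         # True
```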

Catrope added a comment. Via Conduit, Sep 14 2010, 12:13 PM

(In reply to comment #7)

You failed. That C is the normal one.
What I did was insert a ZWJ between Cat and rope.

I knew that; I was just pointing out that there are other ways to construct a username that looks just like 'Catrope' without using ZWJs or other characters currently forbidden in usernames.

Platonides added a comment. Via Conduit, Sep 14 2010, 12:45 PM

Sure you could use [[С]] for writing [[User:Сatrope]], and that would be blocked by AntiSpoof.
The point is, ZWJ should not be allowed in usernames unless the abusive uses remain blocked.
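
To illustrate the mixed-script case: the sketch below guesses the script of each letter from the first word of its Unicode character name. This is a crude approximation for demonstration only; it is not how AntiSpoof works, and the function name `scripts_used` is hypothetical.

```python
import unicodedata

def scripts_used(text):
    """Approximate the set of scripts in text via Unicode character names."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

latin_name = "Catrope"
mixed_name = "\u0421atrope"  # starts with CYRILLIC CAPITAL LETTER ES

print(scripts_used(latin_name))
print(scripts_used(mixed_name))
```

A check of this kind flags the Cyrillic lookalike, but it says nothing about a ZWJ inserted inside an all-Latin name, because the ZWJ is not a letter.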

bzimport added a comment. Via Conduit, Dec 10 2010, 2:51 PM

rathne wrote:

Do we have any update on this?

Platonides added a comment. Via Conduit, Dec 11 2010, 10:03 PM

Lee, can you figure out in which cases a ZWJ makes a visual difference?
I think that's the blocker here. If we can isolate some unambiguous instances of ZWJ, we could try whitelisting them.

Bawolff added a comment. Via Conduit, Dec 12 2010, 5:24 AM

According to Wikipedia, Arabic and most Indic scripts have at least some characters where it makes a visual difference.

From some googling, http://www.unicode.org/reports/tr31/ (section 2.3) seems to have some advice on when and when not to ban ZWJ. (It even gives Perl regexes, but they use fancy features that I don't think PCRE supports.)

http://unicode.org/review/pr-96.html also seems to have some advice (and seems more to the point), but it's unclear what the status of that document is.

bzimport added a comment. Via Conduit, Dec 17 2010, 12:55 PM

rathne wrote:

It looks like I'm going to need some help to answer that question. I'm not enough of an expert in the language. I'll ask around so that someone with the proper knowledge can help here.

santhosh added a comment. Via Conduit, Sep 6 2011, 3:37 PM

According to Unicode Annex 31 (http://www.unicode.org/reports/tr31/), Identifier patterns, as an exception to the usual exclusion of ZWJ is not allowed for certain scripts. That includes Sinhala. But the policy is strict about where and how one can use ZWJ.
Sinhala, many Indian languages and Arabic require ZWJ, which makes a visual difference.
We need to implement UAX31 on top of r13007.
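
One way such a context restriction might be sketched is shown below. It is loosely modelled on the idea of permitting ZWJ only directly after a virama that follows a letter; it is a simplification and not a faithful implementation of the UAX31 rules. It assumes that a virama can be recognised by canonical combining class 9, and the function name `zwj_in_virama_context` is hypothetical.

```python
import unicodedata

ZWJ = "\u200D"
VIRAMA_CCC = 9  # canonical combining class used for virama-type signs

def zwj_in_virama_context(text):
    """Return True if every ZWJ directly follows letter + marks* + virama."""
    for i, ch in enumerate(text):
        if ch != ZWJ:
            continue
        # The character right before the ZWJ must be a virama.
        if i == 0 or unicodedata.combining(text[i - 1]) != VIRAMA_CCC:
            return False
        # Walk back over any combining marks before the virama;
        # the base character found there must be a letter.
        j = i - 2
        while j >= 0 and unicodedata.category(text[j]).startswith("M"):
            j -= 1
        if j < 0 or not unicodedata.category(text[j]).startswith("L"):
            return False
    return True

print(zwj_in_virama_context("\u0DAD\u0DCA\u200D\u0DBB"))  # ta + hal kereema + ZWJ + ra
print(zwj_in_virama_context("Cat\u200Drope"))
```

Because the test relies on character properties rather than a fixed block, the same check would also cover other scripts with a virama, which may or may not be desirable.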

santhosh added a comment. Via Conduit, Sep 6 2011, 3:40 PM

(In reply to comment #14)

According to Unicode Annex 31(http://www.unicode.org/reports/t0/), Identifier
patterns, as an exception to the usual exclusion of ZWJ is not allowed for
certain scripts.

Sorry. Read it as:

According to Unicode Annex 31(http://www.unicode.org/reports/t0/), Identifier patterns, as an exception to the usual exclusion, ZWJ *is allowed* for certain scripts.

Platonides added a comment. Via Conduit, Sep 15 2011, 8:53 PM

The right url seems to be http://unicode.org/reports/tr31/
There are some regular expressions given there; I think they are based on \L{} (Unicode properties). Luckily, we can afford to do some slow things on this code path.

Bawolff added a comment. Via Conduit, Sep 15 2011, 8:58 PM

(In reply to comment #16)

The right url seems to be http://unicode.org/reports/t0/
There are some regular expressions reported, I think they are based on \L{}
(Unicode properties). Luckily, we can do some slow things on this path.

I think the rXXX in the URL is screwing it up with magic revision auto-linking. Let's try http://www.unicode.org/reports/t%7231/

Last time I looked at that page, the regexes relied on the more complex Unicode properties that are supported by Perl but not by PCRE. However, it was still very doable; one just needed to create a fairly large (though not huge) character class by hand.

santhosh placed this task up for grabs. Via Web, Nov 25 2014, 4:38 AM
santhosh set Security to None.
