Spam Blacklist shouldn't be fooled by similar-looking Unicode characters
Open, Normal, Public

Description

See the URL below. By inserting a U+0EFF (.) instead of a normal dot, the user managed to link to the blacklisted site traditio.ru. And even after I attempted to fix this by explicitly adding this character to the blacklist [http://meta.wikimedia.org/w/index.php?diff=863226], it does not seem to work [http://meta.wikimedia.org/w/index.php?diff=863233].


Version: unspecified
Severity: enhancement
URL: http://en.wikipedia.org/w/index.php?title=User:Afinogenoff&oldid=186525169

Details

Reference
bz12896
bzimport raised the priority of this task to Normal.
bzimport set Reference to bz12896.
MaxSem created this task. Feb 3 2008, 6:36 PM
vvv added a comment. Feb 3 2008, 7:00 PM

Fixed in r30482

ayg wrote:

The fix seems a little narrow. What's the underlying reason that the exploit worked? U+0EFF can't be the only character that browsers will treat as a period in URLs.

vvv added a comment. Apr 24 2008, 2:01 PM

(In reply to comment #2)

The fix seems a little narrow. What's the underlying reason that the exploit worked? U+0EFF can't be the only character that browsers will treat as a period in URLs.

I think we need some form of UTF normalization.

Indeed, there are far more ways: http://meta.wikimedia.org/w/index.php?oldid=1319535

Unicode normalisation again.

mike.lifeguard+bugs wrote:

(In reply to comment #4)

Indeed, there are far more ways: http://meta.wikimedia.org/w/index.php?oldid=1319535

Unicode normalisation again.

I'm not really sure what I'm supposed to be seeing at that oldid.

That said, Unicode normalization is really needed. We're doing so in some monitoring tools, but of course it's needed in the blacklist as well.

mike.lifeguard+bugs wrote:

better summary

ayg wrote:

"Unicode normalization" is a poor term to use for the problem involved here, since all the characters involved are already normalized by the definitions of the Unicode standard (they're NFC, to be precise). Adjusted summary.

C933103 added a subscriber: C933103. Edited Sep 19 2018, 6:52 PM

https://zh.wikipedia.org/w/index.php?title=Wikipedia:互助客栈/技术&diff=51348491&oldid=51348410
All of these characters can be typed into Wikipedia as part of a URL, bypassing the blacklist; browsers then accept them, convert them into basic ASCII alphanumeric characters, and send users to the blacklisted page.
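To make the failure mode concrete, here is a minimal sketch (the traditio.ru pattern stands in for a real blacklist entry, and U+3002 plays the look-alike dot):

<?php
// Sketch: a blacklist pattern keyed on the ASCII hostname misses the
// look-alike spelling, even though browsers resolve both to the same host.
$pattern = '/traditio\.ru/iu';

var_dump( preg_match( $pattern, 'http://traditio.ru/wiki' ) );        // int(1): blocked
var_dump( preg_match( $pattern, "http://traditio\u{3002}ru/wiki" ) ); // int(0): slips through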

seth added a subscriber: seth. Nov 14 2018, 9:38 PM

A decade-old ticket reporting a backdoor to the spam blacklist? Remarkable.

So a quick hack fix would be to add a line $host = UtfNormal\Validator::toNFKC( $host ); in includes/parser/Sanitizer.php, around line 2047 in Sanitizer::cleanUrl(). This is probably going too far, though...
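As a rough illustration (a sketch only, not the actual Sanitizer code; the helper name is made up, and it assumes the wikimedia/utfnormal Composer package):

<?php
// Sketch only -- not the real Sanitizer::cleanUrl(). Assumes the
// wikimedia/utfnormal package; normalizeHost() is a hypothetical name.
require_once 'vendor/autoload.php';

// NFKC folds compatibility look-alikes (e.g. fullwidth letters) to
// their ASCII forms before the blacklist regexes see the host.
function normalizeHost( string $host ): string {
    return UtfNormal\Validator::toNFKC( $host );
}

// Fullwidth "ｔｒａｄｉｔｉｏ.ru" collapses to "traditio.ru".
echo normalizeHost( "ｔｒａｄｉｔｉｏ.ru" ), "\n";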

A more correct fix would implement the algorithm in https://tools.ietf.org/html/rfc5895

Some characters from the list linked above, like "ß", "。", and "。", would not normalize to regular ASCII characters even when NFKC normalization is applied, despite the fact that browsers can still identify them and convert them into ASCII. And that list was not exhaustive, so more characters could escape NFKC normalization. Especially noteworthy: the two ideographic full stops are treated as a dot by browsers, so they can still be used to bypass the blacklisting of almost every URL on the spam blacklist.
As for longer-term solutions: browsers may not fully follow the standardized approach in the RFCs and may have their own ways of normalizing links for users, so the behavior of each browser may need to be investigated.
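A quick check of that claim (a sketch, assuming PHP's intl extension): NFKC leaves U+3002 alone, while UTS #46 IDN processing folds it to a plain dot.

<?php
// Sketch, assuming the intl extension: NFKC does not touch the
// ideographic full stop U+3002, but UTS #46 IDN processing folds it.
$host = "traditio\u{3002}ru";

echo Normalizer::normalize( $host, Normalizer::FORM_KC ), "\n"; // traditio。ru (unchanged)
echo idn_to_utf8(
    idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 ),
    IDNA_DEFAULT,
    INTL_IDNA_VARIANT_UTS46
), "\n"; // traditio.ru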

A decade-old ticket reporting a backdoor to the spam blacklist? Remarkable.

This is nothing new, sadly. There are a lot of decade-old security bugs lying around. No team at the Foundation has enough resources to clear out its backlog, and Security is no exception.

Some characters from the list linked above, like "ß", "。", and "。", would not normalize to regular ASCII characters even when NFKC normalization is applied, despite the fact that browsers can still identify them and convert them into ASCII.

"ß" does not get normalized to ASCII in IDNA2008 (it does in IDNA2003). See https://www.unicode.org/reports/tr46/tr46-21.html#Deviations
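For what it's worth, PHP's intl functions expose both behaviours (a sketch; faß.de is the example deviation domain from UTS #46):

<?php
// Sketch, assuming the intl extension: "ß" is a UTS #46 deviation
// character, so the two IDNA processing modes disagree on it.

// Transitional processing (IDNA2003-compatible): ß folds to "ss".
echo idn_to_ascii( "faß.de", IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 ), "\n";
// fass.de

// Nontransitional processing (IDNA2008): ß is kept and Punycode-encoded.
echo idn_to_ascii( "faß.de", IDNA_NONTRANSITIONAL_TO_ASCII, INTL_IDNA_VARIANT_UTS46 ), "\n";
// xn--fa-hia.de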


PHP apparently has functions to handle IDN, so this becomes easier. idn_to_utf8( idn_to_ascii( $s ) ) should normalize things the way we want, AFAICT.
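Roughly like this (a sketch; it assumes the intl extension, and normalizeIdn() is a made-up helper name):

<?php
// Sketch of the round-trip suggested above; assumes the intl extension.
function normalizeIdn( string $host ): string {
    // idn_to_ascii() applies UTS #46 processing (lowercasing, folding
    // of look-alike dots and fullwidth characters, Punycode encoding);
    // idn_to_utf8() decodes the result back to a canonical Unicode form.
    $ascii = idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
    if ( $ascii === false ) {
        return $host; // not a valid IDN; let the blacklist see it as-is
    }
    $utf8 = idn_to_utf8( $ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
    return $utf8 === false ? $ascii : $utf8;
}

// "ｔｒａｄｉｔｉｏ。ru" and "traditio.ru" now normalize identically.
var_dump( normalizeIdn( "ｔｒａｄｉｔｉｏ\u{3002}ru" ) ); // string(11) "traditio.ru"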

Change 475788 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Normalize IDN urls before processing blacklists

https://gerrit.wikimedia.org/r/475788

Yes, that could be a browser-specific conversion, as my Chrome browser converts "ß" into "ss". Then again, it shows that it is necessary to look at browser implementations of normalization instead of just the standards.

jrbs moved this task from Backlog to Other team on the Trust-and-Safety board. Wed, Nov 28, 11:45 PM