
Spam Blacklist shouldn't be fooled by similar-looking Unicode characters
Open, Medium, Public, Feature Request

Description

See the URL above. By inserting a U+0EFF (.) instead of a normal dot, the user managed to link to the blacklisted site traditio.ru. Even after I attempted to fix this by explicitly adding this character to the blacklist [http://meta.wikimedia.org/w/index.php?diff=863226], it does not seem to work [http://meta.wikimedia.org/w/index.php?diff=863233].


Version: unspecified
Severity: enhancement
URL: http://en.wikipedia.org/w/index.php?title=User:Afinogenoff&oldid=186525169

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:04 PM
bzimport added a project: SpamBlacklist.
bzimport set Reference to bz12896.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

The fix seems a little narrow. What's the underlying reason that the exploit worked? U+0EFF can't be the only character that browsers will treat as a period in URLs.

(In reply to comment #2)

The fix seems a little narrow. What's the underlying reason that the exploit
worked? U+0EFF can't be the only character that browsers will treat as a
period in URLs.

I think we need some form of UTF normalization.

mike.lifeguard+bugs wrote:

(In reply to comment #4)

Indeed, there are far more ways:
http://meta.wikimedia.org/w/index.php?oldid=1319535

Unicode normalisation again.

I'm not really sure what I'm supposed to be seeing at that oldid.

That said, Unicode normalization is really needed. We already do this in some monitoring tools, but it's needed in the blacklist as well.

mike.lifeguard+bugs wrote:

better summary

ayg wrote:

"Unicode normalization" is a poor term to use for the problem involved here, since all the characters involved are already normalized by the definitions of the Unicode standard (they're NFC, to be precise). Adjusted summary.

https://zh.wikipedia.org/w/index.php?title=Wikipedia:互助客栈/技术&diff=51348491&oldid=51348410
All of these characters can be typed into Wikipedia as part of a URL, bypassing the blacklist; browsers then accept them, convert them into basic ASCII alphanumeric characters, and send users to the blacklisted page.
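For illustration, a minimal sketch of the mechanism (not SpamBlacklist's actual code; the pattern and host are made up for the example, and ext-intl is assumed):

```php
<?php
// A naive regex applied to the raw wikitext never matches the
// spoofed host, while IDN processing -- roughly what browsers do --
// maps the lookalike separator back to an ASCII dot.

$blacklistPattern = '/traditio\.ru/i';

// Host spoofed with U+3002 IDEOGRAPHIC FULL STOP instead of ".":
$spoofed = "http://traditio\u{3002}ru/";
var_dump( (bool)preg_match( $blacklistPattern, $spoofed ) ); // bool(false)

// What the browser actually resolves:
var_dump( idn_to_ascii( "traditio\u{3002}ru", IDNA_DEFAULT,
	INTL_IDNA_VARIANT_UTS46 ) ); // string(11) "traditio.ru"
```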

A decade-old ticket reporting a backdoor to the spam blacklist? Remarkable.

So a quick hack fix would be to add a line in includes/parser/Sanitizer.php (around line 2047, in Sanitizer::cleanUrl()): $host = UtfNormal\Validator::toNFKC( $host );. This is probably going too far, though...
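Roughly what that folding buys, as a standalone sketch (not the real cleanUrl() body), using the wikimedia/utfnormal library that MediaWiki bundles:

```php
<?php
// Standalone sketch of the proposed NFKC folding.
// Setup assumption: composer require wikimedia/utfnormal
require_once 'vendor/autoload.php';

// Host spoofed with U+FF0E FULLWIDTH FULL STOP:
$host = "traditio\u{FF0E}ru";

// NFKC folds many compatibility characters back to ASCII, so the
// existing blacklist regex for "traditio.ru" would match again:
$host = UtfNormal\Validator::toNFKC( $host );
var_dump( $host ); // string(11) "traditio.ru"
```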

A more correct fix would implement the algorithm in https://tools.ietf.org/html/rfc5895

Some characters from the list linked above, such as "ß", "。", and "。", do not normalize to regular ASCII even when NFKC normalization is applied, yet browsers can still identify them and convert them to ASCII. And the list was not exhaustive, so more characters could escape NFKC normalization. Especially noteworthy: browsers treat the two ideographic full stops as a dot, so they can still be used to bypass the blacklist entry for almost any URL on the spam blacklist.
As for longer-term solutions, browsers may not fully follow the standardized behavior in the RFCs and may have their own ways of normalizing links for users, so the behavior of each browser may need to be investigated.
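To make the gap concrete, a sketch (assuming ext-intl and wikimedia/utfnormal) contrasting NFKC with the UTS46/IDN mapping on those two dots:

```php
<?php
// NFKC leaves the ideographic full stops non-ASCII (U+3002 has no
// compatibility decomposition; U+FF61 folds to U+3002, not "."),
// while UTS46/IDN mapping -- what browsers apply -- treats both as
// label separators.
require_once 'vendor/autoload.php';

$dots = [
	'U+3002' => "\u{3002}", // IDEOGRAPHIC FULL STOP
	'U+FF61' => "\u{FF61}", // HALFWIDTH IDEOGRAPHIC FULL STOP
];

foreach ( $dots as $name => $dot ) {
	$host = "traditio{$dot}ru";
	$nfkc = UtfNormal\Validator::toNFKC( $host );
	$idn = idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
	printf( "%s: NFKC gives ASCII? %s; IDN gives %s\n",
		$name, $nfkc === 'traditio.ru' ? 'yes' : 'no', $idn );
}
// U+3002: NFKC gives ASCII? no; IDN gives traditio.ru
// U+FF61: NFKC gives ASCII? no; IDN gives traditio.ru
```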

A decade-old ticket reporting a backdoor to the spam blacklist? Remarkable.

This is nothing new, sadly. There are a lot of decade-old security bugs lying around. No team at the Foundation has enough resources to clear out its backlog, and Security is no exception.

Some characters from the list linked above, such as "ß", "。", and "。", do not normalize to regular ASCII even when NFKC normalization is applied, yet browsers can still identify them and convert them to ASCII.

ß does not get normalized to ASCII in IDNA2008 (it does in IDNA2003). See https://www.unicode.org/reports/tr46/tr46-21.html#Deviations
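PHP's intl extension can show the deviation directly (a sketch; transitional UTS46 processing mimics IDNA2003, nontransitional mimics IDNA2008):

```php
<?php
// The IDNA2003-vs-IDNA2008 deviation for "ß". Requires ext-intl.
$domain = "fa\u{00DF}.de"; // faß.de

// Transitional processing (IDNA2003-compatible): ß folds to "ss".
var_dump( idn_to_ascii( $domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 ) );
// string(7) "fass.de"

// Nontransitional processing (IDNA2008): ß stays a distinct label.
var_dump( idn_to_ascii( $domain, IDNA_NONTRANSITIONAL_TO_ASCII,
	INTL_IDNA_VARIANT_UTS46 ) );
// string(13) "xn--fa-hia.de"
```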


PHP apparently has functions to handle IDN, so this becomes easier. idn_to_utf8( idn_to_ascii( $s ) ) should normalize things the way we want, AFAICT.
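A minimal sketch of that round trip (the helper name normalizeHostForBlacklist is hypothetical, for illustration only; requires ext-intl):

```php
<?php
// Error handling is minimal here.
function normalizeHostForBlacklist( string $host ): ?string {
	// To ASCII first: applies the IDN mapping (lookalike dots to ".",
	// fullwidth letters to ASCII, lowercasing) and punycodes the rest.
	$ascii = idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
	if ( $ascii === false ) {
		return null; // not convertible to a hostname
	}
	// Back to UTF-8 for a canonical Unicode form to match against.
	$utf8 = idn_to_utf8( $ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
	return $utf8 === false ? null : $utf8;
}

var_dump( normalizeHostForBlacklist( "TRADITIO\u{FF0E}ru" ) );
// string(11) "traditio.ru"
```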

Change 475788 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Normalize IDN urls before processing blacklists

https://gerrit.wikimedia.org/r/475788

Yes, that could be a browser-specific conversion, as my Chrome browser converts "ß" into "ss". Then again, it shows that it is necessary to look at browser implementations of normalization instead of just the standards.

Change 475788 had a related patch set uploaded (by Aklapper; owner: Brian Wolff):
[mediawiki/core@master] Normalize IDN urls before processing blacklists

https://gerrit.wikimedia.org/r/475788

This makes me wonder if SpamBlacklist could utilize Equivset somehow, but maybe T14896#4774527 is sufficient?
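A hedged sketch of the Equivset idea, assuming the wikimedia/equivset package and its Equivset::normalize() API (exact canonical forms depend on the equivset data version):

```php
<?php
// Equivset folds visually equivalent characters to one canonical
// form, so a Cyrillic "о" can no longer masquerade as a Latin "o".
// Setup assumption: composer require wikimedia/equivset
require_once 'vendor/autoload.php';

use Wikimedia\Equivset\Equivset;

$equivset = new Equivset();

// Host spoofed with Cyrillic "о" (U+043E):
$spoofed = "traditi\u{043E}.ru";

// Compare canonical forms instead of raw strings:
var_dump( $equivset->normalize( $spoofed ) ===
	$equivset->normalize( 'traditio.ru' ) ); // bool(true)
```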

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM