Spam Blacklist shouldn't be fooled by similar-looking Unicode characters
Open, Normal, Public

Description

See the URL below. By inserting a U+0EFF (.) instead of a normal dot, the user managed to link to the blacklisted site traditio.ru. And even after I attempted to fix this by explicitly adding this character to the blacklist [http://meta.wikimedia.org/w/index.php?diff=863226], it does not seem to work [http://meta.wikimedia.org/w/index.php?diff=863233].


Version: unspecified
Severity: enhancement
URL: http://en.wikipedia.org/w/index.php?title=User:Afinogenoff&oldid=186525169

Details

Reference
bz12896
bzimport raised the priority of this task to Normal.
bzimport set Reference to bz12896.
MaxSem created this task. Feb 3 2008, 6:36 PM
vvv added a comment. Feb 3 2008, 7:00 PM

Fixed in r30482

ayg wrote:

The fix seems a little narrow. What's the underlying reason that the exploit worked? U+0EFF can't be the only character that browsers will treat as a period in URLs.

vvv added a comment. Apr 24 2008, 2:01 PM

(In reply to comment #2)

The fix seems a little narrow. What's the underlying reason that the exploit worked? U+0EFF can't be the only character that browsers will treat as a period in URLs.

I think we need some form of UTF normalization.

Indeed, there are far more ways: http://meta.wikimedia.org/w/index.php?oldid=1319535

Unicode normalisation again.

mike.lifeguard+bugs wrote:

(In reply to comment #4)

Indeed, there are far more ways: http://meta.wikimedia.org/w/index.php?oldid=1319535

Unicode normalisation again.

I'm not really sure what I'm supposed to be seeing at that oldid.

That said, Unicode normalization is really needed. We're doing so in some monitoring tools, but of course it's needed in the blacklist as well.

mike.lifeguard+bugs wrote:

better summary

ayg wrote:

"Unicode normalization" is a poor term to use for the problem involved here, since all the characters involved are already normalized by the definitions of the Unicode standard (they're NFC, to be precise). Adjusted summary.

C933103 added a subscriber: C933103. Edited Sep 19 2018, 6:52 PM

https://zh.wikipedia.org/w/index.php?title=Wikipedia:互助客栈/技术&diff=51348491&oldid=51348410
All of these characters can be typed into Wikipedia as part of a URL, bypassing the blacklist; browsers then accept them, convert them into basic ASCII alphanumeric characters, and send users to the blacklisted page.
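To make the failure mode concrete, here is a minimal sketch (the traditio.ru pattern stands in for a real blacklist entry, and U+3002 plays the look-alike dot):

<?php
// Sketch: a blacklist pattern keyed on the ASCII hostname misses the
// look-alike spelling, even though browsers resolve both to the same host.
$pattern = '/traditio\.ru/iu';

var_dump( preg_match( $pattern, 'http://traditio.ru/wiki' ) );        // int(1): blocked
var_dump( preg_match( $pattern, "http://traditio\u{3002}ru/wiki" ) ); // int(0): slips through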

seth added a subscriber: seth. Nov 14 2018, 9:38 PM

A decade-old ticket reporting a backdoor to the spam blacklist? Remarkable.

So a quick hack fix would be to add a line $host = UtfNormal\Validator::toNFKC( $host ); in includes/parser/Sanitizer.php, around line 2047 in Sanitizer::cleanUrl(). This is probably going too far, though...
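As a rough illustration (a sketch only, not the actual Sanitizer code; the helper name is made up, and it assumes the wikimedia/utfnormal Composer package):

<?php
// Sketch only -- not the real Sanitizer::cleanUrl(). Assumes the
// wikimedia/utfnormal package; normalizeHost() is a hypothetical name.
require_once 'vendor/autoload.php';

// NFKC folds compatibility look-alikes (e.g. fullwidth letters) to
// their ASCII forms before the blacklist regexes see the host.
function normalizeHost( string $host ): string {
    return UtfNormal\Validator::toNFKC( $host );
}

// Fullwidth "ｔｒａｄｉｔｉｏ.ru" collapses to "traditio.ru".
echo normalizeHost( "ｔｒａｄｉｔｉｏ.ru" ), "\n";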

A more correct fix would implement the algorithm in https://tools.ietf.org/html/rfc5895

Some characters from the list linked above, like "ß", "。", and "。", would not normalize to regular ASCII characters even when NFKC normalization is applied, despite the fact that browsers can still identify them and convert them into ASCII. And that list was not exhaustive, so more characters could escape NFKC normalization. Especially noteworthy: the two ideographic full stops are treated as a dot by browsers, so they can still be used to bypass the blacklisting of almost every URL on the spam blacklist.
As for longer-term solutions: browsers may not fully follow the standardized approach in the RFCs and may have their own ways of normalizing links for users, so the behavior of each browser may need to be investigated.
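A quick check of that claim (a sketch, assuming PHP's intl extension): NFKC leaves U+3002 alone, while UTS #46 IDN processing folds it to a plain dot.

<?php
// Sketch, assuming the intl extension: NFKC does not touch the
// ideographic full stop U+3002, but UTS #46 IDN processing folds it.
$host = "traditio\u{3002}ru";

echo Normalizer::normalize( $host, Normalizer::FORM_KC ), "\n"; // traditio。ru (unchanged)
echo idn_to_utf8(
    idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 ),
    IDNA_DEFAULT,
    INTL_IDNA_VARIANT_UTS46
), "\n"; // traditio.ru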

A decade-old ticket reporting a backdoor to the spam blacklist? Remarkable.

This is nothing new, sadly. There are a lot of decade-old security bugs lying around. No team at the Foundation has enough resources to clear out its backlog, and Security is no exception.

Some characters from the list linked above, like "ß", "。", and "。", would not normalize to regular ASCII characters even when NFKC normalization is applied, despite the fact that browsers can still identify them and convert them into ASCII.

"ß" does not get normalized to ASCII in IDNA2008 (it does in IDNA2003). See https://www.unicode.org/reports/tr46/tr46-21.html#Deviations
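For what it's worth, PHP's intl functions expose both behaviours (a sketch; faß.de is the example deviation domain from UTS #46):

<?php
// Sketch, assuming the intl extension: "ß" is a UTS #46 deviation
// character, so the two IDNA processing modes disagree on it.

// Transitional processing (IDNA2003-compatible): ß folds to "ss".
echo idn_to_ascii( "faß.de", IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 ), "\n";
// fass.de

// Nontransitional processing (IDNA2008): ß is kept and Punycode-encoded.
echo idn_to_ascii( "faß.de", IDNA_NONTRANSITIONAL_TO_ASCII, INTL_IDNA_VARIANT_UTS46 ), "\n";
// xn--fa-hia.de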


PHP apparently has functions to handle IDN, so this becomes easier. idn_to_utf8( idn_to_ascii( $s ) ) should normalize things the way we want, AFAICT.
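Roughly like this (a sketch; it assumes the intl extension, and normalizeIdn() is a made-up helper name):

<?php
// Sketch of the round-trip suggested above; assumes the intl extension.
function normalizeIdn( string $host ): string {
    // idn_to_ascii() applies UTS #46 processing (lowercasing, folding
    // of look-alike dots and fullwidth characters, Punycode encoding);
    // idn_to_utf8() decodes the result back to a canonical Unicode form.
    $ascii = idn_to_ascii( $host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
    if ( $ascii === false ) {
        return $host; // not a valid IDN; let the blacklist see it as-is
    }
    $utf8 = idn_to_utf8( $ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46 );
    return $utf8 === false ? $ascii : $utf8;
}

// "ｔｒａｄｉｔｉｏ。ru" and "traditio.ru" now normalize identically.
var_dump( normalizeIdn( "ｔｒａｄｉｔｉｏ\u{3002}ru" ) ); // string(11) "traditio.ru"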

Change 475788 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Normalize IDN urls before processing blacklists

https://gerrit.wikimedia.org/r/475788

Yes, that could be a browser-specific conversion, as my Chrome browser converts "ß" into "ss". Then again, it shows that it is necessary to look at browser implementations of normalization instead of just the standards.

jrbs moved this task from Backlog to Other team on the Trust-and-Safety board. Wed, Nov 28, 11:45 PM