The Tokenizer strips backslashes from \x
Closed, Resolved (Public)

Description

See code:

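case 'x':
	// Candidate hex digits: the two characters following "\x"
	$chr = substr( $code, $offset + 2, 2 );

	if ( preg_match( '/^[0-9A-Fa-f]{2}$/', $chr ) ) {
		$token .= chr( hexdec( $chr ) );
		// Skip the two hex digits of \xXX here; the other 2 characters are skipped later
		$offset += 2;
	} else {
		// Not a valid \xXX escape: only 'x' is appended, so the backslash is lost
		$token .= 'x';
	}
	break;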

The tokenizer tries to parse hex escapes (\xXX) in string literals, and that's OK. However, when the two characters after \x are not valid hex digits, it appends 'x' to the string, and not '\x' as found in the source. This is usually equivalent, but not always. For instance, inside regexps the backslash is significant: see https://en.wiktionary.org/wiki/Special:AbuseFilter/106.
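To illustrate why the lost backslash matters inside a regexp, here is a minimal standalone sketch. The function `unescapeHexAt` and its `$keepBackslash` flag are hypothetical stand-ins for the branch quoted above, not the extension's actual API and not necessarily how the merged patch is written. The pattern `\x{41}` is an assumed example, not taken from the linked filter.

```php
<?php
// Hypothetical helper mirroring the quoted 'x' branch. $offset points at the
// backslash of "\x" in $code. $keepBackslash selects the reported behaviour
// (false) or the behaviour this task asks for (true).
function unescapeHexAt( string $code, int $offset, bool $keepBackslash ): string {
	// Candidate hex digits: the two characters following "\x"
	$chr = substr( $code, $offset + 2, 2 );

	if ( preg_match( '/^[0-9A-Fa-f]{2}$/', $chr ) ) {
		// Valid \xXX escape: emit the corresponding byte
		return chr( hexdec( $chr ) );
	}

	// Not a valid \xXX escape: emit the characters literally
	return $keepBackslash ? '\\x' : 'x';
}

// Source text of the literal: \x{41} (a PCRE code point escape for "A").
// "{4" is not two hex digits, so the fallback branch is taken.
$source = '\\x{41}';
$rest = substr( $source, 2 ); // "{41}", copied through unchanged

$stripped = unescapeHexAt( $source, 0, false ) . $rest; // x{41}
$kept = unescapeHexAt( $source, 0, true ) . $rest;      // \x{41}

// With the backslash kept, the pattern matches the character U+0041.
$keptMatchesA = preg_match( '/' . $kept . '/u', 'A' );
// With the backslash stripped, the pattern means "x repeated 41 times".
$strippedMatchesA = preg_match( '/' . $stripped . '/u', 'A' );
$strippedMatchesXs = preg_match( '/' . $stripped . '/u', str_repeat( 'x', 41 ) );
```

In this sketch the stripped form silently changes what the regexp matches rather than failing, which is why the difference is easy to miss.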

Event Timeline

Change 551312 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Tokenizer: don't strip backslashes from \x

https://gerrit.wikimedia.org/r/551312

Change 551312 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Tokenizer: don't strip backslashes from \x

https://gerrit.wikimedia.org/r/551312