The Tokenizer strips backslashes from \x
Closed, Resolved · Public
Authored By

	Daimona
	Nov 16 2019, 3:11 PM

Description

See code:

case 'x':
	$chr = substr( $code, $offset + 2, 2 );

	if ( preg_match( '/^[0-9A-Fa-f]{2}$/', $chr ) ) {
		$token .= chr( hexdec( $chr ) );
		// \xXX -- 2 done later
		$offset += 2;
	} else {
		$token .= 'x';
	}
	break;

It tries to parse hex escapes (\xXX) in string literals, which is fine. However, when the two characters after \x are not valid hex digits, it appends 'x' to the token rather than the '\x' it actually found. The two are usually equivalent, but not always, for instance inside regexps: see https://en.wiktionary.org/wiki/Special:AbuseFilter/106.
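To make the difference concrete, here is a minimal, self-contained sketch of the behaviour this task asks for. It is not the merged patch: the function name unescapeHexSketch is hypothetical, and the loop handles only the \x escape, copying everything else through unchanged. The only intended change from the snippet above is the else branch, which keeps the backslash.

```php
<?php
// Hypothetical helper for illustration only; it is not part of AbuseFilter.
// It mirrors the 'x' case quoted above, but keeps the backslash when the
// two characters after \x are not hex digits.
function unescapeHexSketch( string $code ): string {
	$token = '';
	$length = strlen( $code );

	for ( $offset = 0; $offset < $length; $offset++ ) {
		$char = $code[$offset];

		// Assumption: only "\x" is treated as an escape in this sketch.
		if ( $char !== '\\' || $offset + 1 >= $length || $code[$offset + 1] !== 'x' ) {
			$token .= $char;
			continue;
		}

		$chr = substr( $code, $offset + 2, 2 );

		if ( preg_match( '/^[0-9A-Fa-f]{2}$/', $chr ) ) {
			// Valid \xXX: emit the byte, then skip "x" and both hex digits.
			$token .= chr( hexdec( $chr ) );
			$offset += 3;
		} else {
			// Not a valid escape: emit '\x' as found, instead of a bare 'x'.
			$token .= '\\x';
			$offset += 1;
		}
	}

	return $token;
}
```

With this sketch, unescapeHexSketch( '\x41' ) returns 'A', while unescapeHexSketch( '\x{1F600}' ) returns the input unchanged. The branch quoted in the description would turn the latter into 'x{1F600}', which a PCRE-style regex engine reads as a literal x followed by braces rather than as a \x{...} code point escape. Whether the linked filter relies on that particular syntax is an assumption.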

Details

	Subject	Repo	Branch	Lines +/-
	Tokenizer: don't strip backslashes from \x	mediawiki/extensions/AbuseFilter	master	+8 -3

Event Timeline

Daimona created this task. Nov 16 2019, 3:11 PM

Restricted Application added a subscriber: Aklapper. Nov 16 2019, 3:11 PM

Change 551312 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Tokenizer: don't strip backslashes from \x

https://gerrit.wikimedia.org/r/551312

gerritbot added a project: Patch-For-Review. Nov 16 2019, 3:17 PM

Change 551312 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Tokenizer: don't strip backslashes from \x

https://gerrit.wikimedia.org/r/551312

Daimona closed this task as Resolved. Nov 22 2019, 1:48 PM

Daimona removed a project: Patch-For-Review.

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.8; 2019-11-26). Nov 22 2019, 2:01 PM
