Abuse filter regex \b considers unicode characters as word boundaries
Closed, DeclinedPublic

Description

Author: delbu9c1

Description:
In analyzing a false positive, I've been trying to track down the reason my regex debugger says a regex doesn't match yet it does match on the abuse filter. Eventually I found what appears to be a good lead on the issue.

Details of the incorrect match are here: http://test.wikipedia.org/w/index.php?title=Special:AbuseLog&details=1784

It appears what's going on is the é (which appears to be encoded in UTF-8) is mishandled when testing against the regex. It appears to the regex engine as a word boundary, so the match succeeds (specifically, "\brence\b" matches "conférence").

Hopefully there's a way to correct this and it's not a problem in the heart of PHP instead.

Please let me know if you need any additional information.

  • Shirik @ enwiki

Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46773

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz22761.
bzimport created this task.Via LegacyMar 8 2010, 3:28 AM
Arkanosis added a comment.Via ConduitJun 24 2010, 10:31 AM

Hello,

We have the same problem at frwiki: accented letters cause many false positives because they are recognized as word boundaries.

Best regards,

Arkanosis @ frwiki

He7d3r added a comment.Via ConduitMar 24 2011, 1:41 AM

Same problem on ptwikibooks, where a '\bpica\b' matched 'típica'.

He7d3r added a comment.Via ConduitApr 28 2011, 2:30 PM

(In reply to comment #2)

Same problem on ptwikibooks, where a '\bpica\b' matched 'típica'.

and also 'hípica' when this edit was made last week:
http://pt.wikibooks.org/?diff=prev&oldid=218088
This forced us have to use something like [^A-ZÁÀÂÃÇÉÊẼÍÓÒÔÕQ̃ÚŰÜŨ015] in a regex, so that it doesn't match accented letters

He7d3r added a comment.Via ConduitApr 28 2011, 2:32 PM

(In reply to comment #3)

[^A-ZÁÀÂÃÇÉÊẼÍÓÒÔÕQ̃ÚŰÜŨ015]

Where "015" is there as an workaround to Bug 27987.

werdna added a comment.Via ConduitFeb 10 2012, 11:49 PM

The /u flag is definitely there, as far as I can see.

public static function keywordRegex( $str, $regex, $pos, $insensitive = false ) {

		$str = $str->toString();
		$pattern = $regex->toString();

		$pattern = preg_replace( '!(\\\\\\\\)*(\\\\)?/!', '$1\/', $pattern );
		$pattern = "/$pattern/u";

		if( $insensitive ) {
			$pattern .= 'i';
		}

		$handler = new AFPRegexErrorHandler( $pattern, $pos );
		try {
			$handler->install();
			$result = preg_match( $pattern, $str );
			$handler->restore();
		} catch ( Exception $e ) {
			$handler->restore();
			throw $e;
		}
		return new AFPData( self::DBool, (bool)$result );

}

bzimport added a comment.Via ConduitJan 31 2013, 10:54 PM

delirium wrote:

Is this still happening? My understanding is that it was a PHP behavior (possibly a "bug" depending on your interpretation) that was changed in more recent versions.

My understanding of the problem:

PHP passes its regexes to PCRE, and the traditional PCRE behavior is that the "traditional" character class \w is defined as just [A-Za-z0-9_], even when matching a UTF8 string. The word boundary class \b is defined based on \w, so behaves similarly. If you wanted the Unicode notion of "letter character", there are instead Unicode character classes, such as \pL. The /u switch in PHP just tells PCRE it's matching a Unicode string, but didn't change the definition of the legacy character classes.

But, PCRE added a switch 'PCRE_UCP' in mid-2010, which if set makes the traditional character classes into aliases for morally equivalent Unicode character classes. That should produce something closer to the expected behavior, at least in our case. From what I can find, this PCRE switch is enabled when /u is specified in newer versions of PHP, starting with 5.3.4, which came out in late 2010. I assume Wikimedia must be using a sufficiently new version by now?

bzimport added a comment.Via ConduitJan 31 2013, 11:01 PM

delirium wrote:

*** Bug 36129 has been marked as a duplicate of this bug. ***

Arkanosis added a comment.Via ConduitFeb 1 2013, 9:46 AM

(In reply to comment #6)

Is this still happening? My understanding is that it was a PHP behavior
(possibly a "bug" depending on your interpretation) that was changed in more
recent versions.

<snip>

PCRE added a switch 'PCRE_UCP' in mid-2010, which if set makes the
traditional character classes into aliases for morally equivalent Unicode
character classes. That should produce something closer to the expected
behavior, at least in our case. From what I can find, this PCRE switch is
enabled when /u is specified in newer versions of PHP, starting with 5.3.4,
which came out in late 2010. I assume Wikimedia must be using a sufficiently
new version by now?

Indeed, I haven't managed to reproduce the bug this morning.
I'm marking the bug as RESOLVED FIXED but don't hesitate to reopen if it's not OK for you.

Thanks Mark for the detailed explanation!

Nemo_bis added a comment.Via ConduitFeb 5 2013, 10:16 AM

Surely it's not FIXED if the behaviour changed in PHP -> WORKSFORME.
WMF wikis are currently using PHP 5.3.10-1ubuntu3.4+wmf1, by the way.

Arkanosis added a comment.Via ConduitFeb 5 2013, 10:29 AM

It's fixed somehow by updating PHP (it didn't workforme back in 2010, now it does). But I guess this is of little importance.

Thanks for the info on the PHP version :)

Add Comment