
HHVM and PCRE v8.31 gives incorrect results for certain PCRE patterns
Closed, Resolved (Public)

Description

The following call

preg_match( '/[^\p{S}\p{Nd}]/us', '4' )

correctly returns 0 on Zend but returns 1 under HHVM on MediaWiki appservers such as mw1017. It does not seem to occur on osmium.

The same occurs with various other permutations:

preg_match( '/[^\p{P}\p{Nd}]/us', '4' )
preg_match( '/[^\p{Z}\p{Nd}]/us', '4' )
preg_match( '/[^\p{Xps}\p{Nd}]/us', '4' )
preg_match( '/[^\p{Xsp}\p{Nd}]/us', '4' )
preg_match( '/[^\p{M}\p{Ll}]/us', 'a' )
preg_match( '/[^\p{N}\p{Ll}]/us', 'a' )
preg_match( '/[^\p{P}\p{Ll}]/us', 'a' )
preg_match( '/[^\p{S}\p{Ll}]/us', 'a' )
preg_match( '/[^\p{Z}\p{Ll}]/us', 'a' )
preg_match( '/[^\p{Xps}\p{Ll}]/us', 'a' )
preg_match( '/[^\p{Xsp}\p{Ll}]/us', 'a' )
preg_match( '/[^\p{M}\p{Lu}]/us', 'A' )
preg_match( '/[^\p{N}\p{Lu}]/us', 'A' )
preg_match( '/[^\p{P}\p{Lu}]/us', 'A' )
preg_match( '/[^\p{S}\p{Lu}]/us', 'A' )
preg_match( '/[^\p{Z}\p{Lu}]/us', 'A' )
preg_match( '/[^\p{Xps}\p{Lu}]/us', 'A' )
preg_match( '/[^\p{Xsp}\p{Lu}]/us', 'A' )
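
The calls listed above can be checked in one pass. This is only a sketch: it assumes every listed call should return 0 (the subject always has the second property in the negated class), and the variable names are made up for illustration.

```php
<?php
// Sketch only: rebuilds the patterns listed above and collects any call
// that does not return 0. Variable names are illustrative.
$groups = [
    // second property => [subject, first properties from the list above]
    'Nd' => ['4', ['S', 'P', 'Z', 'Xps', 'Xsp']],
    'Ll' => ['a', ['M', 'N', 'P', 'S', 'Z', 'Xps', 'Xsp']],
    'Lu' => ['A', ['M', 'N', 'P', 'S', 'Z', 'Xps', 'Xsp']],
];

$failures = [];
foreach ($groups as $second => $group) {
    list($subject, $firsts) = $group;
    foreach ($firsts as $first) {
        $pattern = '/[^\p{' . $first . '}\p{' . $second . '}]/us';
        // The subject has property $second, so the negated class must not match.
        if (preg_match($pattern, $subject) !== 0) {
            $failures[] = $pattern;
        }
    }
}

echo $failures
    ? "Incorrect results for:\n" . implode("\n", $failures) . "\n"
    : "All patterns returned 0\n";
```

On an affected host the script should list the failing patterns; elsewhere it should report that all patterns returned 0.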

Version: wmf-deployment
Severity: normal

Details

Reference
bz71922

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 3:45 AM)
bzimport set Reference to bz71922.
hashar updated the task description.

This isn't obviously related to the PCRE caching issue, since it happens even when the cache is far from full. Simply running php -r "var_dump(preg_match( '/[^\p{S}\p{Nd}]/us', '4' ));" shows the bug if /usr/bin/php is a symlink to /usr/bin/hhvm.

No idea whether it's actually related, but here's another case that gives an incorrect result in HHVM while working fine in Zend: php -r 'var_dump( preg_match( "/[\\x{101}-\\x{102}\\x{100}]/us", "\xC4\x80" ) );' (note that \xC4\x80 is U+0100) correctly returns 1 in Zend but returns 0 in HHVM. With a ^ added at the start of the character set, Zend correctly returns 0 but HHVM returns 1.

The failure conditions for this one seem to be:

  • There is a range before a single character in the character set.
  • The single character's codepoint is less than the codepoints for the range.
  • The single character has codepoint U+0100 or greater.

If these conditions are met, the single character will not be matched by the character set (or will be incorrectly matched if the character set is negated).
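
A standalone script for this second case might look like the following. It is a sketch: the expected values in the comments are the Zend results reported above, and the control pattern (single character placed before the range) is an addition for comparison, not something tested in this task.

```php
<?php
// "\xC4\x80" is the UTF-8 encoding of U+0100.
$subject = "\xC4\x80";

// A range (U+0101-U+0102) followed by a single character (U+0100) whose
// codepoint is below the range and is U+0100 or greater.
$plain   = preg_match('/[\x{101}-\x{102}\x{100}]/us', $subject);
$negated = preg_match('/[^\x{101}-\x{102}\x{100}]/us', $subject);

// Control (an assumption added for comparison): same class, but with the
// single character listed before the range.
$control = preg_match('/[\x{100}\x{101}-\x{102}]/us', $subject);

var_dump($plain);   // correct result: int(1)
var_dump($negated); // correct result: int(0)
var_dump($control); // correct result: int(1)
```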

Printing out the PCRE_VERSION constant shows the same result when using Zend and HHVM, so it doesn't seem to be an issue with different versions of PCRE either.

After some further testing, it is a PCRE bug, and it looks like it was fixed in 8.32 (we're using 8.31). It only occurs in HHVM because HHVM passes PCRE_STUDY_JIT_COMPILE to pcre_study, while Zend (at least in 5.5.9) doesn't. I don't see a specific upstream change that fixes it; they seem to have made a lot of JIT changes between 8.31 and 8.32.

So the simplest fix would be to upgrade libpcre3 (HHVM seems to be dynamically linked). Let's do at least 8.33, because 8.32 has a different bug (T71481). Debian stable and Ubuntu vivid are both on 8.35.
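
To confirm which PCRE a host ends up with after an upgrade, a quick check along these lines could be used. It is a sketch that assumes PCRE_VERSION has the usual "version date" form (for example "8.31 2012-07-06") and reflects the library actually loaded; the 8.33 threshold is the minimum suggested above.

```php
<?php
// PCRE_VERSION is assumed to look like "8.31 2012-07-06";
// keep only the version number before the first space.
$parts = explode(' ', PCRE_VERSION);
$version = $parts[0];

// 8.33 is the minimum suggested above: 8.31 has this bug, 8.32 has T71481.
$minimum = '8.33';
$ok = version_compare($version, $minimum, '>=');

printf(
    "PCRE %s: %s\n",
    $version,
    $ok ? 'at or above ' . $minimum : 'older than ' . $minimum
);
```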

Anomie renamed this task from HHVM gives incorrect results for certain PCRE patterns to HHVM and PCRE v8.31 gives incorrect results for certain PCRE patterns. (Nov 11 2015, 6:08 PM)
MoritzMuehlenhoff claimed this task.

We've been using Debian jessie for a while now (which has PCRE 8.35), and I've verified that hhvm now correctly returns 0 for the provided test case.