Simple regular expression fails on 10000 character string
Closed, Declined · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

Go to https://en.wikipedia.org/wiki/Special:AbuseFilter/examine/log/32266167 and test against this pattern:

new_wikitext rlike "(x)*"

(new_wikitext consists of 10000 "x"s)

What happens?:

The result is "The filter has invalid syntax"

What should have happened instead?:

new_wikitext contains lots of "x"s, so the result should be "The filter matched this change."

Event Timeline

My first thought was something about capturing groups, but it also fails on (?:x)*. I then tried different quantifier bounds by trial and error, and found that it works fine up to (x){0,2729}, but anything larger fails. Hmm.

There are various regex configs in PHP like pcre.backtrack_limit and pcre.recursion_limit. Perhaps they are related?

https://www.php.net/manual/en/pcre.configuration.php

The PCRE engine is reaching some memory limit. You'd get something similar in PHP itself:

var_dump(preg_match("/(x)*/ui",str_repeat('x',10000)));
var_dump(preg_last_error());

bool(false)
int(6)

Where 6 corresponds to PREG_JIT_STACKLIMIT_ERROR, which means the JIT stack would get too large.
There doesn't seem to be a way to fix it, except for setting pcre.jit = 0, which I don't think is a good idea.

The only real solution is to improve the regex somehow, perhaps split it, although I can't say how exactly without seeing the original.

Original thread was here. I'm trying to find if something occurs inside a table with an "Album" heading:

I want to say:

new_wikitext irlike ("(?s)!\s*Album(?:.(?!\|\}))*(?:" + bad_word + ")")

But that triggers this bug. I can say:

str_replace(new_wikitext, "|}", "@") irlike ("!\s*Album[^@]*(?:" + bad_word + ")")

But I was hoping to avoid such a hack.
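For anyone who wants to poke at the placeholder workaround outside AbuseFilter, here is a minimal Python sketch of the same idea. It is only an approximation: Python's re module is not PCRE, the function name album_table_hit and the sample wikitext are made up for illustration, and "spam" stands in for bad_word. It also shows one caveat of the hack: a literal "@" already present in the table stops [^@]* early, so a later bad word is missed.

```python
import re

def album_table_hit(wikitext: str, bad_word: str) -> bool:
    """Approximate the filter: replace "|}" with "@", then match up to the first "@"."""
    marked = wikitext.replace("|}", "@")
    pattern = r"!\s*Album[^@]*(?:" + bad_word + ")"
    # re.IGNORECASE plays the role of the "i" in irlike.
    return re.search(pattern, marked, re.IGNORECASE) is not None

# Hypothetical samples: bad word inside the table, after the table,
# and inside the table but preceded by a literal "@".
inside = "{|\n! Album\n|-\n| some spam here\n|}\ntrailing text"
outside = "{|\n! Album\n|-\n| fine\n|}\nspam after the table"
with_at = "{|\n! Album\n|-\n| mail a@b.example then spam\n|}"

print(album_table_hit(inside, "spam"))   # True
print(album_table_hit(outside, "spam"))  # False
print(album_table_hit(with_at, "spam"))  # False, although "spam" is inside the table
```

Picking a placeholder character that cannot occur in the wikitext would avoid the third case, but that is a judgment call for the filter author.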

One thing you could try is replacing (?:.(?!\|\}))* with (?:[^|]|\|(?!\}))*, which seems cheaper. And also adding a limit to the quantifier (e.g. {0,3000} or some other number), which should help with performance even when the memory limit isn't reached.
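A rough way to sanity-check the rewrite is to compare the patterns on sample text. The sketch below uses Python's re module, which is not PCRE and has no JIT stack limit, so it only checks the matching logic, not the original failure. "spam" and the sample strings are hypothetical stand-ins.

```python
import re

BAD_WORD = "spam"  # hypothetical stand-in for the filter's bad_word variable

# Pattern from the thread: one negative lookahead per consumed character.
ORIGINAL = r"(?s)!\s*Album(?:.(?!\|\}))*(?:" + BAD_WORD + ")"
# Suggested rewrite: the lookahead only runs when a "|" is encountered.
REWRITTEN = r"!\s*Album(?:[^|]|\|(?!\}))*(?:" + BAD_WORD + ")"
# Same rewrite with the quantifier capped at 3000 repetitions.
BOUNDED = r"!\s*Album(?:[^|]|\|(?!\})){0,3000}(?:" + BAD_WORD + ")"

def hit(pattern: str, text: str) -> bool:
    """Case-insensitive search, roughly what irlike does."""
    return re.search(pattern, text, re.IGNORECASE) is not None

inside = "{|\n! Album\n|-\n| some spam here\n|}\ntrailing text"
outside = "{|\n! Album\n|-\n| fine\n|}\nspam after the table"
far = "! Album" + "x" * 5000 + "spam\n|}"  # bad word 5000 characters in

for name, pattern in [("original", ORIGINAL), ("rewritten", REWRITTEN), ("bounded", BOUNDED)]:
    print(name, hit(pattern, inside), hit(pattern, outside), hit(pattern, far))
# original True False True
# rewritten True False True
# bounded True False False
```

The two unbounded forms are close but not identical: the original refuses to consume the character immediately before |}, whereas the rewrite stops exactly at |}. The last row shows the trade-off of the cap: a bad word further than 3000 characters after the heading is no longer found.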

I'm declining this task because it's not possible to fix, but do feel free to ask if you need more help with the regex.