In T202058 the problem is that preg_quote was not used with a second argument (the regex delimiter).
This could be detected by phpcs to avoid such usages which mostly result in problems with unescaped regex
In T202058 the problem is that preg_quote was not used with a second argument (the regex delimiter).
This could be detected by phpcs to avoid such usages which mostly result in problems with unescaped regex
Change 456896 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/tools/codesniffer@master] preg_quote should not be used with 1 argument
The expaned sniff does not detect the function at location where callables are allowed:
array_map( 'preg_quote', array_keys( $config[ApiBase::PARAM_TEMPLATE_VARS] ) )
But that is for all functions which are forbidden in ForbiddenFunctionsSniff.php, maybe that needs an own task to improve. But when the callable is part of a variable it is impossible to detect.
Change 456896 merged by jenkins-bot:
[mediawiki/tools/codesniffer@master] preg_quote should not be used with 1 argument
Oh, wait. Using preg_quote with no argument is entirely valid:
$regex = preg_quote( $searchString ); preg_match( '{\b' . $regex . '\b}', $text, … );
Regular expressions are typically enclosed in /…/. But you can not only use any character you like, you can also use brackets. If that's the case, no additional character needs to be escaped when using preg_quote.
Change 457085 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/tools/codesniffer@master] Revert "preg_quote should not be used with 1 argument"
Change 457085 merged by jenkins-bot:
[mediawiki/tools/codesniffer@master] Revert "preg_quote should not be used with 1 argument"
I have never seen brackets as delimiter, but it is okay from documentation http://php.net/manual/de/regexp.reference.delimiters.php
and such brackets are already escaped by `preg_quote
`
With codesniffer it is not possible to detect the context of preg_quote, so this needs a phan plugin or the phan-taint-check-plugin needs a new category to check user supplied input to regex. Phan can also see the call in array_map
I see two paths forward:
Personally, I think it's worth adding one parameter to a few preg_quote calls, for the reward of using regular expressions in a way most developers are familiar with. Regular expressions are confusing enough as they are. I think we'd need a convincing argument to go otherwise, but I'm listening :)
Which way do we want to go, and why?
/…/ is the standard, and used in the majority of cases – unless there is a good reason to free the slash from being a special character. A typical example is something like /^http:\/\/\w+\/\w+\//, which is painful and often written as !^http://\w+/\w+/! instead. Alternative delimiters I see quite often in real-life code are !…! and @…@. I assume this is mostly because most developers don't know about the possibility to use {…}. Note this still allows to use the brackets as quantifiers, e.g. {1,2}, which is why I think this is the (almost, because it's uncommon) perfect delimiter.
But: disallowing /…/ would be weird and disturbing, not gain us much, and be pretty much impossible to auto-fix, as far as I can see. Think of compiled regular expressions: how can an auto-fix find the first and last character then?
I think this is just not worth it.
Sure, we could still enforce the second parameter in preg_quote to be '/', even if brackets are used as delimiters. Having slashes escaped as \/ just to be sure is fine, even if the slash does not necessarily have a special meaning.
So to answer the question above: Neither nor.
I agree with this. See also https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome.
I assume this is mostly because most developers don't know about the possibility to use {…}. Note this still allows to use the brackets as quantifiers, e.g. {1,2}, which is why I think this is the (almost, because it's uncommon) perfect delimiter.
I tend to avoid {...} because I always have to pause to think whether it works properly with those quantifiers. If the regex isn't trying to deal with HTML I sometimes use <...>.
But: disallowing /…/ would be weird and disturbing, not gain us much, and be pretty much impossible to auto-fix, as far as I can see. Think of compiled regular expressions: how can an auto-fix find the first and last character then?
I think this is just not worth it.
I agree here too.
I know phan can track strings through various calls for whether they've been HTML-escaped or not at the point where they're output. Can it do the same for preg_match and such, to determine whether any preg_quote calls that went into the string included the string's delimiter?
I think that would be confusing. In such event I'd rather see an inline phpcs-disable rule then a second parameter that looks like a mistake.
I'm also not sure whether we really need to enforce the delimiter to be /. I was under the impression our goal was to require a (any) delimiter to be set to preg_quote. This would mean that if we use #, or ! instead for a particular regular expression (as somewhat common convenient non-slash alternatives), you'd pass that character as second parameter to preg_quote, instead.
Alternatively, given preq_quote is only used for the rare subset of regex matching where foreign input is embedded as literal, we could require those to always use a slash, and then only allow !or # for literals. That seems like it would cover most if not all our cases, but I wasn't actually going that far yet. Thoughts?
Note that preg_quote quotes ! by default, which might explain its use as a delimiter. Also, just because a regex happens to include preg_quote doesn't mean that leaning toothpick syndrome won't apply to other parts of the regex.
The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
From http://www.php.net/manual/en/function.preg-quote.php
Using the taint-check-plugin to ensure that user provided input is escaped before use inside a regex (preg_*) sounds for the best, when it is possible to ensure the correct delimiter with phan. Regex injection should be avoid similar to sql injection.
Ensure that preg_quote is called with a second parameter with codesniffer is not useful, when due to copy&paste mistake the wrong delimiter is used.
FWIW, Personally, I dislike the idea of using brackets as a delimiter. That seems confusing.
Using the taint-check-plugin to ensure that user provided input is escaped before use inside a regex (preg_*) sounds for the best, when it is possible to ensure the correct delimiter with phan. Regex injection should be avoid similar to sql injection.
phan-taint-check can definitely check if user input is provided unescaped, but matching up delimiter to what is used with preg_quote might be a little harder.
taint check has ReDoS checks since https://gerrit.wikimedia.org/r/c/mediawiki/tools/phan/SecurityCheckPlugin/+/644800 - that should be enough