Page MenuHomePhabricator

Add sniff that preg_quote should not be used with 1 argument
Open, MediumPublic

Description

In T202058 the problem is that preg_quote was not used with a second argument (the regex delimiter).

This could be detected by phpcs to avoid such usages which mostly result in problems with unescaped regex

Event Timeline

Change 456896 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/tools/codesniffer@master] preg_quote should not be used with 1 argument

https://gerrit.wikimedia.org/r/456896

The expaned sniff does not detect the function at location where callables are allowed:

array_map( 'preg_quote', array_keys( $config[ApiBase::PARAM_TEMPLATE_VARS] ) )

But that is for all functions which are forbidden in ForbiddenFunctionsSniff.php, maybe that needs an own task to improve. But when the callable is part of a variable it is impossible to detect.

Change 456896 merged by jenkins-bot:
[mediawiki/tools/codesniffer@master] preg_quote should not be used with 1 argument

https://gerrit.wikimedia.org/r/456896

Legoktm assigned this task to Umherirrender.
thiemowmde added a subscriber: thiemowmde.

Oh, wait. Using preg_quote with no argument is entirely valid:

$regex = preg_quote( $searchString );
preg_match( '{\b' . $regex . '\b}', $text,  );

Regular expressions are typically enclosed in /…/. But you can not only use any character you like, you can also use brackets. If that's the case, no additional character needs to be escaped when using preg_quote.

Change 457085 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/tools/codesniffer@master] Revert "preg_quote should not be used with 1 argument"

https://gerrit.wikimedia.org/r/457085

Change 457085 merged by jenkins-bot:
[mediawiki/tools/codesniffer@master] Revert "preg_quote should not be used with 1 argument"

https://gerrit.wikimedia.org/r/457085

I have never seen brackets as delimiter, but it is okay from documentation http://php.net/manual/de/regexp.reference.delimiters.php
and such brackets are already escaped by `preg_quote
`
With codesniffer it is not possible to detect the context of preg_quote, so this needs a phan plugin or the phan-taint-check-plugin needs a new category to check user supplied input to regex. Phan can also see the call in array_map

Oh, wait. Using preg_quote with no argument is entirely valid:

$regex = preg_quote( $searchString );
preg_match( '{\b' . $regex . '\b}', $text,  );

Regular expressions are typically enclosed in /…/. But you can not only use any character you like, you can also use brackets.

I see two paths forward:

  1. We agree to use the slash-delimited form of regular expressions in our PHP code. This is the status quo. The proposed rule to require a delimiter enforces that. There are indeed cases where preg_quote can be safe, but it is by no means functionally required to do so. Any code using it without slashes can be updated for consistency to use slashes and be fine.
  2. We agree on a different delimiter, and update all existing code. If the chosen delimiter does not require preg_quote, we will also update the sniff to disallow use of the second parameter.

Personally, I think it's worth adding one parameter to a few preg_quote calls, for the reward of using regular expressions in a way most developers are familiar with. Regular expressions are confusing enough as they are. I think we'd need a convincing argument to go otherwise, but I'm listening :)

Which way do we want to go, and why?

Krinkle triaged this task as Medium priority.Sep 11 2018, 12:02 AM
Krinkle added a subscriber: tstarling.

/…/ is the standard, and used in the majority of cases – unless there is a good reason to free the slash from being a special character. A typical example is something like /^http:\/\/\w+\/\w+\//, which is painful and often written as !^http://\w+/\w+/! instead. Alternative delimiters I see quite often in real-life code are !…! and @…@. I assume this is mostly because most developers don't know about the possibility to use {…}. Note this still allows to use the brackets as quantifiers, e.g. {1,2}, which is why I think this is the (almost, because it's uncommon) perfect delimiter.

But: disallowing /…/ would be weird and disturbing, not gain us much, and be pretty much impossible to auto-fix, as far as I can see. Think of compiled regular expressions: how can an auto-fix find the first and last character then?

I think this is just not worth it.

Sure, we could still enforce the second parameter in preg_quote to be '/', even if brackets are used as delimiters. Having slashes escaped as \/ just to be sure is fine, even if the slash does not necessarily have a special meaning.

So to answer the question above: Neither nor.

/…/ is the standard, and used in the majority of cases – unless there is a good reason to free the slash from being a special character. A typical example is something like /^http:\/\/\w+\/\w+\//, which is painful and often written as !^http://\w+/\w+/! instead.

I agree with this. See also https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome.

I assume this is mostly because most developers don't know about the possibility to use {…}. Note this still allows to use the brackets as quantifiers, e.g. {1,2}, which is why I think this is the (almost, because it's uncommon) perfect delimiter.

I tend to avoid {...} because I always have to pause to think whether it works properly with those quantifiers. If the regex isn't trying to deal with HTML I sometimes use <...>.

But: disallowing /…/ would be weird and disturbing, not gain us much, and be pretty much impossible to auto-fix, as far as I can see. Think of compiled regular expressions: how can an auto-fix find the first and last character then?

I think this is just not worth it.

I agree here too.

I know phan can track strings through various calls for whether they've been HTML-escaped or not at the point where they're output. Can it do the same for preg_match and such, to determine whether any preg_quote calls that went into the string included the string's delimiter?

Sure, we could still enforce the second parameter in preg_quote to be '/', even if brackets are used as delimiters. [..]

I think that would be confusing. In such event I'd rather see an inline phpcs-disable rule then a second parameter that looks like a mistake.

I'm also not sure whether we really need to enforce the delimiter to be /. I was under the impression our goal was to require a (any) delimiter to be set to preg_quote. This would mean that if we use #, or ! instead for a particular regular expression (as somewhat common convenient non-slash alternatives), you'd pass that character as second parameter to preg_quote, instead.

Alternatively, given preq_quote is only used for the rare subset of regex matching where foreign input is embedded as literal, we could require those to always use a slash, and then only allow !or # for literals. That seems like it would cover most if not all our cases, but I wasn't actually going that far yet. Thoughts?

Note that preg_quote quotes ! by default, which might explain its use as a delimiter. Also, just because a regex happens to include preg_quote doesn't mean that leaning toothpick syndrome won't apply to other parts of the regex.

The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
From http://www.php.net/manual/en/function.preg-quote.php

Using the taint-check-plugin to ensure that user provided input is escaped before use inside a regex (preg_*) sounds for the best, when it is possible to ensure the correct delimiter with phan. Regex injection should be avoid similar to sql injection.

Ensure that preg_quote is called with a second parameter with codesniffer is not useful, when due to copy&paste mistake the wrong delimiter is used.

FWIW, Personally, I dislike the idea of using brackets as a delimiter. That seems confusing.

Using the taint-check-plugin to ensure that user provided input is escaped before use inside a regex (preg_*) sounds for the best, when it is possible to ensure the correct delimiter with phan. Regex injection should be avoid similar to sql injection.

phan-taint-check can definitely check if user input is provided unescaped, but matching up delimiter to what is used with preg_quote might be a little harder.