Page MenuHomePhabricator

Check simple format constraints (no grouping) in PHP instead of SPARQL
Open, Stalled, MediumPublic

Description

See the parent task for a more general problem statement – in short, we need to check user-provided input against user-provided regexes, which is generally unsafe.

However, the majority of user-provided regexes we’re interested in (76%) do not contain any parentheses, which means they cannot contain any groups and their star height must be 0 or 1. Unless I’m mistaken, this should mean we can safely evaluate them via preg_match, and the evaluation time should not explode.

Since we’re currently increasing the number of constraint checks being run in general (T204031), which results in an increased query volume (T204031#4898189), I think we should consider checking such regexes in PHP, removing the overhead of talking to the query server.

Are there any objections to this? Does anyone have an example query that causes exponential runtime in PCRE without containing parentheses, or are there other security problems when checking user-provided regexes besides runtime concerns?

Event Timeline

The only thing I can think of is security issue with preg_match. Security_checklist_for_developers says:

anything external that is used in part of regex should be escaped with preg_quote( $externalStr, $delimiter ). It puts a backslash in front of every character that is part of the regular expression syntax, and escapes also the delimiter given as second parameter:

$str = preg_replace( "!" . preg_quote( $externalStr, '!' ) . "!", $replacement, $str );

Well, we can’t do that, we want the user input to be interpreted as a regex, not literally… (though we will have to escape the delimiter)

I recommend reaching out to security people and ask their opinion on this.

@JBennett can we ask for your opinion here? or someone from your wonderful team?

Addshore added a subscriber: Addshore.

Tagging the team to see if we can get some thoughts.

Jcross triaged this task as Medium priority.Oct 4 2019, 4:45 PM

@RazShuty @Addshore @Lucas_Werkmeister_WMDE - Sorry for the (very) delayed response here. Due to a healthy amount of organizational shift, the Security-Team is just now getting our Phab workboards in order to review these kinds of items more consistently. For now, it's still probably best to find one of us on IRC (most of us run bouncers or use IRCCloud) or email us if there are any lingering tasks like this.

Regarding this task, specifically:

However, the majority of user-provided regexes we’re interested in (76%) do not contain any parentheses, which means they cannot contain any groups and their star height must be 0 or 1. Unless I’m mistaken, this should mean we can safely evaluate them via preg_match, and the evaluation time should not explode.

I feel like this is a generally safe assumption to make in eliminating large swaths of likely benign regexp patterns, assuming someone doesn't create an enormously lengthy regular expression pattern which might suffer from other performance-related issues. In addition to parentheses, I'd guess that any occurrences of unbound repetition operators (+, *, {n,}) might further help to classify potentially dangerous regexp patterns. Of course static analysis to determine the safety of specific regexp patterns is fairly difficult if not impossible in certain situations, so additional solutions involving sandboxing, limiting memory usage, etc. as discussed on the parent task would certainly be encouraged as best practices by the Security-Team.

Probably blocked on the RFC around regexes.
RFC will be written in T236150

Addshore changed the task status from Open to Stalled.Nov 6 2019, 12:08 PM

I guess we can decline this task? My hope was that we could check some regular expressions locally without waiting for a safe mechanism to evaluate arbitrary regexes, but that didn’t happen. It now looks like T260330: RFC: PHP microservice for containerized shell execution is no more than a few months away (wild guess, I’m not involved with that project), which would presumably unlock checking all regular expressions without the query service (though still with some network request overhead).