Page MenuHomePhabricator

Check simple format constraints (no grouping) in PHP instead of SPARQL
Open, MediumPublic

Description

In the WikibaseQualityConstraints extension, we need to check user-provided input against user-provided regexes, which is generally unsafe without some sort of protection. (See the parent task for some more details.)

Currently, this is implemented using Shellbox (with a fairly short timeout). This is already an improvement over the previous implementation, which used the Wikidata Query Service. However, both of these solutions still add network overhead to the check, making it one of the slowest constraint checks.

The majority of user-provided regexes we’re interested in (77%) do not contain any parentheses, which means they cannot contain any groups and their star height must be 0 or 1. Unless I’m mistaken, this should mean we can safely evaluate them via preg_match (directly in the main PHP process, without Shellbox), and the evaluation time should not explode.

Are there any objections to this? Does anyone have an example query that causes exponential runtime in PCRE without containing parentheses, or are there other security problems when checking user-provided regexes besides runtime concerns?

Event Timeline

The only thing I can think of is security issue with preg_match. Security_checklist_for_developers says:

anything external that is used in part of regex should be escaped with preg_quote( $externalStr, $delimiter ). It puts a backslash in front of every character that is part of the regular expression syntax, and escapes also the delimiter given as second parameter:

$str = preg_replace( "!" . preg_quote( $externalStr, '!' ) . "!", $replacement, $str );

Well, we can’t do that, we want the user input to be interpreted as a regex, not literally… (though we will have to escape the delimiter)

I recommend reaching out to security people and ask their opinion on this.

@JBennett can we ask for your opinion here? or someone from your wonderful team?

Addshore subscribed.

Tagging the team to see if we can get some thoughts.

Jcross triaged this task as Medium priority.Oct 4 2019, 4:45 PM

@RazShuty @Addshore @Lucas_Werkmeister_WMDE - Sorry for the (very) delayed response here. Due to a healthy amount of organizational shift, the Security-Team is just now getting our Phab workboards in order to review these kinds of items more consistently. For now, it's still probably best to find one of us on IRC (most of us run bouncers or use IRCCloud) or email us if there are any lingering tasks like this.

Regarding this task, specifically:

However, the majority of user-provided regexes we’re interested in (76%) do not contain any parentheses, which means they cannot contain any groups and their star height must be 0 or 1. Unless I’m mistaken, this should mean we can safely evaluate them via preg_match, and the evaluation time should not explode.

I feel like this is a generally safe assumption to make in eliminating large swaths of likely benign regexp patterns, assuming someone doesn't create an enormously lengthy regular expression pattern which might suffer from other performance-related issues. In addition to parentheses, I'd guess that any occurrences of unbound repetition operators (+, *, {n,}) might further help to classify potentially dangerous regexp patterns. Of course static analysis to determine the safety of specific regexp patterns is fairly difficult if not impossible in certain situations, so additional solutions involving sandboxing, limiting memory usage, etc. as discussed on the parent task would certainly be encouraged as best practices by the Security-Team.

Probably blocked on the RFC around regexes.
RFC will be written in T236150

Addshore changed the task status from Open to Stalled.Nov 6 2019, 12:08 PM

I guess we can decline this task? My hope was that we could check some regular expressions locally without waiting for a safe mechanism to evaluate arbitrary regexes, but that didn’t happen. It now looks like T260330: RFC: PHP microservice for containerized shell execution is no more than a few months away (wild guess, I’m not involved with that project), which would presumably unlock checking all regular expressions without the query service (though still with some network request overhead).

In T176312#7269680 it was concluded that this task here is still worth looking into.

Addshore changed the task status from Stalled to Open.Sep 14 2021, 8:31 AM

No longer stalled then I guess!

@Lucas_Werkmeister_WMDE it looks like the description is probably quite out of date in terms of the current state and what we may actually want to do here (it mainly point to and talks about the parent ticket which is now closed).
Any chance you could give it a go re writing it?

I’ve refreshed the task description a bit, hope it helps.