Page MenuHomePhabricator

Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions)
Open, Needs TriagePublic

Description

Problem

Wikidata validates statements against the constraints from the associated property. The constraints are user-defined and one of the possible constraint types for text values is a regex pattern.

Due to the impact of potentially malicious regexes, the MediaWiki PHP backend for Wikidata does not use PHP's preg_match. Instead, we need to isolate this in some way.

The current workaround uses the SPARQL query service, which incurs a lot of overhead (ping, TCP, HTTP, SPARQL parsing, query engine preparation), which results in bad timing of the format constraint even for benign regexes. We should investigate whether we can check regexes more locally. However, the mechanism should be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.

@Krinkle wrote in T173696:

I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps using LuaSandbox?

We can’t directly uses Lua’s regexes because their syntax is too different from PCRE, and implementig a PCRE-compatible engine in Lua is probably too much work. I’m not familiar enough with firejail to comment on that option.

Solutions

  • Lua (via PHP binding?)
  • PHP program called as sub process within a Firejail.
  • re2 (https://github.com/google/re2), either via a microservice, or via PHP binding.
  • ...

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 20 2017, 12:33 PM

For the wikidata-constraints test system (which doesn’t have enough RAM to run BlazeGraph), I’ve written a tiny server, minisparql, which listens for REGEX() SPARQL queries and evaluates them with PCRE. We could use a very similar service, except we can do away with the SPARQL wrapping (e. g. have the regex and text on two lines), and then we can write the whole thing in PHP (I thought the regex-parsing part would be easier to do in Python, but then I learned that Python doesn’t have built-in PCRE, so currently minisparql has Python run a PHP process just for preg_match, which is of course a bit silly).

minisparql is a systemd service and uses several systemd features:

  • socket activation
    • minisparql doesn’t have to handle any network stuff, systemd sets up the socket
    • input on stdin and output on stdout
    • one process per connection, so the sandbox (see below) is per-request automatically
  • sandboxing
    • max. 1 second of CPU time
    • only one extra process (the PHP, mentioned before; could be lowered to zero extra processes or threads when the whole thing is written in PHP)
    • running as nobody, no writing to the file system, no own networking, no sudo, etc.

Could we deploy a service like this on the application servers so that there’s no networking overhead?

daniel renamed this task from Don’t check format constraint via SPARQL to Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions).Sep 20 2017, 7:47 PM
Joe added a subscriber: Joe.Sep 20 2017, 8:12 PM
daniel moved this task from Inbox to Backlog on the TechCom-RFC board.Sep 20 2017, 8:41 PM

If you just want an approximately PCRE-like syntax, you could just translate the regex to a Lua pattern. Scribunto has equivalent code going in the other direction, in Scribunto_LuaUstringLibrary::patternToRegex(), which you could look at for ideas. Obviously you would be implementing a subset of PCRE features.

The nice thing about Lua patterns is that I've reviewed the code and patched out the DoS vulnerabilities. I don't think there are many other regex libraries which have been verified as safe for arbitrary user input, with limited time and memory.

I don't think there are many other regex libraries which have been verified as safe for arbitrary user input, with limited time and memory.

PCRE is probably not one, especially pcre+PHP, I've seen a bunch of issues there. Also I think full PCRE is way over-powered for what we need here, and subsequently the attack area is much wider. It is also not too hard to get PCRE into pathological behavior time- and memory-wise, given control over regexp and string. Java regexps also suffer from the same problems, but I've added mitigation for it to WDQS lately.

If we run PCRE on user-supplied regexps, we should either sanitize them (no idea how), or run them in very tightly sequestered environment, if we're running PHP+PCRE essentially assuming we're running a hostile code. That seems to be kind of dangerous, so we should be very careful here.

Anomie added a subscriber: Anomie.Sep 21 2017, 7:48 PM

Lua’s regexes

One item of note is that Lua's patterns don't support basic regex features such as general alternation (only alternation over a set of single characters) or applying the Kleene star to arbitrary subpatterns (only to single characters or character sets). There's no straightforward way to do something like /foo(abc|def)bar/ or /f(oo)*/ in Lua patterns.

RE2 sounds nice… we could probably live without backreferences. But we’d need some way to use RE2 in PHP.

I don’t think Lua patterns are powerful enough to check format constraints. There are a lot of format constraints on Wikidata already, many of them fairly complicated to match weird identifier formats from all kinds of organizations, and at least general alternation is used by a lot of them.

For best compatibility with KrBot (a community-run bot that also checks constraints on Wikidata, and uses PCRE for regexes), I would strongly prefer using PCRE, in a very restricted environment of course.

daniel added a comment.EditedSep 22 2017, 1:40 PM

RE2 sounds nice… we could probably live without backreferences. But we’d need some way to use RE2 in PHP.

There seems to be an old attempt to create a php module for it: https://github.com/arraypad/php-re2

If that isn't maintained or sufficient, we can create one ourselves. WMF already maintains a couple of PHP modules to provide access to the world of C++, so that seems feasible. Johannes has recently worked on one of these modules, namely wikidiff2.

Joe added a comment.Sep 25 2017, 6:30 AM

I think re2 seems like an interesting candidate. I would argue we still want to have a separate microservice running on a separate cluster from MediaWiki, for security reasons, and I would think it could be used to run the regular expression validations as well.

AIUI, this would be called when editing, so a few milliseconds of overhead of the network call would not hurt.

That would free you from the need of support of that regex engine in PHP.

AIUI, this would be called when editing

Not quite – constraint checks (including the “format” constraint) are currently run when a user who has enabled the checkConstraints gadget views an entity page or saves a statement. (And we plan to enable that gadget by default soon.) And constraint checking for one entity may include dozens of regex checks, especially once we start checking qualifiers and references too.

If we’re going to check the regexes on a microservice, then we might as well use PCRE IMHO, for compatibility’s sake. I don’t think any of the regular regexes on Wikidata result in catastrophic runtime behavior, so if that risk is already mitigated by the microservice, I don’t think we need the extra protection+restrictions of RE2.

If we’re going to check the regexes on a microservice, then we might as well use PCRE IMHO, for compatibility’s sake. I don’t think any of the regular regexes on Wikidata result in catastrophic runtime behavior, so if that risk is already mitigated by the microservice, I don’t think we need the extra protection+restrictions of RE2.

That would allow an attacker user to bring down the microservice using a regex crafted for catastrophic backtracking. This is not just about protection against ignorance, it's also about protection against malice.

Of course, bringing down just that service is a lot better than bringing down the application server. But still, if we can protect against that, we should.

I’m still thinking of a per-request microservice (like minisparql) which gets torn down at the end of each request anyways – but even if the service handles multiple requests, we can just restart it, it’s stateless anyways. What I’m worried about is not damage to the service itself, but to anything else running on the same machine – so the service should still be limited in CPU time.

so the service should still be limited in CPU time.

...and RAM.

There's pcre.backtrack_limit and pcre.recursion_limit in PHP settings, which can be significantly reduced against defaults (which are very generous). However, I am not 100% sure it covers all scenarios where PCRE can get into the woods.

It sounds like ORES might also benefit from such a service to safely evaluate REs (T168965, blog post) – @Halfak do you agree?

Halfak added a comment.Oct 3 2017, 6:48 PM

I'm not sure we'd make use of an external service. In our case, I think a more robust timeout and some testing gets us what we need.

Krinkle updated the task description. (Show Details)Jan 24 2018, 11:15 PM

Just to clarify the situation: we do constraint checks any time a user with the checkConstraints gadget enabled visits an entity page, or after they save a statement. A constraint check can involve multiple regex checks: possibly several per statement, and a full constraint check involves all statements of an entity. We already store results of individual regex checks in the WANObjectCache (T173696 – cache hit rate roughly estimated at 30%, see Grafana), and plan to soon cache full check results as well (T181060, T184812). The gadget is currently used by 323 users, 231 of them active, according to Special:GadgetUsage, but we plan to have it enabled for all logged-in users by 2018-04-04 (T184069).

As an interim solution, would it be okay to use unsandboxed preg_match for certain regexes which are known to be safe, and to continue to use SPARQL for all others? For example, currently, 2304 out of 2999 regexes in format constraints don’t contain any parentheses (query). That’s a very crude metric, but I think it’s a safe one (plenty of safe regexes have parentheses, but I don’t think it’s possible to construct a ReDoS attack without any nesting), and if it already classifies over ¾ths of regexes as safe, that could save us a lot of SPARQL calls.

I don’t think it’s possible to construct a ReDoS attack without any nesting

Wouldn't something like a*b*[ac]*$ be still dangerous? Maybe not as dangerous as nested ones, but seems to still have some evil potential.

There's also this one: https://github.com/substack/safe-regex which may expand the possibilities a little.

safe-regex mainly works by measuring the star height of the regex, and disabling nesting is a crude way to limit the star height to 1, so that we don’t need to parse the regex. (a*b*[ac]*$ has multiple stars, but its star height is still one.)

That said, a slightly modified version of your regex (a*a*b$) does cause long query runtimes when tested against a long string of as in WDQS. But I can’t reproduce that behavior locally with Java String.matches() nor with PHP preg_match(), so I’m not sure if that’s a problem related to any regex engine or some BlazeGraph weirdness.

Hi, just wanted to mention that TechCom has been reading along here. Let us know if you need our input and/or would like an IRC meeting for finding more options, or narrowing down options, or approving a specific option.

Halfak removed a subscriber: Halfak.Jul 25 2018, 8:54 PM

BTW looks like T200630 provides another example of failing regexp, though in Python.

That said, a slightly modified version of your regex (a*a*b$) does cause long query runtimes when tested against a long string of as in WDQS.

I can’t reproduce this anymore – unfortunately I didn’t record which query I used for this, but now this query runs almost instantly. Since I’d really like to move ahead with this, I’ve filed T214378 for this improvement (because that only applies to a subset of regexes, so we can keep this task around for safely evaluating all user-provided regexes).

@Lucas_Werkmeister_WMDE do you think Tech Com can be useful here or should this task just live in a backlog?