Page MenuHomePhabricator

RFC: How to evaluate user-provided regular expressions
Open, MediumPublic

Description

  • Affected components: Wikibase extension for MediaWiki.
  • Engineer for initial implementation: Wikidata team (WMDE).
  • Code steward: Wikidata team (WMDE).

Motivation

Wikidata validates statements against the constraints from the associated property. The constraints are user-defined and one of the possible constraint types for text values is a regex pattern. As an example IMDb values need to follow this regex: ev\d{7}\/(19|20)\d{2}(-\d)?|(ch|co|ev|tt|nm)\d{7}|(tt|ni|nm)\d{8} and items not complying with this regex will cause a constraint violation. It's important to note that both regex and the text we evaluate the regex against are provided by users.

Due to the impact of potentially malicious regexes, the MediaWiki PHP backend for Wikidata does not use PHP's preg_match. Instead, we need to isolate this in some way.

The current workaround uses the SPARQL query service, which incurs a lot of overhead (ping, TCP, HTTP, SPARQL parsing, query engine preparation), which results in bad timing of the format constraint even for benign regexes. We should investigate whether we can check regexes more locally. However, the mechanism should be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.

@Krinkle wrote in T173696:

I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps using LuaSandbox?

We can’t directly uses Lua’s regexes because their syntax is too different from PCRE, and implementig a PCRE-compatible engine in Lua is probably too much work. I’m not familiar enough with firejail to comment on that option.

Requirements
  • Hard execution time limit.
  • Isolate security-wise from main PHP process.

Exploration

  • Lua (via PHP binding?)
  • PHP program called as sub process within a Firejail.
  • re2 (https://github.com/google/re2), either via a microservice, or via PHP binding.
  • See
  • ...

Prior discussions:

If we use a standalone service (either backend by Python, PHP, or Node.js) that takes a regex and a string and evaluates it. This separation makes sure that A) there are proper circuit-breakers in place and B) if a malicious request manages to break out, it won't be able to access private data or cause large scale damage due to being in its own isolated service.

Proposal 1: Evaluate regexes using Lua.

Lua has built-in support for hard time and memory limits, which we have prior experienced with via Scribunto and LuaSandbox.

Pros

  • No need for another service.
  • Easy to implement.
  • Low latency and no no network roundtrip (more performant than current solution).

Cons

  • Lua doesn't implement most of PCRE features, we'd have to migrate existing regexes to this new simplified syntax which might hard or impossible, ref T176312#3625405
Proposal 2: Firejail'ed PHP microservice/subprocess.

While calling preg_match within the main process is hard to limit safely, it is much easier to limit a subprocess spawned by PHP. Both in terms of cross-socket time timeouts (on the caller end) but also on the other end with Firejail and Ulimit to enforce limited memory and execution time.

Pros

  • Easy to implement.
  • Low latency and no network roundtrip (more performant than current solution).

Cons

  • Harder to secure given PHP has many capabilities.
  • Harder to re-use for non-MediaWiki services. (TODO: why?)
Proposal 3: re2-based microservice/subprocess

The re2 C-library from Google implements regular expressions in a sandboxed and linear-time fashion with the ability to set a hard memory and time budget. It is also fast/faster than PCRE in most cases. https://github.com/google/re2/wiki/WhyRE2 https://github.com/google/re2

The library would be wrapped in a simple C service (like Poolcounter) or in a scripting language with a known binding (e.g. Python, Erlang, Perl, Ruby).

Pros

  • Low latency and no network roundtrip (more performant than current solution).
  • Can perform significantly better than other proposals (e.g. if implemented as a deamon with a worker pool, no process start overhead).
  • Easier to trust security-wise (no PHP involved).
  • Capable of (most) PCRE features.
  • Easy to re-use.
  • (Maybe) Easier to promote in open-source by not having a dependency on any short-lived major version of PHP/Node.js.

Cons

  • Harder to implement (although still relatively easy).

RPC framework

If we go with Proposal 3 (or another solution that may involve network overhead), then this might be a good candidate to try out gRPC instead of using traditional HTTP+JSON communication between client and server. This makes the network overhead slightly lower due to being a more efficient network protocol. A possible client-side design would be something like this:

$client = new GrpcRegexClient('node-server:9090', [
    'credentials' => Grpc\ChannelCredentials::createInsecure(),
]);

$request = new GrpcRegexRequest();
$request->setRegex( $userProvidedRegex );
$request->setText( $userProvidedText );
list($response, $status) = $client->Evaluate( $request )->wait();
echo $response->getResult()."\n";

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 16 2019, 6:35 PM
Ladsgroup updated the task description. (Show Details)Dec 17 2019, 1:17 PM
Ladsgroup updated the task description. (Show Details)
Ladsgroup updated the task description. (Show Details)Dec 18 2019, 5:09 PM
Ladsgroup updated the task description. (Show Details)
Krinkle renamed this task from Stand-alone service to evaluate user-provided regular expressions to Standalone service to evaluate user-provided regular expressions.Jan 8 2020, 9:27 PM
Krinkle updated the task description. (Show Details)Jan 8 2020, 10:07 PM

Based on today's TechCom meeting I've updated the task description to better separate the three proposals, and added a Requirements section.

I've also fleshed out the re2-based solution description and clarified that gRPC is not itself a critical part of any of the solutions. Rather, it is brought up here because there is interest from involved parties to try out a different way of communicating over the network between client and server. And, if we end up with an HTTP-based standalone service, we can consider gRPC as alternative to HTTP/JSON. Though this is mainly an implementation detail and not significant in terms requirements or pros/cons.

Volans added a subscriber: Volans.Jan 8 2020, 10:58 PM

Though this is mainly an implementation detail and not significant in terms requirements or pros/cons.

I disagree for a couple of reasons: gRPC is faster. According to some measurements in ASP.net (not in php) it's seven times faster than HTTP/JSON. That would be an important factor in deciding whether we should go with standalone service or another direction.

The other reason I think it's important that is this would be the first time we are going to use gRPC in production, meaning introducing new dependencies (in php) and services, this is cross cutting and would involve more work from services and SRE than HTTP+JSON solution. Also, another reason also is that the API spec of the regex implementation is hard to undo as it'll be used in several places not just one part of Wikibase.

Joe added a subscriber: Joe.Jan 13 2020, 10:25 AM

Though this is mainly an implementation detail and not significant in terms requirements or pros/cons.

I disagree for a couple of reasons: gRPC is faster. According to some measurements in ASP.net (not in php) it's seven times faster than HTTP/JSON. That would be an important factor in deciding whether we should go with standalone service or another direction.

The other reason I think it's important that is this would be the first time we are going to use gRPC in production, meaning introducing new dependencies (in php) and services, this is cross cutting and would involve more work from services and SRE than HTTP+JSON solution. Also, another reason also is that the API spec of the regex implementation is hard to undo as it'll be used in several places not just one part of Wikibase.

I think you're both right - gRPC is an implementation detail in answering the first question:

  • Does a standalone service make sense?

but if the answer is yes, then it's something we should discuss in the RFC process for the reasons @Ladsgroup outlined.

Joe added a comment.Jan 13 2020, 10:39 AM

I think the main question to answer is "does it make sense to create a safe regex evaluation service?".

I think in a void the answer is "no". It could make sense to create a small C++ program wrapping the main re2 functionality and shell out to it from php.

On the other hand, we have to consider the wikimedia infrastructure for this and there are two counterpoints to be made:

  • Is this a service we can only expect MediaWiki to call? If not, that's a point in favour of creating a separate service
  • Shelling out for us works well by using a combination of firejail and cgroups creation that won't work well in the future with cgroups v2 and containerization
  • Performance might not be extremely relevant

Now on the last point: this proposal seems to worry a lot about performance, but I see no performance requirement spelled out. Without more context, both the choice of shelling out vs and RPC service, and the proposal to use gRPC for said service seem to me like premature optimizations.

So my questions are:

  • What is the 95th percentile of latency in validating all the constraint on an item when editing it?
  • What is the average, median and max number of regexes we need to validate per item?

Without answering those questions, we would just make choices by principle, while I think we should have a more pragmatic approach.

So, some of the numbers that we already have:

https://grafana.wikimedia.org/d/000000344/wikidata-quality?orgId=1&from=now-1h&to=now&fullscreen&panelId=10 has some infomation about how many times we currently miss our regex cache within wikibasequalityconstraints and request wdqs to check our regex.
I believe ever cache miss in that graph will result in 1 call to sparql to check 1 regular expression against 1 string.
@Lucas_Werkmeister_WMDE please correct me if I am wrong!

So looking at some numbers from that, 500-12,000 regex checks per minute.
Once T204031 moves from 50 to 100% this could double.

The other way of looking at this is, after every edit, these constraint checks will run, which most likely will involve running a regular expression.
Wikidata edit rate: https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&fullscreen&panelId=1
So normally 400-800 edits per minute / batches of regex runs.

I'll try and pull a few more numbers out closer to what you have requested too.

I have created a temporary dashboard at https://grafana.wikimedia.org/d/HUGEtYPWz/t240884?orgId=1 with some of these number pulled out.

The "Individual regex runs" panel covers what I said in T240884#5802852 about individual regex being run on strings.
As said in the panel description this will happen multiple times per edit / item.

The second panel is the p95 timing for the constraint check that includes a regex run.
Again, multiple of these checks will happen in a single edit / for a single item.

The third panel is my very poor first attempt at figuring out how many of these checks roughly run per edit.
These seem to range from 5-~80(peak) but normally staying between the 5-30 range.
I might be able to come up with a better number for this based on a dump before the meeting.

I believe ever cache miss in that graph will result in 1 call to sparql to check 1 regular expression against 1 string.

I think that’s correct, we don’t batch these requests at the moment.

The second panel is the p95 timing for the constraint check that includes a regex run.
Again, multiple of these checks will happen in a single edit / for a single item.

If I understand correctly, this is just the time of the individual format constraint check itself. A full constraint check for an item will typically involve several format constraint checks, as well as other constraint checks of various types. (I think that’s what you meant as well? The phrasing “constraint check that includes regex” confuses me a bit, because there isn’t really much else happening there besides the regex, except for a bit of housekeeping like getting the string out of the Wikibase value, and assembling the result.)

If I understand correctly, this is just the time of the individual format constraint check itself. A full constraint check for an item will typically involve several format constraint checks, as well as other constraint checks of various types. (I think that’s what you meant as well?

Yep

The phrasing “constraint check that includes regex” confuses me a bit, because there isn’t really much else happening there besides the regex, except for a bit of housekeeping like getting the string out of the Wikibase value, and assembling the result.)

That phrasing is rather meant to distinguish from other constraint check types that do nothing relating to regex at all

I think the main question to answer is "does it make sense to create a safe regex evaluation service?".

I think in a void the answer is "no". It could make sense to create a small C++ program wrapping the main re2 functionality and shell out to it from php.

On the other hand, we have to consider the wikimedia infrastructure for this and there are two counterpoints to be made:

  • Is this a service we can only expect MediaWiki to call? If not, that's a point in favour of creating a separate service
  • Shelling out for us works well by using a combination of firejail and cgroups creation that won't work well in the future with cgroups v2 and containerization
  • Performance might not be extremely relevant

Now on the last point: this proposal seems to worry a lot about performance, but I see no performance requirement spelled out. Without more context, both the choice of shelling out vs and RPC service, and the proposal to use gRPC for said service seem to me like premature optimizations.

So my questions are:

  • What is the 95th percentile of latency in validating all the constraint on an item when editing it?
  • What is the average, median and max number of regexes we need to validate per item?

Without answering those questions, we would just make choices by principle, while I think we should have a more pragmatic approach.

I think there is a question of whether we are just going to have the wikidata thing use this, or eventually all user (including admin) provided regexes? If we also eventually move AbuseFilter & SpamBlacklist to this, I could see performance possibly being a concern since these features often involve running quite a large number of regexes that block saving a page until they complete (That said, I don't have any specific numbers beyond a gut feeling that it is so).

There is https://pecl.php.net/package/re2 . It was written for PHP 5 and was never updated after its initial release in 2011, but we have the skills to update it for PHP 7 and review it for security. If we believe in RE2 then we shouldn't be afraid to invest in it.

daniel added a subscriber: daniel.Jan 16 2020, 11:23 AM

Condensed outcome of a conversation between @Ladsgroup, @Addshore, @Joe, @Krinkle, @tstarling, and myself:

  • If there are other use cases for re2 in MediaWiki, go for a native php binding for re2.
  • If there are other use cases for gRPC on our cluster, try to use it instead of JSON-over-HTTP REST.
  • Otherwise, the most straight forward option seems to be a simple REST service written in node.js using the re2 npm.

So, one key question to answer in this RFC is: Are there other people/projects/teams interested in re2 or gRPC? What are their needs and plans?

So, one key question to answer in this RFC is: Are there other people/projects/teams interested in re2 or gRPC? What are their needs and plans?

One complicating factor here is that AbuseFilter and SpamBlacklist both don't have a clear maintainer.

One complicating factor here is that AbuseFilter and SpamBlacklist both don't have a clear maintainer.

I think @Daimona is understood to be the de facto AF maintainer these days (trusted dev, wmf-NDA, etc.) and is pretty active in its current development.

One complicating factor here is that AbuseFilter and SpamBlacklist both don't have a clear maintainer.

I think @Daimona is understood to be the de facto AF maintainer these days (trusted dev, wmf-NDA, etc.) and is pretty active in its current development.

So, I'm going to answer for myself. I think a re2-like solution would indeed improve performance [1] for regexps-related extensions. AbuseFilter and SpamBlacklist for sure, but also TitleBlacklist, and CentralAuth as of T101615. Given the number of possible consumers, I believe that a reusable service would be the best choice.

Of note, there's also T187669 about adding a static ReDoS validator, in case you want to explore it as an alternative.

[1] - About AbuseFilter performance, some numbers are on grafana, and there's also a dashboard on logstash, although regexps aren't the only responsible for slowness.

Joe added a comment.Jan 22 2020, 9:42 PM

One complicating factor here is that AbuseFilter and SpamBlacklist both don't have a clear maintainer.

I think @Daimona is understood to be the de facto AF maintainer these days (trusted dev, wmf-NDA, etc.) and is pretty active in its current development.

So, I'm going to answer for myself. I think a re2-like solution would indeed improve performance [1] for regexps-related extensions. AbuseFilter and SpamBlacklist for sure, but also TitleBlacklist, and CentralAuth as of T101615. Given the number of possible consumers, I believe that a reusable service would be the best choice.

Of note, there's also T187669 about adding a static ReDoS validator, in case you want to explore it as an alternative.

[1] - About AbuseFilter performance, some numbers are on grafana, and there's also a dashboard on logstash, although regexps aren't the only responsible for slowness.

Performace would be better if you interpret regexes locally with re2, so by using i.e. a php extension as @tstarling suggested.

The advantage of a standalone service is that it works better if we need to use re2 from different services, so not just within MediaWiki.

Krinkle renamed this task from Standalone service to evaluate user-provided regular expressions to RFC: How to evaluate user-provided regular expressions.Apr 3 2020, 10:11 PM
Krinkle updated the task description. (Show Details)
Krinkle moved this task from Under discussion to P3: Explore on the TechCom-RFC board.

I wonder if T260330: RFC: PHP microservice for containerized shell execution could be used for this? (That task is apparently now in progress.)

Base added a subscriber: Base.Aug 18 2020, 12:16 PM

I wonder if T260330: RFC: PHP microservice for containerized shell execution could be used for this? (That task is apparently now in progress.)

I'm not sure that's what you want. I'm doing shell execution by RPC, but it sounds like you just want RPC. Maybe a similar solution is needed?

Well, we could use something like php -r 'echo preg_match( $user_regex, $input );' as the shell command, effectively going back to proposal 2 rather than 3. (For some reason, I also assumed that T260330 included built-in support for running PHP code and not just shell commands, but I think that was a misunderstanding.)

OK, I'm adding PHP execution to the service.