### Problem statement
* Affected components: Wikibase extension for MediaWiki.
* Engineer for initial implementation: Wikidata team (WMDE).
* Code steward: Wikidata team (WMDE).
### Motivation
Wikidata validates statements against the constraints defined on the associated property. The constraints are user-defined, and one of the possible constraint types for text values is a regex pattern. For example, [[https://www.wikidata.org/wiki/Property:P345|IMDb values]] need to [[https://www.wikidata.org/wiki/Property:P345#P1793|follow this regex]]: `ev\d{7}\/(19|20)\d{2}(-\d)?|(ch|co|ev|tt|nm)\d{7}|(tt|ni|nm)\d{8}`. Values that do not comply with this regex cause a constraint violation. It is important to note that both the regex and the text we evaluate it against are provided by users.
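To make the format constraint concrete, here is a minimal sketch that checks a few sample values against the IMDb regex quoted above. It assumes that a value has to match the pattern in full (hence `re.fullmatch`) and uses Python's `re` module only as a stand-in for a PCRE-style engine. The function name `violates_format` and the sample values are illustrative and not part of Wikibase.

```python
import re

# Pattern copied from the format constraint on P345 (IMDb ID).
IMDB_PATTERN = re.compile(
    r"ev\d{7}\/(19|20)\d{2}(-\d)?|(ch|co|ev|tt|nm)\d{7}|(tt|ni|nm)\d{8}"
)


def violates_format(value: str) -> bool:
    """Return True if the value does not match the pattern in full."""
    return IMDB_PATTERN.fullmatch(value) is None


# Illustrative sample values, not taken from real items.
for sample in ["tt0111161", "ev0000003/2015", "tt123", "Q42"]:
    print(sample, "violation" if violates_format(sample) else "ok")
```

Running such a check in-process is exactly what the rest of this task tries to avoid for untrusted patterns. The sketch only shows what is being evaluated, not where it should run.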
Because a malicious regex can be very expensive to evaluate, the MediaWiki PHP backend for Wikidata does not call PHP's `preg_match` directly. Instead, the evaluation needs to be isolated in some way.
The current workaround uses the SPARQL query service, which incurs a lot of overhead (ping, TCP, HTTP, SPARQL parsing, query engine preparation) and makes the format constraint check slow even for benign regexes. We should investigate whether we can check regexes more locally. However, the mechanism should be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.
@Krinkle wrote in T173696:
> I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps using LuaSandbox?
We can’t directly use Lua’s regexes because their syntax is too different from PCRE, and implementing a PCRE-compatible engine in Lua is probably too much work. I’m not familiar enough with firejail to comment on that option.
### Requirements
* Hard execution time limit.
* Isolate security-wise from main PHP process.
One option is a standalone service (backed by Python, PHP, or Node.js) that takes a regex and a string and evaluates the match. This separation makes sure that A) there are proper circuit breakers in place and B) if a malicious request manages to break out, it won't be able to access private data or cause large-scale damage, because it is confined to its own isolated service.
### Exploration
* Lua (via PHP binding?)
* PHP program called as a subprocess inside Firejail.
* re2 (<https://github.com/google/re2>), either via a microservice, or via PHP binding.
* See
* ...
**Prior discussions**:
* {T176312}
### Proposals
#### Proposal 1: Evaluate regexes using Lua.
Lua has built-in support for hard time and memory limits, with which we have prior experience via Scribunto and LuaSandbox.
**Pros**
* No need for another service.
* Easy to implement.
* Low latency and no network roundtrip (more performant than current solution).
**Cons**
* Lua doesn't implement most PCRE features. We'd have to migrate existing regexes to this new, simplified syntax, which might be hard or impossible (ref T176312#3625405).
#### Proposal 2: Firejail'ed PHP microservice/subprocess.
While calling `preg_match` within the main process is hard to limit safely, it is much easier to limit a subprocess spawned by PHP. Limits can be applied on both ends: cross-socket timeouts on the caller's end, and Firejail and `ulimit` on the subprocess's end to enforce limited memory and execution time.
**Pros**
* Easy to implement.
* Low latency and no network roundtrip (more performant than current solution).
**Cons**
* Harder to secure given PHP has many capabilities.
* Harder to re-use for non-MediaWiki services. (TODO: why?)
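The subprocess isolation described for this proposal can be sketched as follows. This is a minimal illustration in Python rather than PHP (along the lines of the `re.match` subprocess suggested in the quote from T173696), and it substitutes plain POSIX resource limits for Firejail. The function `check_regex`, the JSON-over-stdin protocol, and all limit values are assumptions made for the example, not an agreed design.

```python
import json
import resource
import subprocess
import sys

# Code run in the child: read one JSON request from stdin, print "1" or "0".
CHILD_CODE = r"""
import json, re, sys
request = json.load(sys.stdin)
matched = re.fullmatch(request["regex"], request["text"]) is not None
print("1" if matched else "0")
"""


def _lower_limit(kind, value):
    # Only ever lower a limit; never try to exceed the inherited hard limit.
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def _apply_limits():
    # Runs in the child just before exec (POSIX only). Values are assumptions.
    _lower_limit(resource.RLIMIT_CPU, 10)              # CPU seconds
    _lower_limit(resource.RLIMIT_AS, 512 * 1024 ** 2)  # address space in bytes


def check_regex(regex, text, timeout=5.0):
    """Return True/False for match/no match, or None if the check was aborted."""
    payload = json.dumps({"regex": regex, "text": text})
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            input=payload, capture_output=True, text=True,
            timeout=timeout, preexec_fn=_apply_limits,
        )
    except subprocess.TimeoutExpired:
        return None  # wall-clock limit hit; subprocess.run killed the child
    if proc.returncode != 0:
        return None  # invalid regex, or child killed by a resource limit
    return proc.stdout.strip() == "1"
```

Under these assumptions, a benign pattern returns after one process start, while a pattern such as `(a+)+$` evaluated against a long run of `a` followed by `b` is expected to hit the timeout and yield `None` instead of blocking the caller. The per-call process start is the overhead that a long-running daemon would avoid.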
#### Proposal 3: re2-based microservice/subprocess
The `re2` C++ library from Google implements regular expressions in a sandboxed, linear-time fashion, with the ability to set a hard memory and time budget. It is also as fast as or faster than PCRE in most cases. See <https://github.com/google/re2/wiki/WhyRE2> and <https://github.com/google/re2>.
The library would be wrapped in a simple C service (like Poolcounter) or in a scripting language with a known binding (e.g. Python, Erlang, Perl, Ruby).
**Pros**
* Low latency and no network roundtrip (more performant than current solution).
* Can perform significantly better than other proposals (e.g. if implemented as a daemon with a worker pool, no process start overhead).
* Easier to trust security-wise (no PHP involved).
* Capable of (most) PCRE features.
* Easy to re-use.
* (Maybe) Easier to promote in open-source by not having a dependency on any short-lived major version of PHP/Node.js.
**Cons**
* Harder to implement (although still relatively easy).
### RPC framework
If we go with Proposal 3 (or another solution that may involve network overhead), then this might be a good candidate to try out **[gRPC](https://www.grpc.io/)** instead of traditional HTTP+JSON communication between client and server. Because gRPC is a more efficient network protocol, this should make the network overhead slightly lower. A possible client-side design would be something like this:
```lang=php
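// GrpcRegexClient and GrpcRegexRequest are hypothetical classes generated from the service definition.
$client = new GrpcRegexClient( 'node-server:9090', [
	'credentials' => Grpc\ChannelCredentials::createInsecure(),
] );

$request = new GrpcRegexRequest();
$request->setRegex( $userProvidedRegex );
$request->setText( $userProvidedText );

// A unary call returns [ response, status ] from wait().
list( $response, $status ) = $client->Evaluate( $request )->wait();
if ( $status->code !== Grpc\STATUS_OK ) {
	// The call failed (e.g. timeout or service unavailable).
	throw new RuntimeException( 'Regex service error: ' . $status->details );
}
echo $response->getResult() . "\n";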
```