= Very WIP =
= Problem statement =
Wikidata validates statements against the constraints defined on the associated property. The constraints are user-defined, and one of the possible constraint types for text values is a regex pattern. As an example, [[https://www.wikidata.org/wiki/Property:P345|IMDb values]] need to [[https://www.wikidata.org/wiki/Property:P345#P1793|follow this regex]]: `ev\d{7}\/(19|20)\d{2}(-\d)?|(ch|co|ev|tt|nm)\d{7}|(tt|ni|nm)\d{8}`, and values that do not comply with this regex cause a constraint violation. It's important to note that both the regex and the text we evaluate it against are provided by users.
Because a potentially malicious regex could, for example, trigger catastrophic backtracking and tie up server resources, the MediaWiki PHP backend for Wikidata does not use PHP's `preg_match`. Instead, regex evaluation needs to be isolated in some way.
The current workaround uses the SPARQL query service, which incurs a lot of overhead (network latency, TCP, HTTP, SPARQL parsing, query engine preparation) and makes format constraint checks slow even for benign regexes. We should investigate whether we can check regexes more locally. However, the mechanism should be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.
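To make the format constraint concrete, here is a minimal sketch that checks a few values against the IMDb pattern quoted above. It uses Python's `re` module purely for illustration (this is not how Wikidata evaluates constraints), the helper name `satisfies_format` is hypothetical, and it assumes that a value must match the pattern in full rather than merely contain a match.

```lang=python
import re

# Pattern copied from the P345 format constraint quoted above.
IMDB_PATTERN = r"ev\d{7}\/(19|20)\d{2}(-\d)?|(ch|co|ev|tt|nm)\d{7}|(tt|ni|nm)\d{8}"


def satisfies_format(pattern: str, value: str) -> bool:
    """Return True if the whole value matches the pattern (assumed semantics)."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    # Three values that fit one of the alternatives, and one that is too short.
    for value in ["tt0111161", "ev0000003/2015", "nm12345678", "tt123"]:
        print(value, satisfies_format(IMDB_PATTERN, value))
```

Calling `re` in-process like this is fine for a trusted pattern, but it is exactly what cannot be done safely when the pattern itself is user-provided.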
= Previous discussions =
* {T176312}
== Alternative ideas ==
* Evaluating the regex using Lua
** Pros: No need for another service, easier to implement, more performant due to lack of network roundtrips
** Cons: Lua doesn't implement most PCRE features (see T176312#3625405)
* A PHP program called as a subprocess inside Firejail.
** Pros: More performant due to lack of network roundtrips, less work to implement.
** Cons: Less secure; other services would need to re-implement it as well.
* Using re2:
** Pros: It works...
** Cons: ... partially. Still needs php binding or service.
= Proposed solution =
The proposed solution is a stand-alone, sandboxed service for evaluating user-provided regexes, accessible via gRPC from the rest of the infrastructure, including MediaWiki nodes.
A possible client-side design would be something like this:
```lang=php
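// GrpcRegexClient and GrpcRegexRequest are placeholder names for the generated gRPC stubs.
$client = new GrpcRegexClient( 'node-server:9090', [
	// Insecure channel, for illustration only.
	'credentials' => Grpc\ChannelCredentials::createInsecure(),
] );
$request = new GrpcRegexRequest();
$request->setRegex( $userProvidedRegex );
$request->setText( $userProvidedText );
list( $response, $status ) = $client->Evaluate( $request )->wait();
if ( $status->code !== Grpc\STATUS_OK ) {
	// The service was unreachable or aborted the evaluation (e.g. on timeout).
	throw new RuntimeException( 'Regex evaluation failed: ' . $status->details );
}
echo $response->getResult() . "\n";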
```
On the server side, we will have a service (written in Node.js, PHP, Python, or anything else; the language doesn't matter) that takes a regex and a string and evaluates the regex against the string (there may already be an npm library for this). The sandboxing makes sure that a) proper circuit breakers are in place and b) a malicious request that manages to break out cannot access private data or cause large-scale damage.
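As a rough illustration of the circuit-breaker idea, the sketch below evaluates each regex in a short-lived child process and kills it when a time limit is exceeded. Everything here is an assumption made for illustration: the function name `evaluate`, the result strings, the JSON-over-stdin protocol, and the 2-second default are not part of the proposal, and Python is used only because it is one of the languages named above. A process timeout alone is not a sandbox; the isolation from private data would still have to come from the surrounding environment.

```lang=python
import json
import subprocess
import sys

# Child program: reads {"regex": ..., "text": ...} as JSON on stdin
# and prints a single result word on stdout.
_CHILD = r"""
import json, re, sys
req = json.load(sys.stdin)
try:
    print("match" if re.fullmatch(req["regex"], req["text"]) else "no_match")
except re.error:
    print("invalid_regex")
"""


def evaluate(regex: str, text: str, timeout: float = 2.0) -> str:
    """Evaluate regex against text in a separate process with a time limit."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=json.dumps({"regex": regex, "text": text}),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising the exception.
        return "timeout"
    # An empty stdout means the child crashed before printing a result.
    return proc.stdout.strip() or "error"


if __name__ == "__main__":
    print(evaluate(r"\d+", "12345"))
    # Nested quantifiers cause exponential backtracking on this input.
    print(evaluate(r"(a+)+$", "a" * 40 + "b", timeout=0.5))
```

Spawning a process per request is slow, so a real service would more likely keep a pool of workers and replace any worker that exceeds its budget. The point of the sketch is only that a runaway evaluation is cut off instead of blocking the caller.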