= Very WIP =
= Problem statement =
Wikidata validates statements against the constraints defined on the associated property. These constraints are user-defined, and one of the possible constraint types for text values is a regex pattern (the format constraint).
Because a malicious regex can consume excessive CPU time (for example through catastrophic backtracking), the MediaWiki PHP backend for Wikidata does not use PHP's preg_match for these patterns. Instead, regex evaluation needs to be isolated in some way.
The current workaround uses the SPARQL query service, which incurs a lot of overhead (network latency, TCP, HTTP, SPARQL parsing, query engine preparation) and makes the format constraint check slow even for benign regexes. We should investigate whether we can check regexes more locally. However, any such mechanism must be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.
= Previous discussions =
* {T176312}
== Alternative ideas ==
* Evaluating the regex using Lua
** Pros: No need for another service, easier to implement, more performant due to lack of network roundtrips
** Cons: Lua patterns don't implement most PCRE features (see T176312#3625405)
* A PHP program called as a subprocess inside Firejail
** Pros: More performant due to lack of network roundtrips
** Cons: Less secure; other services would need to re-implement it as well
* Using re2
** Pros: It works for the regexes it supports
** Cons: It supports only part of the PCRE syntax (for example, no backreferences or lookaround), and it still needs a PHP binding or a separate service
= Proposed solution =
The proposed solution is a stand-alone, sandboxed service for evaluating user-provided regexes, accessible over gRPC from the rest of the infrastructure, including the MediaWiki nodes.
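To make the shape of such a service more concrete, here is a minimal sketch of the request and response it might exchange, written in plain Python rather than generated gRPC code. It assumes a request carrying a regex and a text, and a response carrying a result. The `error` field, the size limits, and the names `RegexRequest`, `RegexResponse`, and `handle_evaluate` are all hypothetical and only illustrate how inputs could be restricted before any matching happens.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical limits; real values would need to be agreed on.
MAX_REGEX_LEN = 1000
MAX_TEXT_LEN = 10000


@dataclass
class RegexRequest:
    regex: str
    text: str


@dataclass
class RegexResponse:
    result: Optional[bool]  # None when the evaluation did not complete
    error: str = ""         # hypothetical field, empty on success


def handle_evaluate(
    request: RegexRequest,
    evaluator: Callable[[str, str], Optional[bool]],
) -> RegexResponse:
    """Reject oversized inputs, then delegate to a sandboxed evaluator.

    The evaluator is expected to return True/False for match/no match,
    or None if it failed or hit its time limit.
    """
    if len(request.regex) > MAX_REGEX_LEN or len(request.text) > MAX_TEXT_LEN:
        return RegexResponse(result=None, error="input too large")
    result = evaluator(request.regex, request.text)
    if result is None:
        return RegexResponse(result=None, error="evaluation failed or timed out")
    return RegexResponse(result=result)
```

Keeping the evaluator as a parameter means the sandboxing mechanism (subprocess, container, re2, ...) can be swapped without touching the request handling.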
A possible client-side design would be something like this:
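```lang=php
// GrpcRegexClient and GrpcRegexRequest would be generated from the service's .proto file.
$client = new GrpcRegexClient( 'node-server:9090', [
	// Plaintext channel (no TLS).
	'credentials' => Grpc\ChannelCredentials::createInsecure(),
] );
$request = new GrpcRegexRequest();
$request->setRegex( $userProvidedRegex );
$request->setText( $userProvidedText );
// wait() blocks until the unary call completes.
[ $response, $status ] = $client->Evaluate( $request )->wait();
if ( $status->code !== Grpc\STATUS_OK ) {
	// $response is null on failure, so check the status first.
	throw new RuntimeException( 'Regex service error: ' . $status->details );
}
echo $response->getResult() . "\n";
```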
On the server side, we would have a service (written in Node.js, PHP, or Python; the language doesn't matter) that takes a regex and a string and evaluates the match (maybe there is an existing codebase for this already?).
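As a rough illustration of what the server-side evaluation could look like, the sketch below runs each match in a short-lived child process and kills it when a time limit is exceeded. It is only a sketch under several assumptions: Python's `re` module stands in for the regex engine (it is not PCRE), the function name `evaluate_regex` and the one-second default are hypothetical, and a real deployment would add further sandboxing (memory limits, no network access, and so on).

```python
import subprocess
import sys

# Program run in the child process: argv[1] is the pattern, argv[2] the text.
# It prints "1" on a match and "0" otherwise; an invalid pattern makes it
# exit with a non-zero status.
_CHILD_PROGRAM = (
    "import re, sys\n"
    "print(1 if re.search(sys.argv[1], sys.argv[2]) else 0)\n"
)


def evaluate_regex(pattern, text, timeout_s=1.0):
    """Return True/False for match/no match, or None on timeout or error."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD_PROGRAM, pattern, text],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None  # likely a pathological regex
    if proc.returncode != 0:
        return None  # for example, the pattern failed to compile
    return proc.stdout.strip() == "1"
```

Starting a process per request is expensive, so a production service would more likely keep a pool of workers and replace any worker that exceeds its limit. The point here is only that the time limit is enforced from outside the regex engine.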