Wikidata validates statements against the constraints from the associated property. The constraints are user-defined and one of the possible constraint types for text values is a regex pattern.
Due to the impact of potentially malicious regexes, the MediaWiki PHP backend for Wikidata does not use PHP's preg_match. Instead, we need to isolate this in some way.
The current workaround uses the SPARQL query service, which incurs a lot of overhead (ping, TCP, HTTP, SPARQL parsing, query engine preparation), which results in bad timing of the format constraint even for benign regexes. We should investigate whether we can check regexes more locally. However, the mechanism should be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.
I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps using LuaSandbox?
We can’t directly uses Lua’s regexes because their syntax is too different from PCRE, and implementig a PCRE-compatible engine in Lua is probably too much work. I’m not familiar enough with firejail to comment on that option.
- Lua (via PHP binding?)
- PHP program called as sub process within a Firejail.
- re2 (https://github.com/google/re2), either via a microservice, or via PHP binding.