Page MenuHomePhabricator

Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions)
Open, Needs TriagePublic13 Estimated Story Points

Description

Problem

Wikidata validates statements against the constraints from the associated property.
The constraints are user-defined and one of the possible constraint types for text values is a regex pattern.

Due to the impact of potentially malicious regexes, the MediaWiki PHP backend for Wikidata does not use PHP's preg_match. Instead, we need to isolate this in some way.

The current workaround uses the SPARQL query service, which incurs a lot of overhead (ping, TCP, HTTP, SPARQL parsing, query engine preparation), which results in bad timing of the format constraint even for benign regexes. We should investigate whether we can check regexes more locally. However, the mechanism should be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.

Solution

Use shellbox https://www.mediawiki.org/wiki/Shellbox / https://www.mediawiki.org/wiki/Manual:Shell_framework

! In T240884#7092435, @tstarling wrote:
Yes, the RPC endpoint was added to support this use case. Use MediaWikiServices::getInstance()->getShellboxClientFactory()->getClient()->call().

https://github.com/wikimedia/shellbox/blob/master/src/Client.php#L49-L69

Acceptance Criteria:🏕️🌟

  • Have regex checking functionality provided by shellbox
  • Have a flag for using SPARQL or Shellbox REGEX evaluation
  • Have a plan for deployment of shellbox regex checks to wikidata.org (deploying on beta,test,prod Wikidata, and the question of the big bang deployment vs gradual rollout, e.g. using some ratio / % vs some other method)

Notes:

We should reach out to SREs / service ops to tell them about the increased load on the shellbox service before deploying the new regex code.
Based on https://grafana.wikimedia.org/d/000000344/wikidata-quality?viewPanel=10&orgId=1&from=now-7d&to=now this could see peaks of 250k regex checks in a minuite? Though more regular peaks would be around 50k.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

It sounds like ORES might also benefit from such a service to safely evaluate REs (T168965, blog post) – @Halfak do you agree?

I'm not sure we'd make use of an external service. In our case, I think a more robust timeout and some testing gets us what we need.

Just to clarify the situation: we do constraint checks any time a user with the checkConstraints gadget enabled visits an entity page, or after they save a statement. A constraint check can involve multiple regex checks: possibly several per statement, and a full constraint check involves all statements of an entity. We already store results of individual regex checks in the WANObjectCache (T173696 – cache hit rate roughly estimated at 30%, see Grafana), and plan to soon cache full check results as well (T181060, T184812). The gadget is currently used by 323 users, 231 of them active, according to Special:GadgetUsage, but we plan to have it enabled for all logged-in users by 2018-04-04 (T184069).

As an interim solution, would it be okay to use unsandboxed preg_match for certain regexes which are known to be safe, and to continue to use SPARQL for all others? For example, currently, 2304 out of 2999 regexes in format constraints don’t contain any parentheses (query). That’s a very crude metric, but I think it’s a safe one (plenty of safe regexes have parentheses, but I don’t think it’s possible to construct a ReDoS attack without any nesting), and if it already classifies over ¾ths of regexes as safe, that could save us a lot of SPARQL calls.

I don’t think it’s possible to construct a ReDoS attack without any nesting

Wouldn't something like a*b*[ac]*$ be still dangerous? Maybe not as dangerous as nested ones, but seems to still have some evil potential.

There's also this one: https://github.com/substack/safe-regex which may expand the possibilities a little.

safe-regex mainly works by measuring the star height of the regex, and disabling nesting is a crude way to limit the star height to 1, so that we don’t need to parse the regex. (a*b*[ac]*$ has multiple stars, but its star height is still one.)

That said, a slightly modified version of your regex (a*a*b$) does cause long query runtimes when tested against a long string of as in WDQS. But I can’t reproduce that behavior locally with Java String.matches() nor with PHP preg_match(), so I’m not sure if that’s a problem related to any regex engine or some BlazeGraph weirdness.

Hi, just wanted to mention that TechCom has been reading along here. Let us know if you need our input and/or would like an IRC meeting for finding more options, or narrowing down options, or approving a specific option.

BTW looks like T200630 provides another example of failing regexp, though in Python.

That said, a slightly modified version of your regex (a*a*b$) does cause long query runtimes when tested against a long string of as in WDQS.

I can’t reproduce this anymore – unfortunately I didn’t record which query I used for this, but now this query runs almost instantly. Since I’d really like to move ahead with this, I’ve filed T214378 for this improvement (because that only applies to a subset of regexes, so we can keep this task around for safely evaluating all user-provided regexes).

@Lucas_Werkmeister_WMDE do you think Tech Com can be useful here or should this task just live in a backlog?

Just as an aside, the dirt simple solution here would be to shell out to grep -p (or even just to php fed just the preg_match call) and rely on limit.sh to prevent undue resourse usage.

Edit: sorry, i see now this was already mentioned in another comment

Krinkle moved this task from Under discussion to P3: Explore on the TechCom-RFC board.
Krinkle added a subscriber: Krinkle.
Krinkle updated the task description. (Show Details)

I've merged the RFC bits into T240884, which seems to be about the same thing.

Pending confirmation on the RFC at T240884#7019661
And then this would be blocked on T260330 & some sub tasks there to be able to pick this up

Yes, the RPC endpoint was added to support this use case. Use MediaWikiServices::getInstance()->getShellboxClientFactory()->getClient()->call().

Change 698243 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/WikibaseQualityConstraints@master] [WIP] Use shellbox for regex validation

https://gerrit.wikimedia.org/r/698243

How you can test the patch above with shellbox:

  • Clone shellbox (https://gerrit.wikimedia.org/g/mediawiki/libs/Shellbox/)
  • Change the url in the config/dev-config.json to http://localhost:8080/shellbox.php and secretKey to anything else.
  • Run make run. It will build the docker files and the images and then containers and at the end run them in foreground so you'd see requests coming in.
  • Set shellbox config in your mw:
$wgShellboxUrl = 'http://localhost:8080/shellbox.php';
$wgShellboxSecretKey = 'ANYTHING;

and then set wgWBQualityConstraintsFormatCheckerShellboxRatio to 1.

One rather easy way is to see if it's working is to run FormatterCheck tests and replace mocking with the mw service. You would have to comment out Prevent real HTTP requests from tests section in MediaWikiIntegrationTestCase class.

Troubleshooting the shellbox

  • If tests are failing with "Unknown Content-Type" it means it's erroring out which is responding with Content-Type text/html which mw doesn't recognize. My biggest problem seemed to be configuration around entry point URL. Adding another array_shift( $components ); in Server::guardedExecute() fixed understanding the entry point.
  • If the secret key is not the same as the one set in mw, it will also fail the same way, you can remove the check in MultipartAction::processInput() (the !hash_equals( $multipartReader->getHash(), $authHash )).

Note
Changing php files gets reflected in shellbox server right away but changing configuration files require a rebuild.

If anyone wants to test it without needing to setup shellbox let me know to give the key for https://shellbox-beta.wmcloud.org/call

Change 699326 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/WikibaseQualityConstraints@master] Refactor FormatCheckerTest and test regexes in both Shellbox and Sparql

https://gerrit.wikimedia.org/r/699326

Change 698243 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Use shellbox for regex validation

https://gerrit.wikimedia.org/r/698243

Change 699326 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Refactor FormatCheckerTest and test regexes in both Shellbox and Sparql

https://gerrit.wikimedia.org/r/699326