Page MenuHomePhabricator

Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions)
Closed, ResolvedPublic13 Estimated Story Points

Description

Problem

Wikidata validates statements against the constraints from the associated property.
The constraints are user-defined and one of the possible constraint types for text values is a regex pattern.

Due to the impact of potentially malicious regexes, the MediaWiki PHP backend for Wikidata does not use PHP's preg_match. Instead, we need to isolate this in some way.

The current workaround uses the SPARQL query service, which incurs a lot of overhead (ping, TCP, HTTP, SPARQL parsing, query engine preparation), which results in bad timing of the format constraint even for benign regexes. We should investigate whether we can check regexes more locally. However, the mechanism should be tightly restricted in order to avoid denial-of-service attacks via malicious regexes.

Solution

Use shellbox https://www.mediawiki.org/wiki/Shellbox / https://www.mediawiki.org/wiki/Manual:Shell_framework

! In T240884#7092435, @tstarling wrote:
Yes, the RPC endpoint was added to support this use case. Use MediaWikiServices::getInstance()->getShellboxClientFactory()->getClient()->call().

https://github.com/wikimedia/shellbox/blob/master/src/Client.php#L49-L69

Acceptance Criteria:🏕️🌟

  • Have regex checking functionality provided by shellbox
  • Have a flag for using SPARQL or Shellbox REGEX evaluation
  • Have a plan for deployment of shellbox regex checks to wikidata.org (deploying on beta,test,prod Wikidata, and the question of the big bang deployment vs gradual rollout, e.g. using some ratio / % vs some other method)

Notes:

We should reach out to SREs / service ops to tell them about the increased load on the shellbox service before deploying the new regex code.
Based on https://grafana.wikimedia.org/d/000000344/wikidata-quality?viewPanel=10&orgId=1&from=now-7d&to=now this could see peaks of 250k regex checks in a minuite? Though more regular peaks would be around 50k.

Details

SubjectRepoBranchLines +/-
operations/mediawiki-configmaster+2 -0
operations/mediawiki-configmaster+1 -1
operations/mediawiki-configmaster+1 -1
operations/mediawiki-configmaster+1 -1
operations/mediawiki-configmaster+1 -1
operations/mediawiki-configmaster+2 -1
mediawiki/extensions/WikibaseQualityConstraintswmf/1.37.0-wmf.17+1 -1
mediawiki/extensions/WikibaseQualityConstraintswmf/1.37.0-wmf.16+1 -1
mediawiki/extensions/WikibaseQualityConstraintsmaster+1 -1
operations/mediawiki-configmaster+11 -0
mediawiki/extensions/WikibaseQualityConstraintswmf/1.37.0-wmf.16+4 -3
mediawiki/extensions/WikibaseQualityConstraintsmaster+4 -3
mediawiki/extensions/WikibaseQualityConstraintswmf/1.37.0-wmf.17+4 -3
mediawiki/extensions/WikibaseQualityConstraintsmaster+219 -213
mediawiki/extensions/WikibaseQualityConstraintsmaster+130 -23
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 698243 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/WikibaseQualityConstraints@master] [WIP] Use shellbox for regex validation

https://gerrit.wikimedia.org/r/698243

How you can test the patch above with shellbox:

  • Clone shellbox (https://gerrit.wikimedia.org/g/mediawiki/libs/Shellbox/)
  • Change the url in the config/dev-config.json to http://localhost:8080/shellbox.php and secretKey to anything else.
  • Run make run. It will build the docker files and the images and then containers and at the end run them in foreground so you'd see requests coming in.
  • Set shellbox config in your mw:
$wgShellboxUrl = 'http://localhost:8080/shellbox.php';
$wgShellboxSecretKey = 'ANYTHING;

and then set wgWBQualityConstraintsFormatCheckerShellboxRatio to 1.

One rather easy way is to see if it's working is to run FormatterCheck tests and replace mocking with the mw service. You would have to comment out Prevent real HTTP requests from tests section in MediaWikiIntegrationTestCase class.

Troubleshooting the shellbox

  • If tests are failing with "Unknown Content-Type" it means it's erroring out which is responding with Content-Type text/html which mw doesn't recognize. My biggest problem seemed to be configuration around entry point URL. Adding another array_shift( $components ); in Server::guardedExecute() fixed understanding the entry point.
  • If the secret key is not the same as the one set in mw, it will also fail the same way, you can remove the check in MultipartAction::processInput() (the !hash_equals( $multipartReader->getHash(), $authHash )).

Note
Changing php files gets reflected in shellbox server right away but changing configuration files require a rebuild.

If anyone wants to test it without needing to setup shellbox let me know to give the key for https://shellbox-beta.wmcloud.org/call

Change 699326 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/WikibaseQualityConstraints@master] Refactor FormatCheckerTest and test regexes in both Shellbox and Sparql

https://gerrit.wikimedia.org/r/699326

Change 698243 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Use shellbox for regex validation

https://gerrit.wikimedia.org/r/698243

Change 699326 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Refactor FormatCheckerTest and test regexes in both Shellbox and Sparql

https://gerrit.wikimedia.org/r/699326

Change 709821 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Add shellbox-constraint services and use them

https://gerrit.wikimedia.org/r/709821

Change 710222 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseQualityConstraints@master] Route Shellbox requests to 'constraint-regex-checker' service

https://gerrit.wikimedia.org/r/710222

Change 710092 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.16] Route Shellbox requests to 'constraint-regex-checker' service

https://gerrit.wikimedia.org/r/710092

Change 710093 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.17] Route Shellbox requests to 'constraint-regex-checker' service

https://gerrit.wikimedia.org/r/710093

Change 710222 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Route Shellbox requests to 'constraint-regex-checker' service

https://gerrit.wikimedia.org/r/710222

Change 710092 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.16] Route Shellbox requests to 'constraint-regex-checker' service

https://gerrit.wikimedia.org/r/710092

Change 710093 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.17] Route Shellbox requests to 'constraint-regex-checker' service

https://gerrit.wikimedia.org/r/710093

Mentioned in SAL (#wikimedia-operations) [2021-08-05T09:59:29Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.16/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710092|Route Shellbox requests to 'constraint-regex-checker' service (T176312)]] (duration: 01m 27s)

Mentioned in SAL (#wikimedia-operations) [2021-08-05T10:01:18Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.17/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710093|Route Shellbox requests to 'constraint-regex-checker' service (T176312)]] (duration: 01m 06s)

Change 709821 merged by jenkins-bot:

[operations/mediawiki-config@master] Add shellbox-constraint services and use them

https://gerrit.wikimedia.org/r/709821

Change 710227 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseQualityConstraints@master] Add 'constraint-regex-checker' to isEnabled() check as well

https://gerrit.wikimedia.org/r/710227

Change 710094 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.17] Add 'constraint-regex-checker' to isEnabled() check as well

https://gerrit.wikimedia.org/r/710094

Change 710095 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.16] Add 'constraint-regex-checker' to isEnabled() check as well

https://gerrit.wikimedia.org/r/710095

Mentioned in SAL (#wikimedia-operations) [2021-08-05T10:23:07Z] <ladsgroup@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:709821|Add shellbox-constraint services and use them (T176312)]], Part I (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2021-08-05T10:24:29Z] <ladsgroup@deploy1002> Synchronized wmf-config/Wikibase.php: Config: [[gerrit:709821|Add shellbox-constraint services and use them (T176312)]], Part II (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2021-08-05T10:25:38Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:709821|Add shellbox-constraint services and use them (T176312)]], Part III (duration: 01m 06s)

Change 710227 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Add 'constraint-regex-checker' to isEnabled() check as well

https://gerrit.wikimedia.org/r/710227

Change 710095 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.16] Add 'constraint-regex-checker' to isEnabled() check as well

https://gerrit.wikimedia.org/r/710095

Change 710094 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.37.0-wmf.17] Add 'constraint-regex-checker' to isEnabled() check as well

https://gerrit.wikimedia.org/r/710094

Mentioned in SAL (#wikimedia-operations) [2021-08-05T13:25:11Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.16/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710095|Add 'constraint-regex-checker' to isEnabled() check as well (T176312)]] (duration: 01m 19s)

Mentioned in SAL (#wikimedia-operations) [2021-08-05T13:27:58Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.17/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710094|Add 'constraint-regex-checker' to isEnabled() check as well (T176312)]] (duration: 01m 06s)

Change 710297 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable shellbox for constraints for 1% of wikidata

https://gerrit.wikimedia.org/r/710297

Change 710297 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable shellbox for constraints for 1% of wikidata

https://gerrit.wikimedia.org/r/710297

Mentioned in SAL (#wikimedia-operations) [2021-08-05T16:48:01Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710297|Enable shellbox for constraints for 1% of wikidata (T176312)]] (duration: 01m 27s)

Change 710315 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Increase the shellbox ratio to 5% for wikidata

https://gerrit.wikimedia.org/r/710315

Change 710315 merged by jenkins-bot:

[operations/mediawiki-config@master] Increase the shellbox ratio to 5% for wikidata

https://gerrit.wikimedia.org/r/710315

Mentioned in SAL (#wikimedia-operations) [2021-08-05T17:49:46Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710315|Increase the shellbox ratio to 5% for wikidata (T176312)]] (duration: 01m 15s)

Change 710318 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Increase the ratio for shellbox for constraints to 21% in Wikidata

https://gerrit.wikimedia.org/r/710318

Change 710318 merged by jenkins-bot:

[operations/mediawiki-config@master] Increase the ratio for shellbox for constraints to 21% in Wikidata

https://gerrit.wikimedia.org/r/710318

Mentioned in SAL (#wikimedia-operations) [2021-08-05T18:28:09Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710318|Increase the ratio for shellbox for constraints to 21% in Wikidata (T176312)]] (duration: 01m 06s)

Change 710328 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Increase the ratio for shellbox for constraints to 42% in Wikidata

https://gerrit.wikimedia.org/r/710328

Change 710328 merged by jenkins-bot:

[operations/mediawiki-config@master] Increase the ratio for shellbox for constraints to 42% in Wikidata

https://gerrit.wikimedia.org/r/710328

Mentioned in SAL (#wikimedia-operations) [2021-08-05T18:44:36Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710328|Increase the ratio for shellbox for constraints to 42% in Wikidata (T176312)]] (duration: 01m 06s)

For now:

image.png (968×1 px, 116 KB)

Will be interesting once we ramp it up to 100% which we will in Monday.

Change 710919 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable shellbox for constraint for all of wikidata

https://gerrit.wikimedia.org/r/710919

Change 710919 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable shellbox for constraint for all of wikidata

https://gerrit.wikimedia.org/r/710919

Mentioned in SAL (#wikimedia-operations) [2021-08-09T07:30:19Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710919|Enable shellbox for constraint for all of wikidata (T176312)]] (duration: 00m 58s)

Change 710938 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable shellbox constraint for commons wikis

https://gerrit.wikimedia.org/r/710938

Change 710938 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable shellbox constraint for commons wikis

https://gerrit.wikimedia.org/r/710938

Mentioned in SAL (#wikimedia-operations) [2021-08-09T10:36:54Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710938|Enable shellbox constraint for commons wikis (T176312)]] (duration: 00m 57s)

Looks like this cut the median format check time roughly in half, from ca. 15 ms to ca. 7 ms:

aliasByNode(consolidateBy(MediaWiki.wikibase.quality.constraints.check.timing.Q21502404_FormatChecker.median, 'average'), 6)

graphite.png (250×330 px, 15 KB)

That’s not bad, but it means format checks are still one of the most expensive constraint check types. I think T214378: Check simple format constraints (no grouping) in PHP instead of SPARQL might still be worth pursuing.

I’ve updated the constraint type documentation. It’s still not recommended to use PCRE-specific features in patterns, because there’s at least one other implementation (OpenRefine) that uses a different engine (in FormatScrutinizer).

Some fancy graphs:
<snip>

Looks like this cut the median format check time roughly in half, from ca. 15 ms to ca. 7 ms:

This is great \o/

One note is that I don't think we ever did any real tuning of shellbox-constraints's deployment, I just copied the Score shellbox configuration (1 CPU, 2G memory). My feeling is that shellbox-constraints is probably over-provisioned, but it's not really worth trying to cut it down. Just something to keep an eye on as future growth happens.

🎉🎉🎉🎉🎉
Thanks to all involved!!!!