Page MenuHomePhabricator

WDQS's REGEX() function is not subject to timeout
Closed, ResolvedPublic

Description

The REGEX() function in the Wikidata Query Service is not subject to the timeout of 60 s, allowing an attacker to take up much more CPU time on the service than should be allowed.

One example of a pathological regular expression is (x+x+)+y (from CodingHorror), which takes exponential time (twice as long per extra input character) to verify that xxxxx… does not match the regular expression.

The following query returned false after 90 s:

SELECT (REGEX("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y") AS ?b) {}

The following query resulted in a 504 Gateway Time-out error after an unknown duration (I didn’t check, probably between 3 and 15 minutes):

SELECT (REGEX("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y") AS ?b) {}

BlazeGraph implements the REGEX() function directly with java.util.regex regexes, which are implemented in C and not interruptible (which is why the timeout has no effect). There is a workaround (custom CharSequence implementation), but the real fix should be for BlazeGraph to implement regular expressions properly – SPARQL’s regular expression language is that of XQuery 1.0 and XPath 2.0, which AFAIU is not identical to Java regexes anyways (for instance, I’m pretty sure it shouldn’t support Java’s named capturing group syntax, i. e. SELECT (REGEX("aa", "(?<name>a)\\k<name>") AS ?b) WHERE {} should result in an error).

Event Timeline

I’m pretty sure it shouldn’t support Java’s named capturing group syntax

While this support is not requires, I don't think it hurts anything, so any engine that supports basic capabilities should be fine.

TCL/HSRE doesn’t seem to support the \p{Lu} syntax (for Unicode categories).

I also disagree that support for nonstandard features is okay – the SPARQL spec says that REGEX is exactly XPath fn:matches (and REPLACE is exactly XPath fn:replace). And according to the XPath spec, those functions should raise errors if the pattern is invalid.

More practically speaking, supporting nonstandard features means that users will have no indication that they’re using a nonstandard feature, and that we’ll break queries whenever we switch regex engines.

Legoktm renamed this task from REGEX() function is not subject to timeout to WDQS's REGEX() function is not subject to timeout.Jul 7 2017, 9:20 AM

the SPARQL spec says that REGEX is exactly XPath fn:matches (and REPLACE is exactly XPath fn:replace). And according to the XPath spec, those functions should raise errors if the pattern is invalid.

Frankly, I see little use in this requirement. Who ever needs their query to fail?

More practically speaking, supporting nonstandard features means that users will have no indication that they’re using a nonstandard feature, and that we’ll break queries whenever we switch regex engines.

We already support many features not present in standard SPARQL. Yes, they won't work if we ever switch engines. But that's not something we do frequently, only when really necessary, and with proper announcements, etc. If we find RE engine that matches XPath requirement exactly to the point, fine. But if not, that should not be a deal-breaker by itself.

thiemowmde added a subscriber: Lydia_Pintscher.

If I might add my two cents:

  • Sure, having any regex engine is better than nothing. But a regex engine that supports so much more than expected from the SPAQRL standard is a problem, and should not be denied so easily. Users will not read what the standard says, but just play around with the service and try what works and what not. They will write all kinds of advanced regular expressions that make use of the undocumented features and start relying on them. This is when it becomes a problem. But this is not what this ticket is about.
  • This ticket is about this feature possibly being used as a DoS vector, either intentionally or unintentionally when more users use more complicated queries that contain regular expressions. Is this an actual problem? Do we need to do something about this in advance? I can't tell.

It looks like there's actually two issues here:

  1. regex is not interrupted by regular query timeout handling in Blazegraph
  2. Constant expressions seem to be handled by main service thread and not query executor thread, and the former is not subject to query executor timeouts.

(1) can be fixed/mitigated by InterruptibleCharacterSequence hack linked above. Still looking into the fix for (2), it may be trickier.

This is subject to regular timeouts now. Blazegraph issue is still not fully resolved, but I made some quick patches for the meantime.

Is the fix for this issue public somewhere? I can’t find any commits in the wikidata/query/rdf repository that look related.

And by the way, do we still need a custom visibility policy for this task or can it be public now?

And (sorry forgot the fix for the 2nd issue): https://github.com/blazegraph/database/commit/3ddea636ee6e2b6ea0c3fdcae41b36d8ff93e3e6 which is also shipped with blazegraph 2.1.5

I can’t figure out how to make this task public. @chasemp can you please do it?

I think we can make this task public, the workaround has been shipped to blazegraph 2.1.5 via https://github.com/blazegraph/database/commit/d13f320ffefc90d668a3411c466ec3c1adf00e10

I can’t figure out how to make this task public. @chasemp can you please do it?

Macro challenge-accepted:

chasemp changed the visibility from "Custom Policy" to "Public (No Login Required)".Sep 16 2020, 1:45 PM