Page MenuHomePhabricator

[Task] Evaluate pattern constraints (safely)
Closed, ResolvedPublic

Description

Pattern constraints are based on regular expressions. Evaluating user supplied regular expressions on the server is however a potential DoS vector, since it is easy to write malicious expressions that would use a lot of CPU and RAM when evaluated, typically by causing catastrophic backtracking. Furthermore, care must be taken to prevent PHP code injection via the /e flag.

For this reason, pattern constraints are currently disabled.

This ticket calls for an investigation and implementation that would allow pattern constraints to be applied safely. This may be done by sufficiently sandboxing PCRE evaluation, or by moving to a different language for patterns (e.g. an extended version of glob, or a restricted form of regular expressions).

Note that for the use case at hand, namely, validating external identifiers, the full power of regular expressions is not needed. A simpler subset that is restricted to greedy possessive matches (that is, no backtracking) may be sufficient.


Relevant thread on StackOverflow: https://stackoverflow.com/questions/31256970/how-do-i-sandbox-the-evaluation-of-user-supplied-patterns

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description. (Show Details)
daniel subscribed.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Using PCRE limits to restrict backtracking and recursion would be one possible approach.

However, this means some expressions will fail to evaluate on some input. There is no way to indicate this to the user who created the pattern. Also, it may be tricky to detect evaluation failures, and it's unclear how to surface them if they are detected.

Glob patterns may be an option, using fnmatch. However, glob doesn't have a way to define a fixed number of repetitions, so "four digits" would have to be written as [0-9][0-9][0-9][0-9]. Also, the common "*" and "?" wildcards are not very useful for the use case at hand.

I did some analysis of what regex features are actually used: https://www.wikidata.org/wiki/User:Popcorndude/formats

Only 6 properties use backrefs, lookahead, or lookbehind: P1256, P1257, P1297, P1472, P1612, P1793. Everything else uses only "?*+ [] () | {} {,} \d\D\s\S\w ."

Is this analysis helpful? If so, is there anything I else you'd like me to check?

Of those 6 properties, 2 have "optionally the same character twice", 2 have "does not start with", and the other 2 are actually non-capturing groups that I misidentified as lookarounds.

@Popcorndude Thanks for the analysis! Knowing what kinds of patterns are used how often is definitely a good thing.

The most important question however is a bit harder to answer: which constraints use patterns that can trigger catastrophic backtracking http://www.regular-expressions.info/catastrophic.html, and how hard would it be to rewrite them not to allow that? Also, how hard would it be to automatically check the expressions for this issue?

I letting things like 0?\d{8} through the filter, and most of what's left is checking file extensions. I can make them not backtrack at all if commons filenames don't contain periods (I don't know what characters are allowed). They are generally of the form .*\.(<list of extensions>)

Besides files and the 6 I mentioned before, the only ones I see are
P1949 oai:.*:.*
P274 ([αβγδφωλμπ]-)?([([]*[A-Z☐][ub]?[a-z]?[₁₂₃₄₅₆₇₈₉₀]*(\)?[¹²³⁴⁵⁶⁷⁸⁹⁰]*[⁺⁻]?)?[])|,₁₂₃₄₅₆₇₈₉₀]*(·\(?[-0-9.]*n?\)?)?)+
P367 .+ (S|s)ymbol.*\.svg
P395 ([A-Z0-9\p{L}]+(,\w)?)+
P866 (([a-z]+|39)\-)*([1-9][0-9]*|[a-z]+|[1-9][a-z0-9][a-z0-9])

everything else either doesn't have this problem (as far as I can tell) or are guaranteed to be looking for less than 10 characters.

\(?\|?([^*+(){}\[\]]|\[(\\\]|[^\]])*\]|\(([^()*+]|[+*]\|)*\)|(?!<\\)\\[+*]|(\\d|\[0-9\])([*+]|\{\d+(,\d+|)\})(?!\\d|\d|\[0-9)|\{\d+\})*([*+]|\{\d+(,\d+|)\})?\|?\)?

matches 624 of the ok ones, and should only match ok ones, though some will fail.

How about using (?>....) independent sub-expressions to avoid backtracking?

As to P1949, /^oai:.*?:.*$/ or /^oai:[^:]*:.*$/ would work (since we don't capture anyway)

We could also disallow .* that isn't anchored to the end of the input via a fixed literal. That is, /^foo.*bar$/ would be ok, but /^.*x.*$/ would not. /^.*?x.*$/ would be fine though. Have a look at Death to Dot Star!

Maybe we should spell out the pattern matching features we actually need to cover our primary use case, matching identifiers. I'd suggest:

  • Literal, e.g. abc
  • Character set, e.g. a[bc].
  • Character range, e.g. foo[0-9]
  • Repetition for an exact number of times, e.g. [ab]{2}
  • Repetition for a range of times number of times, e.g. [ab]{2,5}
  • Alternatives, e.g. ab|cd.

Expressions are always anchored (have to match the full string), and are always case sensitive.

I don't think we need escapes for special characters like \t, and I don't think we need character classes like \w. I also don't think we need inverse sets like [^bc], or infinite repetition using a+ or a* or even a{2,}.

We also don't need grouping, I think. Especially no nested groups. It might be handy when using alternatives, but it complicates things a lot.

Those criteria accept 62 (8%) of the current constraints.
Adding character classes (\d is everywhere) brings it up to 166 (23%)

I would suggest allowing infinite repetition if the thing being repeated cannot overlap with the next thing. Although to prevent that from requiring the checking regex to execute the format on itself maybe infinite repetition should be allowed on character classes if they don't overlap with the following thing. (e.g. \d*[a-z] but not \d*0), and then also treat them as atomic (automatically convert \d+ to (?>\d+))

It would probably make things a lot easier for the people making the formats if each property was allowed to have multiple format strings (match this or this).

As for grouping, maybe allow it but only non-nested atomic, since the alternative is to rewrite this (admittedly slightly extreme example) as a flat list of alternatives

https?://((academia|android|anime|apple|arduino|astronomy|aviation|beer|bicycles|biology|bitcoin|blender|boardgames|bricks|buddhism|chemistry|chess|chinese|christianity|codegolf|codereview|cogsci|cooking|craftcms|crypto|cs|cstheory|datascience|dba|diy|drupal|dsp|earthscience|ebooks|electronics|ell|emacs|english|expatriates|expressionengine|fitness|freelancing|french|gamedev|gaming|gardening|genealogy|german|gis|graphicdesign|ham|hermeneutics|hinduism|history|homebrew|islam|italian|japanese|joomla|judaism|linguistics|magento|martialarts|math|matheducators|mathematica|mechanics|meta|moderators|money|movies|music|networkengineering|opendata|outdoors|parenting|patents|pets|philosophy|photo|physics|pm|poker|politics|productivity|programmers|puzzling|quant|raspberrypi|reverseengineering|robotics|rpg|russian|salesforce|scicomp|scifi|security|sharepoint|skeptics|softwarerecs|sound|space|spanish|sports|sqa|startups|stats|sustainability|tex|tor|travel|tridion|unix|ux|video|webapps|webmasters|windowsphone|wordpress|workplace|writers)\.stackexchange\.com|askubuntu\.com|mathoverflow\.net|pt\.stackoverflow\.com|serverfault\.com|stackapps\.com|stackoverflow\.com|superuser\.com)/(tags|questions/tagged)/.*
(P1482)

I'll see if I can make some regexs for this in a bit.

@Popcorndude: I don't really know what's going on here (sorry if I've completely misunderstood), are you only looking at a subset of the format constraints? I know P1814 is using \p and P898 is using \x and you didn't mention either of those even though they're using things other than "?*+ [] () | {} {,} \d\D\s\S\w .".

My apologies. I eliminated those in my initial analysis and forgot to mention it. The full list of things with backslashes in front of them:
bdDpsSwx2()[]{}|^\/$?+*,-.

this matches the constraints I suggested:

^(?!.*?\.(\+|\*|\{\d+,\})\()(\\.|[^()\\\[\]]|\[([^\\\[\]]|\\.)*\]|\((?!\?)(\\.|[^()\\]|\[([^\\\[\]]|\\.)*\])*\))+$

these convert infinite repetition (other than .*(), .+(), and .{n,}()) to atomic groups:

(\[([^\[\]]|\\.)*\])(\+|\*|\{\d+,\})(?>\1\2)
((?<!\\)[^\[\](){}.])(\+|\*|\{\d+,\})(?>\1\2)
((?<!(\\\\)*\\)\\.)(\+|\*|\{\d+,\})(?>\1\2)
(\(([^()]|\\.)*\))(\+|\*|\{\d+,\})(?>\1\2)
\.(\+|\*|\{\d+,\})(\\.|[^()\[\]\\])(?>[^\2]\1)\2
\.(\+|\*|\{\d+,\})\[(([^\\\[\]]|\\.)*)\](?>[^\2]\1)[\2]

(incidentally, this accepts 669/715 (93%) of the current constraints)

Those criteria accept 62 (8%) of the current constraints.
Adding character classes (\d is everywhere) brings it up to 166 (23%)

Yea, I can imagine. My goal was to suggest a minimal set of patterns that would allow us to cover all the use cases, not a set that would cover most current patterns. The idea is to force people to use simpler tools when making the patterns. That prevents patterns with bad runtime behavior.

I messed with the constraints a bit, and it would be pretty easy to get up to ~50% with the constraints you outlined (the numbers I gave before may have forgotten to skip newlines, lowering the count). Adding + and * covers 3/4, and most of the rest could be rewritten without to much trouble (other than P1793 and possibly a few others that are really basically impossible).

(\\.|[^(){}\[\]\\]|\{\d+(,\d+)?\}|\[(?!^)(\\\[|\\\]|[^\[\]])*\])*

should work (though it does let character classes through).

After a bit more fiddling:

P234, P274, P305, P353, P395, P428, P529, P866, P998, P1190, P1256, P1257, P1472, P1612, P1662, P1793, P1986, P2015 are the only properties where I don't see a way to rewrite the format with your constraints plus infinite repetition (i.e. no groups), and with a lot of these there probably is a upper limit, it's just that no one bothered to write it into the format.

With that in mind, the list shortens to P234, P274, P428, P1472, P1612, P1793 (most of which are probably things you didn't intend to support anyway).

Weeks ago I played around with the current set of patterns given in

and tried to create a parser that evaluates these regex patterns. This is the best I could come up with:

/^(\(\?i\))?(([^\+*?[^\]$(){}|]|\\[dDsSwW\W]|\\u[\dA-F]{4}|\[[^\]]+\])([+*?]|\{\d+(,\d*)?\})?)+$/

  • Total number of patterns: 605
  • Unique patterns: 384
  • Only 6 of the existing patterns are compatible with fnmatch.
  • 449 patterns pass my tester above and can be considered "very simple". They neither use any grouping ((…)) nor alternatives (|) nor problematic repetitions like *?.
  • 156 patterns can be considered "complicated".
    • A few of them can be rewritten, e.g. (S|)\d{4} can be rewritten as S?\d{4}. But most of them can not, mostly because they use (…)? or (…|) or even …| (the outer "zero" group does not need round brackets) to mark a segment or the whole thing as alternative or optional. I'm not sure if this should be allowed because it opens the can of "catastrophic backtracking" worms.
    • When allowing simple grouping as in the patterns that match file name extensions (by adding \(\w+(\|\w+)*\) to the tester above) 35 additional patterns can be validated.
    • The most catastrophic fragment I found is …( [A-Z][a-z]+(-([A-Z])?[a-z]+)*)*…. These are three (!) nested repetitions. Backtracking gone insane. If we need proof that it was a good idea to disable the code that runs these regexes, this is it.

My tester above is a proof of concept. Instead of using a regex to evaluate regexes I would love to create a tiny parser in PHP.

What if backtracking, grouping, and alternatives were all disabled, and each property could have multiple format constraints, and a value would pass if it matched any of them? Most paterns would need rewriting, but only for about 18 of them would it actually be difficult.

Jonas renamed this task from Evaluate pattern constraints (safely) to [Task] Evaluate pattern constraints (safely).Nov 2 2015, 10:32 AM
Jonas set Security to None.

While reading http://www.regular-expressions.info/catastrophic.html#example I realized that disabling certain regex features is not enough to prevent "catastrophic backtracking" completely. It just makes it less likely. The reason is that backtracking always happens when a length is variable. And this can always become "catastrophic", even with a few simple stars but nothing else. Backtracking is a core feature of regular expressions. It does not make much sense to disable it, except for fixed length identifiers like ISBN-13, which can not contain less or more than 13 characters.

So the "regex parser/validator/compiler" library I proposed above can not solve this issue. It could help by adding atomic grouping to every sub-pattern with a variable length, effectively disabling backtracking. But this would also change the meaning and function of most patterns we have and make them mismatch all the wrong things.

I'm reading and thinking about this for months now and it seems there is no solution. Either have regular expressions or not.

Update: We’re currently checking the format constraint on the Wikidata Query Service, see T102752.