Subject: [PATCH] Parser: Fix quadratic regexp edge case
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
In external link syntax like `[http://example.org Example]`,
the space between link target and label is technically optional
when the label starts with characters not allowed in the URL,
such as `[http://example.org<b>Example</b>]`.
This is done with a regexp that matches a required opening bracket,
a required URL, optional spaces, optional label, and a required
closing bracket. The real regexp is messy to handle various characters
allowed in each part, but for illustration purposes, it's basically
the same as `\[([^\]\s\<]+) *([^\]\s]*?)\]`.
When given input that looks like a link, but doesn't have the closing
bracket, the regexp engine (PCRE) would therefore attempt matching
every possible combination of target and label lengths before failing:
Input: [http://example.org
| Target | Label |
| ------------------ | ----- |
| http://example.or | g |
| http://example.o | r |
| http://example.o | rg |
| http://example. | o |
| http://example. | or |
| http://example. | org |
| http://example | . |
| http://example | .o |
| http://example | .or |
| http://example | .org |
| http://exampl | e |
| http://exampl | e. |
| http://exampl | e.o |
| http://exampl | e.or |
| http://exampl | e.org |
…and so on. This would take (1 + 2 + 3 + … + 18) = 171 steps to fail
in this example, or `N * (N+1) / 2` steps in general. For sufficiently
large inputs this hits a limit designed to protect against exactly
this situation, and the whole wikitext parser crashes.
(To hit the pathological case, it's also required for a `]` to appear
somewhere later in the input, otherwise PCRE would detect that a match
is never possible and exit before doing any of the above.)
Live example: https://regex101.com/debugger/?regex=%5C%5B%28%5B%5E%5C%5D%5Cs%5C%3C%5D%2B%29%20%2A%28%5B%5E%5C%5D%5Cs%5D%2A%3F%29%5C%5D&testString=%5Bhttp%3A%2F%2Fexample.org%0A%5D
We can fix it by changing the lazy quantifier `*?` to the greedy `*`.
This is correct for this regexp only because the label isn't allowed
to contain ']' (otherwise, the first external link on the page would
consume all of the content until the last external link as its label).
This allows PCRE to only consider the cases where the label has the
maximum possible length:
| Target | Label |
| ------------------ | ----- |
| http://example.or | g |
| http://example.o | rg |
| http://example. | org |
| http://example | .org |
| http://exampl | e.org |
…and so on. Only 18 steps, or `N` steps in general.
Live example: https://regex101.com/debugger/?regex=%5C%5B%28%5B%5E%5C%5D%5Cs%5C%3C%5D%2B%29%20%2A%28%5B%5E%5C%5D%5Cs%5D%2A%29%5C%5D&testString=%5Bhttp%3A%2F%2Fexample.org%0A%5D
I think this bug has been present since 2004, when external link